IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit fd57a98d6f0c98fa295813087f13afb26c224e73 upstream.
When we have smack enabled, during the creation of a directory smack may
attempt to add a "smack transmute" xattr on the inode, which results in
the following warning and trace:
WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0
Modules linked in: nft_objref nf_conntrack_netbios_ns (...)
CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:start_transaction+0x489/0x4f0
Code: e9 be fc ff ff (...)
RSP: 0018:ffffc90001887d10 EFLAGS: 00010202
RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003
RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000
RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004
R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2
R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000
FS: 00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? slab_free_freelist_hook+0xea/0x1b0
? trace_hardirqs_on+0x1c/0xe0
btrfs_setxattr_trans+0x3c/0xf0
__vfs_setxattr+0x63/0x80
smack_d_instantiate+0x2d3/0x360
security_d_instantiate+0x29/0x40
d_instantiate_new+0x38/0x90
btrfs_mkdir+0x1cf/0x1e0
vfs_mkdir+0x14f/0x200
do_mkdirat+0x6d/0x110
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f673196ae6b
Code: 8b 05 11 (...)
RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d
RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000
R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0
irq event stamp: 11029
hardirqs last enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670
hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670
softirqs last enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
This happens because at btrfs_mkdir() we call d_instantiate_new() while
holding a transaction handle, which results in the following call chain:
btrfs_mkdir()
trans = btrfs_start_transaction(root, 5);
d_instantiate_new()
smack_d_instantiate()
__vfs_setxattr()
btrfs_setxattr_trans()
btrfs_start_transaction()
start_transaction()
WARN_ON()
--> a tansaction start has TRANS_EXTWRITERS
set in its type
h->orig_rsv = h->block_rsv
h->block_rsv = NULL
btrfs_end_transaction(trans)
Besides the warning triggered at start_transaction, we set the handle's
block_rsv to NULL which may cause some surprises later on.
So fix this by making btrfs_setxattr_trans() not start a transaction when
we already have a handle on one, stored in current->journal_info, and use
that handle. We are good to use the handle because at btrfs_mkdir() we did
reserve space for the xattr and the inode item.
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
CC: stable@vger.kernel.org # 5.4+
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4f6a49de64fd1b1dba5229c02047376da7cf24fd upstream.
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes: a7f8b1c2ac21 ("btrfs: file: reserve qgroup space after the hole punch range is locked")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0f9c03d824f6f522d3bc43629635c9765546ebc5 upstream.
Following commit f218ea6c4792 ("btrfs: delayed-inode: Remove wrong
qgroup meta reservation calls") this function now reserves num_bytes,
rather than the fixed amount of nodesize. As such this requires the
same amount to be freed in case of failure. Fix this by adjusting
the amount we are freeing.
Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5011c5a663b9c6d6aff3d394f11049b371199627 upstream.
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct. Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.
Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c17916510428dbccdf657de050c34e208347089 upstream.
During allocation the allocator will try to allocate an extent using
cluster policy. Once the current cluster is exhausted it will remove the
entry under btrfs_free_cluster::lock and subsequently acquire
btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
because there exists a race condition between removing the entry under
one lock and doing the necessary accounting holding a different lock
since extent freeing only uses the 2nd lock. This can result in the
following situation:
T1: T2:
btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock>
if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) {
rb_erase(entry)
spin_unlock(&cluster->lock);
(total_bitmaps is still 4) spin_lock(&cluster->lock);
<doesn't find entry in cluster->root>
spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds
<blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap>
recalculate_thresholds <crashes,
due to total_bitmaps
becoming 5 and triggering
an ASSERT>
To fix this ensure that once depleted, the cluster entry is deleted when
both cluster lock and tree locks are held in the allocator (T1), this
ensures that even if there is a race with a concurrent
insert_into_bitmap call it will correctly find the entry in the cluster
and add the new space to it.
CC: <stable@vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3660d0bcdb82807d434da9d2e57d88b37331182d upstream.
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.
The conditions for this to happen are the following:
1) We have a file with a size of, for example, 1280K;
2) There is a written (non-prealloc) extent for the file range from 1024K
to 1280K with a length of 256K;
3) This particular file extent layout is durably persisted, so that the
existing superblock persisted on disk points to a subvolume root where
the file has that exact file extent layout and state;
4) The file is truncated to a smaller size, to an offset lower than the
start offset of its last extent, for example to 800K. The truncate sets
the full sync runtime flag on the inode;
6) Fsync the file to log it and clear the full sync runtime flag;
7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
into the file with a destination offset that starts at or beyond the
256K file extent item we had - for example to offset 1024K;
8) Since the clone operation does not find extents in the source range,
we end up in the if branch at the bottom of btrfs_clone() where we
punch a hole for the file range starting at offset 1024K by calling
btrfs_replace_file_extents(). There we end up not setting the full
sync flag on the inode, because we don't know we are being called in
a clone context (and not fallocate's punch hole operation), and
neither do we create an extent map to represent a hole because the
requested range is beyond eof;
9) A further fsync to the file will be a fast fsync, since the clone
operation did not set the full sync flag, and therefore it relies on
modified extent maps to correctly log the file layout. But since
it does not find any extent map marking the range from 1024K (the
previous eof) to the new eof, it does not log a file extent item
for that range representing the hole;
10) After a power failure no hole for the range starting at 1024K is
punched and we end up exposing stale data from the old 256K extent.
Turning this into exact steps:
$ mkfs.btrfs -f -O no-holes /dev/sdi
$ mount /dev/sdi /mnt
# Create our test file with 3 extents of 256K and a 256K hole at offset
# 256K. The file has a size of 1280K.
$ xfs_io -f -s \
-c "pwrite -S 0xab -b 256K 0 256K" \
-c "pwrite -S 0xcd -b 256K 512K 256K" \
-c "pwrite -S 0xef -b 256K 768K 256K" \
-c "pwrite -S 0x73 -b 256K 1024K 256K" \
/mnt/sdi/foobar
# Make sure it's durably persisted. We want the last committed super
# block to point to this particular file extent layout.
sync
# Now truncate our file to a smaller size, falling within a position of
# the second extent. This sets the full sync runtime flag on the inode.
# Then fsync the file to log it and clear the full sync flag from the
# inode. The third extent is no longer part of the file and therefore
# it is not logged.
$ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
# Now do a clone operation that only clones the hole and sets back the
# file size to match the size it had before the truncate operation
# (1280K).
$ xfs_io \
-c "reflink /mnt/foobar 256K 1024K 256K" \
-c "fsync" \
/mnt/foobar
# File data before power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1310720
<power fail>
# Mount the fs again to replay the log tree.
$ mount /dev/sdi /mnt
# File data after power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
*
1310720
The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.
The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.
So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.
A test case for fstests will follow soon.
Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd0734f2a866f9d619d4abf97c3d71bcdee40ea9 upstream.
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 195a49eaf655eb914896c92cecd96bc863c9feb3 upstream.
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d70cef0d46729808dc53f145372c02b145c92604 upstream.
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping. So map the
page for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just 1 time.
Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 95c85fba1f64c3249c67f0078a29f8a125078189 upstream.
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4f4317c13a40194940acf4a71670179c4faca2b5 ]
While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack. This uncovered a problem in
commit_fs_roots, where if we get an error we would just break. However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.
This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system. Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.
With this patch my reproducer no longer corrupts the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 72c9925f87c8b74f36f8e75a4cd93d964538d3ca upstream.
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes: be20aa9dbadc8c ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 81e75ac74ecba929d1e922bf93f9fc467232e39f upstream.
My recent patch set "A variety of lock contention fixes", found here
https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
(Tracked in https://github.com/btrfs/linux/issues/86)
that reduce lock contention on the extent root by running delayed refs
less often resulted in a regression in generic/371. This test
fallocate()'s the fs until it's full, deletes all the files, and then
tries to fallocate() until full again.
Before these patches we would run all of the delayed refs during
flushing, and then would commit the transaction because we had plenty of
pinned space to recover in order to allocate. However my patches made
it so we weren't running the delayed refs as aggressively, which meant
that we appeared to have less pinned space when we were deciding to
commit the transaction.
We use the space_info->total_bytes_pinned to approximate how much space
we have pinned. It's approximate because if we remove a reference to an
extent we may free it, but there may be more references to it than we
know of at that point, but we account it as pinned at the creation time,
and then it's properly accounted when the delayed ref runs.
The way we account for pinned space is if the
delayed_ref_head->total_ref_mod is < 0, because that is clearly a
freeing option. However there is another case, and that is where
->total_ref_mod == 0 && ->must_insert_reserved == 1.
When we allocate a new extent, we have ->total_ref_mod == 1 and we have
->must_insert_reserved == 1. This is used to indicate that it is a
brand new extent and will need to have its extent entry added before we
modify any references on the delayed ref head. But if we subsequently
remove that extent reference, our ->total_ref_mod will be 0, and that
space will be pinned and freed. Accounting for this case properly
allows for generic/371 to pass with my delayed refs patches applied.
It's important to note that this problem exists without the referenced
patches, it just was uncovered by them.
CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2187374f35fe9cadbddaa9fcf0c4121365d914e8 upstream.
Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head. This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.
Fix this by explicitly adjusting the space_info->total_bytes_pinned in
the delayed refs code. We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents. This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse affect from this, we
might simply commit thinking it will give us enough space when it might
not.
CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 938fcbfb0cbcf532a1869efab58e6009446b1ced upstream.
While doing error injection testing with my relocation patches I hit the
following assert:
assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3357!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 PID: 24351 Comm: umount Tainted: G W 5.10.0-rc3+ #193
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a
RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
FS: 00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
Call Trace:
btrfs_free_block_groups.cold+0x55/0x55
close_ctree+0x2c5/0x306
? fsnotify_destroy_marks+0x14/0x100
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1b1/0x1d0
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups. When we run the dirty block groups, we
splice the list onto a local list to process. However if an error
occurs, we only cleanup the transactions dirty block group list, not any
pending block groups we have on our locally spliced list.
In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.
Fix this by splicing the list back onto the transaction dirty block
group list so that the block groups are cleaned up. Then add a 'out'
label and have the error conditions jump to out so that the errors are
handled properly. This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c78a10aebb275c38d0cfccae129a803fe622e305 upstream.
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later. The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.
This exit condition simply isn't correct really. During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly. During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.
Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately. Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 867ed321f90d06aaba84e2c91de51cd3038825ef upstream.
While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing. However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:
btrfs_inc_extent_ref
btrfs_copy_root
create_reloc_root
btrfs_init_reloc_root
btrfs_record_root_in_trans
btrfs_start_transaction
btrfs_update_inode
btrfs_update_time
touch_atime
file_accessed
btrfs_file_mmap
This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem. Fix this by aborting the transaction if we fail
to do the reference modification.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eddda68d97732ce05ca145f8e85e8a447f65cdad upstream.
A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code. We have two
methods of freeing a backref node
- btrfs_backref_free_node: this just is kfree() essentially.
- btrfs_backref_drop_node: this actually unlinks the node and cleans up
everything and then calls btrfs_backref_free_node().
We should mostly be using btrfs_backref_drop_node(), to make sure the
node is properly unlinked from the backref cache, and only use
btrfs_backref_free_node() when we know the node isn't actually linked to
the backref cache. We made a mistake here and thus got the KASAN splat.
Make this style of issue easier to find by adding some ASSERT()'s to
btrfs_backref_free_node() and adjusting our deletion stuff to properly
init the list so we can rely on list_empty() checks working properly.
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff888112402900
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 80 bytes inside of
128-byte region [ffff888112402900, ffff888112402980)
The buggy address belongs to the page:
page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
flags: 0x17ffe0000000200(slab)
raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f78743fbdae1bb31bc9c9233c3590a5048782381 upstream.
The backref code is looking for a reloc_root that corresponds to the
given fs root. However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item. This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.
Because of this we do not want to WARN_ON(). This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root. However we can easily hit this case with error injection, and
thus should not do a WARN_ON().
CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e2a870a599d4699a626ec26430c7a1ab14a2a49 upstream.
Zygo reported the following panic when testing my error handling patches
for relocation:
kernel BUG at fs/btrfs/backref.c:2545!
invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G W 14
Hardware name: QEMU Standard PC (i440FX + PIIX,
Call Trace:
btrfs_backref_error_cleanup+0x4df/0x530
build_backref_tree+0x1a5/0x700
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
? free_extent_buffer.part.52+0xd7/0x140
relocate_tree_blocks+0x2a6/0xb60
? kasan_unpoison_shadow+0x35/0x50
? do_relocation+0xc10/0xc10
? kasan_kmalloc+0x9/0x10
? kmem_cache_alloc_trace+0x6a3/0xcb0
? free_extent_buffer.part.52+0xd7/0x140
? rb_insert_color+0x342/0x360
? add_tree_block.isra.36+0x236/0x2b0
relocate_block_group+0x2eb/0x780
? merge_reloc_roots+0x470/0x470
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x18f0
? pvclock_clocksource_read+0xeb/0x190
? btrfs_relocate_chunk+0x120/0x120
? lock_contended+0x620/0x6e0
? do_raw_spin_lock+0x1e0/0x1e0
? do_raw_spin_unlock+0xa8/0x140
btrfs_ioctl_balance+0x1f9/0x460
btrfs_ioctl+0x24c8/0x4380
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurs because of this check
if (RB_EMPTY_NODE(&upper->rb_node))
BUG_ON(!list_empty(&node->upper));
As we are dropping the backref node, if we discover that our upper node
in the edge we just cleaned up isn't linked into the cache that we are
now done with this node, thus the BUG_ON().
However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges. All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed. We could very well have many
edges still left to cleanup on this node.
The fact is we simply do not need this check, we can just process all of
the edges only for this node, because below this check we do the
following
if (list_empty(&upper->lower)) {
list_add_tail(&upper->lower, &cache->leaves);
upper->lowest = 1;
}
If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later. If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.
Fix this problem by dropping this logic altogether. With this fix I no
longer see the panic when testing with error injection in the backref
code.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 951c80f83d61bd4b21794c8aba829c3c1a45c2d0 ]
Commit dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could
span across a page") make btrfs_invalidapage() to search all ordered
extents.
The offending code looks like this:
again:
start = page_start;
ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
if (ordred) {
end = min(page_end,
ordered->file_offset + ordered->num_bytes - 1);
/* Do the cleanup */
start = end + 1;
if (start < page_end)
goto again;
}
The behavior is indeed necessary for the incoming subpage support, but
when it iterates through all the ordered extents, it also resets the
search range @start.
This means, for the following cases, we can double account the ordered
extents, causing its bytes_left underflow:
Page offset
0 16K 32K
|<--- OE 1 --->|<--- OE 2 ---->|
As the first iteration will find ordered extent (OE) 1, which doesn't
cover the full page, thus after cleanup code, we need to retry again.
But again label will reset start to page_start, and we got OE 1 again,
which causes double accounting on OE 1, and cause OE 1's byte_left to
underflow.
This problem can only happen for subpage case, as for regular sectorsize
== PAGE_SIZE case, we will always find a OE ends at or after page end,
thus no way to trigger the problem.
Move the again label after start = page_start. There will be more
comprehensive rework to convert the open coded loop to a proper while
loop for subpage support.
Fixes: dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span across a page")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cc64e7ebfb0d7faaba2438334c43466955a96e8 ]
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.
Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
end up triggering an assertion and crashing. Example reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
# Do a direct IO write with O_DSYNC into a non-aligned range...
xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
umount $MNT
When running the reproducer an assertion fails and produces the following
trace:
[ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
[ 2418.403745] ------------[ cut here ]------------
[ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
[ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G D 5.10.15-btrfs-next-87 #1
[ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
[ 2418.407835] Code: e6 48 c7 (...)
[ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
[ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
[ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
[ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
[ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
[ 2418.412713] FS: 00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
[ 2418.413326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
[ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2418.415669] Call Trace:
[ 2418.416254] btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
[ 2418.416812] btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
[ 2418.417380] btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
[ 2418.418315] btrfs_file_write_iter+0x2a9/0x770 [btrfs]
[ 2418.418920] new_sync_write+0x11f/0x1c0
[ 2418.419430] vfs_write+0x2bb/0x3b0
[ 2418.419972] __x64_sys_pwrite64+0x90/0xc0
[ 2418.420486] do_syscall_64+0x33/0x80
[ 2418.420979] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2418.421486] RIP: 0033:0x7f54fda0b986
[ 2418.421981] Code: 48 c7 c0 (...)
[ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
[ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
[ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
[ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
[ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
[ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
[ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
1) At btrfs_file_write_iter() we set current->journal_info to
BTRFS_DIO_SYNC_STUB;
2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
3) We can't do the direct IO write because it starts at a non-aligned
offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
check_direct_IO() which does the alignment check), but we leave
current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
at btrfs_dio_iomap_begin(), because we assume we always get there;
4) Then at __btrfs_direct_write() we see that the attempt to do the
direct IO write was not successful, 0 bytes written, so we fallback
to a buffered write by calling btrfs_buffered_write();
5) There we call btrfs_check_data_free_space() which in turn calls
btrfs_alloc_data_chunk_ondemand() and that calls
btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
enum btrfs_reserve_flush_enum flush)
{
struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
int ret;
ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
(...)
So fix that by setting the journal to NULL whenever check_direct_IO()
returns a failure.
This bug only affects 5.10 kernels, and the regression was introduced in
5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
patchset that went into the merge window for 5.11. So this is a fix only
for 5.10.x stable kernels, as there are people hitting this bug.
Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
CC: stable@vger.kernel.org # 5.10 (and only 5.10)
Acked-by: David Sterba <dsterba@suse.com>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's a mistake in backport of upstream commit 2175bf57dc95 ("btrfs:
fix possible free space tree corruption with online conversion") as
5.10.13 commit 2175bf57dc95.
The enum value BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED has been added to the
wrong enum set, colliding with value of BTRFS_FS_QUOTA_ENABLE. This
could cause problems during the tree conversion, where the quotas
wouldn't be set up properly but the related code executed anyway due to
the bit set.
Link: https://lore.kernel.org/linux-btrfs/20210219111741.95DD.409509F4@e16-tech.com
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
CC: stable@vger.kernel.org # 5.10.13+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2f96e40212d435b328459ba6b3956395eed8fa9f upstream.
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c41ec4529d3448df8998950d7bada757a1b321cf upstream.
This effectively reverts commit d5c8238849e7 ("btrfs: convert
data_seqcount to seqcount_mutex_t").
While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):
[66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5
[66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
[66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
[66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
[66.441485] Call Trace:
[66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs]
[66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
[66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs]
[66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441619] ? __this_cpu_preempt_check+0xf/0x11
[66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs]
[66.441700] ? __lock_acquire+0x35f/0x2650
[66.441717] ? lockdep_hardirqs_on+0x87/0x120
[66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
[66.441724] ? call_rcu+0x2d3/0x530
[66.441731] ? __might_fault+0x41/0x90
[66.441736] ? kvm_sched_clock_read+0x15/0x50
[66.441740] ? sched_clock+0x8/0x10
[66.441745] ? sched_clock_cpu+0x13/0x180
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441768] __ia32_sys_ioctl+0x165/0x8a0
[66.441773] ? __this_cpu_preempt_check+0xf/0x11
[66.441785] ? __might_fault+0x89/0x90
[66.441791] __do_fast_syscall_32+0x54/0x80
[66.441796] do_fast_syscall_32+0x32/0x70
[66.441801] do_SYSENTER_32+0x15/0x20
[66.441805] entry_SYSENTER_32+0x9f/0xf2
[66.441808] EIP: 0xab7b5549
[66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
[66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
[66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[66.441833] irq event stamp: 42579
[66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
[66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
[66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
[66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
========================================================================
btrfs_remove_chunk+0x58b/0x7b0:
__seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
(inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
(inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
========================================================================
The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.
After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init(). In this scenario, its second
parameter info->chunk_mutex is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.
The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.
Commit d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 71008734d27f2276fcef23a5e546d358430f2d52 ]
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid. Fix this helper to take the key
so we have access to the offset when we need it.
Fixes: 457f1864b569 ("btrfs: pretty print leaked root name")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 518837e65068c385dddc0a87b3e577c8be7c13b1 upstream.
When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5f4 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).
However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.
Example reproducer:
$ cat test-send-clone.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
# Create our test file with a single and large extent (1M) and with
# different content for different file ranges that will be reflinked
# later.
xfs_io -f \
-c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 128K 128K" \
-c "pwrite -S 0xef 256K 256K" \
-c "pwrite -S 0x1a 512K 512K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Now do a series of changes to our file such that we end up with
# different parts of the extent reflinked into different file offsets
# and we overwrite a large part of the extent too, so no file extent
# items refer to that part that was overwritten. This used to confuse
# the algorithm used by the kernel to figure out which file ranges to
# clone, making it attempt to clone from a source range starting at
# the current eof of the file, resulting in the receiver to fail since
# it is an invalid clone operation.
#
xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
-c "reflink $MNT/foobar 0K 512K 256K" \
-c "reflink $MNT/foobar 512K 128K 256K" \
-c "pwrite -S 0x73 384K 640K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
echo -e "\nFile digest in the original filesystem:"
md5sum $MNT/snap2/foobar
# Now unmount the filesystem, create a new one, mount it and try to
# apply both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT
# Must match what we got in the original filesystem of course.
echo -e "\nFile digest in the new filesystem:"
md5sum $MNT/snap2/foobar
umount $MNT
When running the reproducer, the incremental send operation fails due to
an invalid clone operation:
$ ./test-send-clone.sh
wrote 131072/131072 bytes at offset 0
128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
wrote 131072/131072 bytes at offset 131072
128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
wrote 524288/524288 bytes at offset 524288
512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
linked 983040/983040 bytes at offset 1048576
960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
linked 262144/262144 bytes at offset 524288
256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
linked 262144/262144 bytes at offset 131072
256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
wrote 655360/655360 bytes at offset 393216
640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
File digest in the original filesystem:
9c13c61cb0b9f5abf45344375cb04dfa /mnt/sdi/snap2/foobar
At subvol snap1
At snapshot snap2
ERROR: failed to clone extents to foobar: Invalid argument
File digest in the new filesystem:
132f0396da8f48d2e667196bff882cfc /mnt/sdi/snap2/foobar
The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.
For the example above, what happens is the following:
1) When processing the extent at file offset 1M, the algorithm checks that
the extent is shared and can be (fully or partially) found at file
offset 0.
At this point the file has a size (and eof) of 1M at the receiver;
2) It finds that our extent item at file offset 1M has a data offset of
64K and, since the file extent item at file offset 0 has a data offset
of 0, it issues a clone operation, from the same file and root, that
has a source range offset of 64K, destination offset of 1M and a length
of 64K, since the extent item at file offset 0 refers only to the first
128K of the shared extent.
After this clone operation, the file size (and eof) at the receiver is
increased from 1M to 1088K (1M + 64K);
3) Now there's still 896K (960K - 64K) of data left to clone or write, so
it checks for the next file extent item, which starts at file offset
128K. This file extent item has a data offset of 0 and a length of
256K, so a clone operation with a source range offset of 256K, a
destination offset of 1088K (1M + 64K) and length of 128K is issued.
After this operation the file size (and eof) at the receiver increases
from 1088K to 1216K (1088K + 128K);
4) Now there's still 768K (896K - 128K) of data left to clone or write, so
it checks for the next file extent item, located at file offset 384K.
This file extent item points to a different extent, not the one we want
to clone, with a length of 640K. So we issue a write operation into the
file range 1216K (1088K + 128K, end of the last clone operation), with
a length of 640K and with a data matching the one we can find for that
range in send root.
After this operation, the file size (and eof) at the receiver increases
from 1216K to 1856K (1216K + 640K);
5) Now there's still 128K (768K - 640K) of data left to clone or write, so
we look into the file extent item, which is for file offset 1M and it
points to the extent we want to clone, with a data offset of 64K and a
length of 960K.
However this matches the file offset we started with, the start of the
range to clone into. So we can't for sure find any file extent item
from here onwards with the rest of the data we want to clone, yet we
proceed and since the file extent item points to the shared extent,
with a data offset of 64K, we issue a clone operation with a source
range starting at file offset 1856K, which matches the file extent
item's offset, 1M, plus the amount of data cloned and written so far,
which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
operation is invalid since the source range offset matches the current
eof of the file in the receiver. We should have stopped looking for
extents to clone at this point and instead fallback to write, which
would simply the contain the data in the file range from 1856K to
1856K + 128K.
So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.
A test case for fstests will follow soon.
Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c113e ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 34d1eb0e599875064955a74712f08ff14c8e3d5f upstream.
If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly. Fix this by only running the delayed refs
if there was no failure.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 49ecc679ab48b40ca799bf94b327d5284eac9e46 upstream.
Zygo reported the following KASAN splat:
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache(). This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up. However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache. Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.
Fixes: 75bfb9aff45e ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18d3bff411c8d46d40537483bdc0b61b33ce0371 upstream.
This was partially fixed by f3e3d9cc3525 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction. The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.
Fixes: f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]
btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
ktime.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]
Might happen that bg->discard_eligible_time was changed without
rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
time, peek_discard_list() returns NULL, and all work halts and goes to
sleep without further rescheduling even there are block groups to
discard.
It happens pretty often, but not so visible from the userspace because
after some time it usually will be kicked off anyway by someone else
calling btrfs_discard_reschedule_work().
Fix it by continue rescheduling if block group discard lists are not
empty.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.
Pass tree->fs_info as it's known to be valid in all cases.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.
[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.
Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.
The dmesg would look like:
[438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
[438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
[450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
[463.501781] BTRFS info (device dm-10): balance: ended with status: -2
[CAUSE]
The ENOENT error is returned from the following call chain:
add_data_references()
|- delete_v1_space_cache();
|- if (!found)
return -ENOENT;
The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.
With extra debugging, the offending tree block looks like this:
leaf bytenr = 42676709441536, data_bytenr = 34626327621632
ctime 1567904822.739884119 (2019-09-08 03:07:02)
mtime 0.0 (1970-01-01 01:00:00)
otime 0.0 (1970-01-01 01:00:00)
item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
generation 1517381 type 2 (prealloc)
prealloc data disk byte 34626327621632 nr 262144 <<<
prealloc data offset 0 nr 262144
item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
uuid d0d4361f-d231-6d40-8901-fe506e4b2b53
Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.
[FIX]
The code is modified in commit 19b546d7a1b2 ("btrfs: relocation: Use
btrfs_find_all_leafs to locate data extent parent tree leaves"), before
that commit, we use something like
"if (type == BTRFS_FILE_EXTENT_INLINE) continue;"
But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
ignoring BTRFS_FILE_EXTENT_PREALLOC.
Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.
Reported-by: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: stable@vger.kernel.org # 5.6+
Tested-By: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]
Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot. However this introduced a pretty serious
performance regression. To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.
The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush. The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.
We cannot revert this patch unfortunately because of 3d45f221ce62
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.
Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us. This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe. I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)
5.9 0m54.258s
5.10 1m26.212s
5.10+patch 0m38.800s
5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.
Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:
5.10.5 4m00s
5.10.5+patch 1m08s
5.11-rc2 5m14s
5.11-rc2+patch 1m30s
Reported-by: René Rebe <rene@exactcode.de>
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add my test results ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]
Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.
The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.
This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):
$ cat test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 200 40
umount $MNT
The results before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5761605 0.172 312.057
Close 4232452 0.002 10.927
Rename 243937 1.406 277.344
Unlink 1163456 0.631 298.402
Deltree 160 11.581 221.107
Mkdir 80 0.003 0.005
Qpathinfo 5221410 0.065 122.309
Qfileinfo 915432 0.001 3.333
Qfsinfo 957555 0.003 3.992
Sfileinfo 469244 0.023 20.494
Find 2018865 0.448 123.659
WriteX 2874851 0.049 118.529
ReadX 9030579 0.004 21.654
LockX 18754 0.003 4.423
UnlockX 18754 0.002 0.331
Flush 403792 10.944 359.494
Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms
The results after this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6442521 0.159 230.693
Close 4732357 0.002 10.972
Rename 272809 1.293 227.398
Unlink 1301059 0.563 218.500
Deltree 160 7.796 54.887
Mkdir 80 0.008 0.478
Qpathinfo 5839452 0.047 124.330
Qfileinfo 1023199 0.001 4.996
Qfsinfo 1070760 0.003 5.709
Sfileinfo 524790 0.033 21.765
Find 2257658 0.314 125.611
WriteX 3211520 0.040 232.135
ReadX 10098969 0.004 25.340
LockX 20974 0.003 1.569
UnlockX 20974 0.002 3.475
Flush 451553 10.287 331.037
Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms
+10.8% throughput, -8.2% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.
When doing an incremental send, if we have a new inode that happens to
have the same number that an old directory inode had in the base snapshot
and that old directory has a pending rmdir operation, we end up computing
a wrong path for the new inode, causing the receiver to fail.
Example reproducer:
$ cat test-send-rmdir.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
mkdir $MNT/dir
touch $MNT/dir/file1
touch $MNT/dir/file2
touch $MNT/dir/file3
# Filesystem looks like:
#
# . (ino 256)
# |----- dir/ (ino 257)
# |----- file1 (ino 258)
# |----- file2 (ino 259)
# |----- file3 (ino 260)
#
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Now remove our directory and all its files.
rm -fr $MNT/dir
# Unmount the filesystem and mount it again. This is to ensure that
# the next inode that is created ends up with the same inode number
# that our directory "dir" had, 257, which is the first free "objectid"
# available after mounting again the filesystem.
umount $MNT
mount $DEV $MNT
# Now create a new file (it could be a directory as well).
touch $MNT/newfile
# Filesystem now looks like:
#
# . (ino 256)
# |----- newfile (ino 257)
#
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
# Now unmount the filesystem, create a new one, mount it and try to apply
# both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT
umount $MNT
When running the test, the receive operation for the incremental stream
fails:
$ ./test-send-rmdir.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: chown o257-9-0 failed: No such file or directory
So fix this by tracking directories that have a pending rmdir by inode
number and generation number, instead of only inode number.
A test case for fstests follows soon.
Reported-by: Massimo B. <massimo.b@gmx.net>
Tested-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ae5e070eaca9dbebde3459dd8f4c2756f8c097d0 upstream.
There is a chance of racing for qgroup flushing which may lead to
deadlock:
Thread A | Thread B
(not holding trans handle) | (holding a trans handle)
--------------------------------+--------------------------------
__btrfs_qgroup_reserve_meta() | __btrfs_qgroup_reserve_meta()
|- try_flush_qgroup() | |- try_flush_qgroup()
|- QGROUP_FLUSHING bit set | |
| | |- test_and_set_bit()
| | |- wait_event()
|- btrfs_join_transaction() |
|- btrfs_commit_transaction()|
!!! DEAD LOCK !!!
Since thread A wants to commit transaction, but thread B is holding a
transaction handle, blocking the commit.
At the same time, thread B is waiting for thread A to finish its commit.
This is just a hot fix, and would lead to more EDQUOT when we're near
the qgroup limit.
The proper fix would be to make all metadata/data reservations happen
without holding a transaction handle.
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).
So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.
I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario
commit root
+----+----+----+----+----+----+----+
|\\\\| |\\\\|\\\\| |\\\\|\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
new commit root
+----+----+----+----+----+----+----+
| | | |\\\\| | |\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.
The race can happen like:
1) The caching thread was running using the old commit root when it
found the extent for [2, 3);
2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);
3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).
So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).
In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).
Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.
While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space. This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage. This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.
Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/BGBQACgkQxWXV+ddt
WDtRYQ/+IUGjJ4l6MyL3PwLgTVabKsSpm2R3y3M/0tVJ0FIhDXYbkpjB2CkpIdcH
jUvHnRL4H59hwG6rwlXU2oC298FNbbrLGIU+c9DR50RuyQCDGnT02XvxwfDIEFzp
WLNZ/CAhRkm6boj//70lV26BpeXT59KMYwNixfCNTXq9Ir0qYHHCGg4cEBQLS++2
JUU8XVTLURIYiFOLbwmABI7V43OgDhdORIr+qnR8xjCUyhusZsjVVbvIdW3BDi/S
wK7NJsfuqgsF0zD9URJjpwTFiJL9SvBLWR8JnM9NiLW3ZbkGBL+efL6mdWuH7534
gruJRS2zYPMO2/Kpjy/31CWLap3PUSD3i8cKF+uo3liojWuSUhN8kfvcNxJVd5se
NkEK+4zOjsDIVbv7gcjThSv4KTnOUO/XfN9TWUMuduaMBmGQNaQut1FpGV98utiK
yW6x8xqcR4SI+lqY6ILqCK+qUHf19BLSsuyzZdTIontKKRA9F9hY8a4XTZuzTWml
BGYmFGP640vOo8C9GjrQfpAwa7CB/DnF/cg1AAmuZ8vrEm9zYjmauFKK8ZPcveA3
KGrnmIlYjhAIX16oRbfwOgj9D2xa1loBzJyHQHByvCMXGVFBnqRTRANRHFrdQWJB
qh9+J4EJcUXPE9WGHxAW/g9vpFkV7IRABHs7aUB8zApxI9nGA0Q=
=kcxn
-----END PGP SIGNATURE-----
Merge tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for various warnings that accumulated over past two weeks:
- tree-checker: add missing return values for some errors
- lockdep fixes
- when reading qgroup config and starting quota rescan
- reverse order of quota ioctl lock and VFS freeze lock
- avoid accessing potentially stale fs info during device scan,
reported by syzbot
- add scope NOFS protection around qgroup relation changes
- check for running transaction before flushing qgroups
- fix tracking of new delalloc ranges for some cases"
* tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lockdep splat when enabling and disabling qgroups
btrfs: do nofs allocations when adding and removing qgroup relations
btrfs: fix lockdep splat when reading qgroup config on mount
btrfs: tree-checker: add missing returns after data_ref alignment checks
btrfs: don't access possibly stale fs_info data for printing duplicate device
btrfs: tree-checker: add missing return after error in root_item
btrfs: qgroup: don't commit transaction when we already hold the handle
btrfs: fix missing delalloc new bit for new delalloc ranges
When adding or removing a qgroup relation we are doing a GFP_KERNEL
allocation which is not safe because we are holding a transaction
handle open and that can make us deadlock if the allocator needs to
recurse into the filesystem. So just surround those calls with a
nofs context.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>