8547 Commits

Author SHA1 Message Date
Filipe Manana
dca4f29507 btrfs: fix extent buffer leak on failure to copy root
commit 72c9925f87c8b74f36f8e75a4cd93d964538d3ca upstream.

At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().

So fix that by unlocking the extent buffer and dropping our reference on
it before returning.

Fixes: be20aa9dbadc8c ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:44 +01:00
Josef Bacik
df369c3afb btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
commit 938fcbfb0cbcf532a1869efab58e6009446b1ced upstream.

While doing error injection testing with my relocation patches I hit the
following assert:

  assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3357!
  invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a
  RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
  RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
  RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
  RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
  R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
  R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
  FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
  Call Trace:
   btrfs_free_block_groups.cold+0x55/0x55
   close_ctree+0x2c5/0x306
   ? fsnotify_destroy_marks+0x14/0x100
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20
   deactivate_locked_super+0x36/0xa0
   cleanup_mnt+0x12d/0x190
   task_work_run+0x5c/0xa0
   exit_to_user_mode_prepare+0x1b1/0x1d0
   syscall_exit_to_user_mode+0x54/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups.  When we run the dirty block groups, we
splice the list onto a local list to process.  However if an error
occurs, we only cleanup the transactions dirty block group list, not any
pending block groups we have on our locally spliced list.

In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.

Fix this by splicing the list back onto the transaction dirty block
group list so that the block groups are cleaned up.  Then add a 'out'
label and have the error conditions jump to out so that the errors are
handled properly.  This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:44 +01:00
Josef Bacik
dc0780e456 btrfs: fix reloc root leak with 0 ref reloc roots on recovery
commit c78a10aebb275c38d0cfccae129a803fe622e305 upstream.

When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later.  The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.

This exit condition simply isn't correct really.  During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly.  During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.

Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately.  Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:44 +01:00
Josef Bacik
c0baf3aaf4 btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
commit 867ed321f90d06aaba84e2c91de51cd3038825ef upstream.

While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing.  However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:

btrfs_inc_extent_ref
  btrfs_copy_root
    create_reloc_root
      btrfs_init_reloc_root
	btrfs_record_root_in_trans
	  btrfs_start_transaction
	    btrfs_update_inode
	      btrfs_update_time
		touch_atime
		  file_accessed
		    btrfs_file_mmap

This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem.  Fix this by aborting the transaction if we fail
to do the reference modification.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:44 +01:00
Zhihao Cheng
6f651ec266 btrfs: clarify error returns values in __load_free_space_cache
[ Upstream commit 3cc64e7ebfb0d7faaba2438334c43466955a96e8 ]

Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.

Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 10:26:24 +01:00
David Sterba
38d777aaf2 btrfs: fix backport of 2175bf57dc952 in 5.4.95
There's a mistake in backport of upstream commit 2175bf57dc95 ("btrfs:
fix possible free space tree corruption with online conversion") as
5.4.95 commit e1ae9aab8029.

The enum value BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED has been added to the
wrong enum set, colliding with value of BTRFS_FS_QUOTA_ENABLE. This
could cause problems during the tree conversion, where the quotas
wouldn't be set up properly but the related code executed anyway due to
the bit set.

Link: https://lore.kernel.org/linux-btrfs/20210219111741.95DD.409509F4@e16-tech.com
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
CC: stable@vger.kernel.org # 5.4.95+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23 15:02:26 +01:00
ethanwu
27afc71283 btrfs: backref, use correct count to resolve normal data refs
commit b25b0b871f206936d5bca02b80d38c05623e27da upstream.

With the following patches:

- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root

we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.

Background and how the patches fit together:

Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).

When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.

For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
  extent refs 24 gen 7302 flags DATA
  shared data backref parent 394985472 count 10 #1
  extent data backref root 257 objectid 260 offset 1048576 count 3 #2
  extent data backref root 256 objectid 260 offset 65536 count 6 #3
  extent data backref root 257 objectid 260 offset 65536 count 5 #4

For example, when searching backref entry #4, we'll use total_refs
24, a very loose loop ending condition, instead of total_refs = 5.

But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from specific root.  As a result, the loop keeps on going
until we reach the end of that inode.

The first 3 patches, handle 3 different types refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.

This patch changes the total_refs to correct number so that we could
end loop as soon as we find all the refs we want.

btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:

 $ btrfs subvolume create /sub1
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot /sub1 /sub2
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot -r /sub1 /snap1
 $ time btrfs send /snap1 | btrfs receive /volume2

Without this patch:

real 69m48.124s
user 0m50.199s
sys  70m15.600s

With this patch:

real    1m59.683s
user    0m35.421s
sys     2m42.684s

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07 15:35:47 +01:00
ethanwu
66bcf5f6f9 btrfs: backref, only search backref entries from leaves of the same root
commit cfc0eed0ec89db7c4a8d461174cabfaa4a0912c7 upstream.

We could have some nodes/leaves in subvolume whose owner are not the
that subvolume. In this way, when we resolve normal backrefs of that
subvolume, we should avoid collecting those references from these blocks.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07 15:35:47 +01:00
ethanwu
c3089b06d6 btrfs: backref, don't add refs from shared block when resolving normal backref
commit ed58f2e66e849c34826083e5a6c1b506ee8a4d8e upstream.

All references from the block of SHARED_DATA_REF belong to that shared
block backref.

For example:

  item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
      extent refs 24 gen 7302 flags DATA
      extent data backref root 257 objectid 260 offset 65536 count 5
      extent data backref root 258 objectid 265 offset 0 count 9
      shared data backref parent 394985472 count 10

Block 394985472 might be leaf from root 257, and the item obejctid and
(file_pos - file_extent_item::offset) in that leaf just happens to be
260 and 65536 which is equal to the first extent data backref entry.

Before this patch, when we resolve backref:

  root 257 objectid 260 offset 65536

we will add those refs in block 394985472 and wrongly treat those as the
refs we want.

Fix this by checking if the leaf we are processing is shared data
backref, if so, just skip this leaf.

Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.

Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07 15:35:47 +01:00
ethanwu
21a0c97fb2 btrfs: backref, only collect file extent items matching backref offset
commit 7ac8b88ee668a5b4743ebf3e9888fabac85c334a upstream.

When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.

This patch adds additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07 15:35:47 +01:00
Josef Bacik
e1ae9aab80 btrfs: fix possible free space tree corruption with online conversion
commit 2f96e40212d435b328459ba6b3956395eed8fa9f upstream.

While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups.  While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:25:57 +01:00
Filipe Manana
d8a487e673 btrfs: send: fix invalid clone operations when cloning from the same file and root
commit 518837e65068c385dddc0a87b3e577c8be7c13b1 upstream.

When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5f4 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.

Example reproducer:

  $ cat test-send-clone.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  # Create our test file with a single and large extent (1M) and with
  # different content for different file ranges that will be reflinked
  # later.
  xfs_io -f \
         -c "pwrite -S 0xab 0 128K" \
         -c "pwrite -S 0xcd 128K 128K" \
         -c "pwrite -S 0xef 256K 256K" \
         -c "pwrite -S 0x1a 512K 512K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now do a series of changes to our file such that we end up with
  # different parts of the extent reflinked into different file offsets
  # and we overwrite a large part of the extent too, so no file extent
  # items refer to that part that was overwritten. This used to confuse
  # the algorithm used by the kernel to figure out which file ranges to
  # clone, making it attempt to clone from a source range starting at
  # the current eof of the file, resulting in the receiver to fail since
  # it is an invalid clone operation.
  #
  xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
         -c "reflink $MNT/foobar 0K 512K 256K" \
         -c "reflink $MNT/foobar 512K 128K 256K" \
         -c "pwrite -S 0x73 384K 640K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  echo -e "\nFile digest in the original filesystem:"
  md5sum $MNT/snap2/foobar

  # Now unmount the filesystem, create a new one, mount it and try to
  # apply both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  # Must match what we got in the original filesystem of course.
  echo -e "\nFile digest in the new filesystem:"
  md5sum $MNT/snap2/foobar

  umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

  $ ./test-send-clone.sh
  wrote 131072/131072 bytes at offset 0
  128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
  wrote 131072/131072 bytes at offset 131072
  128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
  wrote 262144/262144 bytes at offset 262144
  256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
  wrote 524288/524288 bytes at offset 524288
  512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  linked 983040/983040 bytes at offset 1048576
  960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
  linked 262144/262144 bytes at offset 524288
  256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
  linked 262144/262144 bytes at offset 131072
  256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
  wrote 655360/655360 bytes at offset 393216
  640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2

  File digest in the original filesystem:
  9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to foobar: Invalid argument

  File digest in the new filesystem:
  132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
   the extent is shared and can be (fully or partially) found at file
   offset 0.

   At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
   64K and, since the file extent item at file offset 0 has a data offset
   of 0, it issues a clone operation, from the same file and root, that
   has a source range offset of 64K, destination offset of 1M and a length
   of 64K, since the extent item at file offset 0 refers only to the first
   128K of the shared extent.

   After this clone operation, the file size (and eof) at the receiver is
   increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
   it checks for the next file extent item, which starts at file offset
   128K. This file extent item has a data offset of 0 and a length of
   256K, so a clone operation with a source range offset of 256K, a
   destination offset of 1088K (1M + 64K) and length of 128K is issued.

   After this operation the file size (and eof) at the receiver increases
   from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
   it checks for the next file extent item, located at file offset 384K.
   This file extent item points to a different extent, not the one we want
   to clone, with a length of 640K. So we issue a write operation into the
   file range 1216K (1088K + 128K, end of the last clone operation), with
   a length of 640K and with a data matching the one we can find for that
   range in send root.

   After this operation, the file size (and eof) at the receiver increases
   from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
   we look into the file extent item, which is for file offset 1M and it
   points to the extent we want to clone, with a data offset of 64K and a
   length of 960K.

   However this matches the file offset we started with, the start of the
   range to clone into. So we can't for sure find any file extent item
   from here onwards with the rest of the data we want to clone, yet we
   proceed and since the file extent item points to the shared extent,
   with a data offset of 64K, we issue a clone operation with a source
   range starting at file offset 1856K, which matches the file extent
   item's offset, 1M, plus the amount of data cloned and written so far,
   which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
   operation is invalid since the source range offset matches the current
   eof of the file in the receiver. We should have stopped looking for
   extents to clone at this point and instead fallback to write, which
   would simply the contain the data in the file range from 1856K to
   1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c113e ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:47:41 +01:00
Josef Bacik
4d1cf8eeda btrfs: don't clear ret in btrfs_start_dirty_block_groups
commit 34d1eb0e599875064955a74712f08ff14c8e3d5f upstream.

If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly.  Fix this by only running the delayed refs
if there was no failure.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:47:40 +01:00
Josef Bacik
e1065331b7 btrfs: fix lockdep splat in btrfs_recover_relocation
commit fb286100974e7239af243bc2255a52f29442f9c8 upstream.

While testing the error paths of relocation I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc6+ #217 Not tainted
  ------------------------------------------------------
  mount/779 is trying to acquire lock:
  ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340

  but task is already holding lock:
  ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-root-00){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x462/0x8f0
	 btrfs_update_root+0x55/0x2b0
	 btrfs_drop_snapshot+0x398/0x750
	 clean_dirty_subvols+0xdf/0x120
	 btrfs_recover_relocation+0x534/0x5a0
	 btrfs_start_pre_rw_mount+0xcb/0x170
	 open_ctree+0x151f/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (sb_internal#2){.+.+}-{0:0}:
	 start_transaction+0x444/0x700
	 insert_balance_item.isra.0+0x37/0x320
	 btrfs_balance+0x354/0xf40
	 btrfs_ioctl_balance+0x2cf/0x380
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1120/0x1e10
	 lock_acquire+0x116/0x370
	 __mutex_lock+0x7e/0x7b0
	 btrfs_recover_balance+0x2f0/0x340
	 open_ctree+0x1095/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(sb_internal#2);
				 lock(btrfs-root-00);
    lock(&fs_info->balance_mutex);

   *** DEADLOCK ***

  2 locks held by mount/779:
   #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  stack backtrace:
  CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1120/0x1e10
   lock_acquire+0x116/0x370
   ? btrfs_recover_balance+0x2f0/0x340
   __mutex_lock+0x7e/0x7b0
   ? btrfs_recover_balance+0x2f0/0x340
   ? btrfs_recover_balance+0x2f0/0x340
   ? rcu_read_lock_sched_held+0x3f/0x80
   ? kmem_cache_alloc_trace+0x2c4/0x2f0
   ? btrfs_get_64+0x5e/0x100
   btrfs_recover_balance+0x2f0/0x340
   open_ctree+0x1095/0x1726
   btrfs_mount_root.cold+0x12/0xea
   ? rcu_read_lock_sched_held+0x3f/0x80
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   ? __kmalloc_track_caller+0x2f2/0x320
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   ? capable+0x3a/0x60
   path_mount+0x433/0xc10
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is straightforward to fix, simply release the path before we setup
the balance_ctl.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:47:40 +01:00
Josef Bacik
6871845315 btrfs: don't get an EINTR during drop_snapshot for reloc
commit 18d3bff411c8d46d40537483bdc0b61b33ce0371 upstream.

This was partially fixed by f3e3d9cc3525 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction.  The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.

Fixes: f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:47:40 +01:00
Filipe Manana
d5f996bea4 btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
[ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

If we remount a filesystem in RO mode while the qgroup rescan worker is
running, we can end up having it still running after the remount is done,
and at unmount time we may end up with an open transaction that ends up
never getting committed. If that happens we end up with several memory
leaks and can crash when hardware acceleration is unavailable for crc32c.
Possibly it can lead to other nasty surprises too, due to use-after-free
issues.

The following steps explain how the problem happens.

1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
   running;

2) We remount the filesystem in RO mode, and never stop/pause the rescan
   worker, so after the remount the rescan worker is still running. The
   important detail here is that the rescan task is still running after
   the remount operation committed any ongoing transaction through its
   call to btrfs_commit_super();

3) The rescan is still running, and after the remount completed, the
   rescan worker started a transaction, after it finished iterating all
   leaves of the extent tree, to update the qgroup status item in the
   quotas tree. It does not commit the transaction, it only releases its
   handle on the transaction;

4) A filesystem unmount operation starts shortly after;

5) The unmount task, at close_ctree(), stops the transaction kthread,
   which had not had a chance to commit the open transaction since it was
   sleeping and the commit interval (default of 30 seconds) has not yet
   elapsed since the last time it committed a transaction;

6) So after stopping the transaction kthread we still have the transaction
   used to update the qgroup status item open. At close_ctree(), when the
   filesystem is in RO mode and no transaction abort happened (or the
   filesystem is in error mode), we do not expect to have any transaction
   open, so we do not call btrfs_commit_super();

7) We then proceed to destroy the work queues, free the roots and block
   groups, etc. After that we drop the last reference on the btree inode
   by calling iput() on it. Since there are dirty pages for the btree
   inode, corresponding to the COWed extent buffer for the quotas btree,
   btree_write_cache_pages() is invoked to flush those dirty pages. This
   results in creating a bio and submitting it, which makes us end up at
   btrfs_submit_metadata_bio();

8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
   that calls btrfs_wq_submit_bio(), because check_async_write() returned
   a value of 1. This value of 1 is because we did not have hardware
   acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
   set in fs_info->flags;

9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
   workqueue at fs_info->workers, which was already freed before by the
   call to btrfs_stop_all_workers() at close_ctree(). This results in an
   invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
  Code: f0 01 00 00 48 39 c2 75 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
  RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
  RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 48 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c6 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Code: 48 83 bb b0 03 00 00 00 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
  RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c7 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Code: ad de 49 be 22 01 00 (...)
  RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
  RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
  RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c8 ]---
  BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
  BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
  BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

  stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
  Code: 54 55 53 48 89 f3 (...)
  RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
  RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
  RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
  R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
  FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
   btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
   submit_one_bio+0x61/0x70 [btrfs]
   btree_write_cache_pages+0x414/0x450 [btrfs]
   ? kobject_put+0x9a/0x1d0
   ? trace_hardirqs_on+0x1b/0xf0
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   ? free_debug_processing+0x1e1/0x2b0
   do_writepages+0x43/0xe0
   ? lock_acquired+0x199/0x490
   __writeback_single_inode+0x59/0x650
   writeback_single_inode+0xaf/0x120
   write_inode_now+0x94/0xd0
   iput+0x187/0x2b0
   close_ctree+0x2c6/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f3cfebabee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
  RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
  R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

  =============================================================================
  BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x0000000050cbdd61 @offset=12104
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
	btrfs_free_tree_block+0x128/0x360 [btrfs]
	__btrfs_cow_block+0x489/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	sync_filesystem+0x74/0x90
	generic_shutdown_super+0x22/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x0000000086e9b0ff @offset=12776
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
	commit_cowonly_roots+0x248/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000001a340018 @offset=4408
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
	btrfs_free_tree_block+0x128/0x360 [btrfs]
	__btrfs_cow_block+0x489/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	btrfs_commit_transaction+0x60/0xc40 [btrfs]
	create_subvol+0x56a/0x990 [btrfs]
	btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
	__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
	btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
	btrfs_ioctl+0x1a92/0x36f0 [btrfs]
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x33/0x80
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x000000002b46292a @offset=13648
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? __mutex_unlock_slowpath+0x45/0x2a0
   kmem_cache_destroy+0x55/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000004cf95ea8 @offset=6264
  INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

Fix this issue by having the remount path stop the qgroup rescan worker
when we are remounting RO and teach the rescan worker to stop when a
remount is in progress. If later a remount in RW mode happens, we are
already resuming the qgroup rescan worker through the call to
btrfs_qgroup_rescan_resume(), so we do not need to worry about that.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:26:14 +01:00
Su Yue
72eb9fc82a btrfs: tree-checker: check if chunk item end overflows
[ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

While mounting a crafted image provided by user, kernel panics due to
the invalid chunk item whose end is less than start.

  [66.387422] loop: module loaded
  [66.389773] loop0: detected capacity change from 262144 to 0
  [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
  [66.431061] BTRFS info (device loop0): disk space caching is enabled
  [66.431078] BTRFS info (device loop0): has skinny extents
  [66.437101] BTRFS error: insert state: end < start 29360127 37748736
  [66.437136] ------------[ cut here ]------------
  [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
  [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G           O      5.11.0-rc1-custom #45
  [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
  [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
  [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
  [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
  [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
  [66.437447] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.437451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.437460] PKRU: 55555554
  [66.437464] Call Trace:
  [66.437475]  set_extent_bit+0x652/0x740 [btrfs]
  [66.437539]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.437576]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.437621]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.437674]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.437708]  ? kvm_sched_clock_read+0x18/0x40
  [66.437739]  open_ctree+0xb32/0x1734 [btrfs]
  [66.437781]  ? bdi_register_va+0x1b/0x20
  [66.437788]  ? super_setup_bdi_name+0x79/0xd0
  [66.437810]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.437854]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437873]  legacy_get_tree+0x34/0x60
  [66.437880]  vfs_get_tree+0x2d/0xc0
  [66.437888]  vfs_kern_mount.part.0+0x78/0xc0
  [66.437897]  vfs_kern_mount+0x13/0x20
  [66.437902]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.437940]  ? kfree+0x5ff/0x670
  [66.437944]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437962]  legacy_get_tree+0x34/0x60
  [66.437974]  vfs_get_tree+0x2d/0xc0
  [66.437983]  path_mount+0x48c/0xd30
  [66.437998]  __x64_sys_mount+0x108/0x140
  [66.438011]  do_syscall_64+0x38/0x50
  [66.438018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.438023] RIP: 0033:0x7f0138827f6e
  [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [66.438078] irq event stamp: 18169
  [66.438082] hardirqs last  enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0
  [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0
  [66.438092] softirqs last  enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438103] ---[ end trace e114b111db64298b ]---
  [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
  [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
  [66.441069] ------------[ cut here ]------------
  [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
  [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G        W  O      5.11.0-rc1-custom #45
  [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [66.458356] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.459841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.462196] PKRU: 55555554
  [66.462692] Call Trace:
  [66.463139]  set_extent_bit.cold+0x30/0x98 [btrfs]
  [66.464049]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.490466]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.514097]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.534976]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.555718]  ? kvm_sched_clock_read+0x18/0x40
  [66.575758]  open_ctree+0xb32/0x1734 [btrfs]
  [66.595272]  ? bdi_register_va+0x1b/0x20
  [66.614638]  ? super_setup_bdi_name+0x79/0xd0
  [66.633809]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.652938]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.671925]  legacy_get_tree+0x34/0x60
  [66.690300]  vfs_get_tree+0x2d/0xc0
  [66.708221]  vfs_kern_mount.part.0+0x78/0xc0
  [66.725808]  vfs_kern_mount+0x13/0x20
  [66.742730]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.759350]  ? kfree+0x5ff/0x670
  [66.775441]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.791750]  legacy_get_tree+0x34/0x60
  [66.807494]  vfs_get_tree+0x2d/0xc0
  [66.823349]  path_mount+0x48c/0xd30
  [66.838753]  __x64_sys_mount+0x108/0x140
  [66.854412]  do_syscall_64+0x38/0x50
  [66.869673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.885093] RIP: 0033:0x7f0138827f6e
  [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [67.216138] ---[ end trace e114b111db64298c ]---
  [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [67.465436] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [67.511660] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [67.558449] PKRU: 55555554
  [67.581146] note: mount[613] exited with preempt_count 2

The image has a chunk item which has a logical start 37748736 and length
18446744073701163008 (-8M). The calculated end 29360127 overflows.
EEXIST was caught by insert_state() because of the duplicate end and
extent_io_tree_panic() was called.

Add overflow check of chunk item end to tree checker so it can be
detected early at mount time.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:26:13 +01:00
Su Yue
82d1a5f6f2 btrfs: prevent NULL pointer dereference in extent_io_tree_panic
commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19 18:26:11 +01:00
Filipe Manana
64d06c7f2f btrfs: send: fix wrong file path when there is an inode with a pending rmdir
commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.

When doing an incremental send, if we have a new inode that happens to
have the same number that an old directory inode had in the base snapshot
and that old directory has a pending rmdir operation, we end up computing
a wrong path for the new inode, causing the receiver to fail.

Example reproducer:

  $ cat test-send-rmdir.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  mkdir $MNT/dir
  touch $MNT/dir/file1
  touch $MNT/dir/file2
  touch $MNT/dir/file3

  # Filesystem looks like:
  #
  # .                                     (ino 256)
  # |----- dir/                           (ino 257)
  #         |----- file1                  (ino 258)
  #         |----- file2                  (ino 259)
  #         |----- file3                  (ino 260)
  #

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now remove our directory and all its files.
  rm -fr $MNT/dir

  # Unmount the filesystem and mount it again. This is to ensure that
  # the next inode that is created ends up with the same inode number
  # that our directory "dir" had, 257, which is the first free "objectid"
  # available after mounting again the filesystem.
  umount $MNT
  mount $DEV $MNT

  # Now create a new file (it could be a directory as well).
  touch $MNT/newfile

  # Filesystem now looks like:
  #
  # .                                     (ino 256)
  # |----- newfile                        (ino 257)
  #

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  # Now unmount the filesystem, create a new one, mount it and try to apply
  # both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null

  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  umount $MNT

When running the test, the receive operation for the incremental stream
fails:

  $ ./test-send-rmdir.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: chown o257-9-0 failed: No such file or directory

So fix this by tracking directories that have a pending rmdir by inode
number and generation number, instead of only inode number.

A test case for fstests follows soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Tested-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:16:23 +01:00
Filipe Manana
6b0a4f603d btrfs: fix race when defragmenting leads to unnecessary IO
[ Upstream commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 ]

When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).

So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.

I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-06 14:48:36 +01:00
Qu Wenruo
e21d630a2c btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
commit c57dd1f2f6a7cd1bb61802344f59ccdc5278c983 upstream

[BUG]
The following script can lead to tons of beyond device boundary access:

  mkfs.btrfs -f $dev -b 10G
  mount $dev $mnt
  trimfs $mnt
  btrfs filesystem resize 1:-1G $mnt
  trimfs $mnt

[CAUSE]
Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.

So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.

But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.

This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.

Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.

[FIX]
This patch will fix the problem in two ways:

- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
  This is the root fix

- Add extra safety check when trimming free device extents
  We check and warn if the returned range is already beyond current
  device.

Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: adjust context and use extent_io.h]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:37 +01:00
Josef Bacik
1d11ed122f btrfs: do not shorten unpin len for caching block groups
commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.

While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space.  This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage.  This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.

Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:37 +01:00
Filipe Manana
a6676b0fa0 btrfs: fix lockdep splat when reading qgroup config on mount
commit 3d05cad3c357a2b749912914356072b38435edfa upstream.

Lockdep reported the following splat when running test btrfs/190 from
fstests:

  [ 9482.126098] ======================================================
  [ 9482.126184] WARNING: possible circular locking dependency detected
  [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
  [ 9482.126365] ------------------------------------------------------
  [ 9482.126456] mount/24187 is trying to acquire lock:
  [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.126647]
		 but task is already holding lock:
  [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.126886]
		 which lock already depends on the new lock.

  [ 9482.127078]
		 the existing dependency chain (in reverse order) is:
  [ 9482.127213]
		 -> #1 (btrfs-quota-00){++++}-{3:3}:
  [ 9482.127366]        lock_acquire+0xd8/0x490
  [ 9482.127436]        down_read_nested+0x45/0x220
  [ 9482.127528]        __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.127613]        btrfs_read_lock_root_node+0x41/0x130 [btrfs]
  [ 9482.127702]        btrfs_search_slot+0x514/0xc30 [btrfs]
  [ 9482.127788]        update_qgroup_status_item+0x72/0x140 [btrfs]
  [ 9482.127877]        btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
  [ 9482.127964]        btrfs_work_helper+0xf1/0x600 [btrfs]
  [ 9482.128039]        process_one_work+0x24e/0x5e0
  [ 9482.128110]        worker_thread+0x50/0x3b0
  [ 9482.128181]        kthread+0x153/0x170
  [ 9482.128256]        ret_from_fork+0x22/0x30
  [ 9482.128327]
		 -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
  [ 9482.128464]        check_prev_add+0x91/0xc60
  [ 9482.128551]        __lock_acquire+0x1740/0x3110
  [ 9482.128623]        lock_acquire+0xd8/0x490
  [ 9482.130029]        __mutex_lock+0xa3/0xb30
  [ 9482.130590]        qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.131577]        btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
  [ 9482.132175]        open_ctree+0x1228/0x18a0 [btrfs]
  [ 9482.132756]        btrfs_mount_root.cold+0x13/0xed [btrfs]
  [ 9482.133325]        legacy_get_tree+0x30/0x60
  [ 9482.133866]        vfs_get_tree+0x28/0xe0
  [ 9482.134392]        fc_mount+0xe/0x40
  [ 9482.134908]        vfs_kern_mount.part.0+0x71/0x90
  [ 9482.135428]        btrfs_mount+0x13b/0x3e0 [btrfs]
  [ 9482.135942]        legacy_get_tree+0x30/0x60
  [ 9482.136444]        vfs_get_tree+0x28/0xe0
  [ 9482.136949]        path_mount+0x2d7/0xa70
  [ 9482.137438]        do_mount+0x75/0x90
  [ 9482.137923]        __x64_sys_mount+0x8e/0xd0
  [ 9482.138400]        do_syscall_64+0x33/0x80
  [ 9482.138873]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 9482.139346]
		 other info that might help us debug this:

  [ 9482.140735]  Possible unsafe locking scenario:

  [ 9482.141594]        CPU0                    CPU1
  [ 9482.142011]        ----                    ----
  [ 9482.142411]   lock(btrfs-quota-00);
  [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
  [ 9482.143216]                                lock(btrfs-quota-00);
  [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
  [ 9482.144056]
		  *** DEADLOCK ***

  [ 9482.145242] 2 locks held by mount/24187:
  [ 9482.145637]  #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
  [ 9482.146061]  #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.146509]
		 stack backtrace:
  [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
  [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [ 9482.148709] Call Trace:
  [ 9482.149169]  dump_stack+0x8d/0xb5
  [ 9482.149628]  check_noncircular+0xff/0x110
  [ 9482.150090]  check_prev_add+0x91/0xc60
  [ 9482.150561]  ? kvm_clock_read+0x14/0x30
  [ 9482.151017]  ? kvm_sched_clock_read+0x5/0x10
  [ 9482.151470]  __lock_acquire+0x1740/0x3110
  [ 9482.151941]  ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.152402]  lock_acquire+0xd8/0x490
  [ 9482.152887]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.153354]  __mutex_lock+0xa3/0xb30
  [ 9482.153826]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.154301]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.154768]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.155226]  qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.155690]  btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
  [ 9482.156160]  open_ctree+0x1228/0x18a0 [btrfs]
  [ 9482.156643]  btrfs_mount_root.cold+0x13/0xed [btrfs]
  [ 9482.157108]  ? rcu_read_lock_sched_held+0x5d/0x90
  [ 9482.157567]  ? kfree+0x31f/0x3e0
  [ 9482.158030]  legacy_get_tree+0x30/0x60
  [ 9482.158489]  vfs_get_tree+0x28/0xe0
  [ 9482.158947]  fc_mount+0xe/0x40
  [ 9482.159403]  vfs_kern_mount.part.0+0x71/0x90
  [ 9482.159875]  btrfs_mount+0x13b/0x3e0 [btrfs]
  [ 9482.160335]  ? rcu_read_lock_sched_held+0x5d/0x90
  [ 9482.160805]  ? kfree+0x31f/0x3e0
  [ 9482.161260]  ? legacy_get_tree+0x30/0x60
  [ 9482.161714]  legacy_get_tree+0x30/0x60
  [ 9482.162166]  vfs_get_tree+0x28/0xe0
  [ 9482.162616]  path_mount+0x2d7/0xa70
  [ 9482.163070]  do_mount+0x75/0x90
  [ 9482.163525]  __x64_sys_mount+0x8e/0xd0
  [ 9482.163986]  do_syscall_64+0x33/0x80
  [ 9482.164437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 9482.164902] RIP: 0033:0x7f51e907caaa

This happens because at btrfs_read_qgroup_config() we can call
qgroup_rescan_init() while holding a read lock on a quota btree leaf,
acquired by the previous call to btrfs_search_slot_for_read(), and
qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.

A qgroup rescan worker does the opposite: it acquires the mutex
qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
update the qgroup status item in the quota btree through the call to
update_qgroup_status_item(). This inversion of locking order
between the qgroup_rescan_lock mutex and quota btree locks causes the
splat.

Fix this simply by releasing and freeing the path before calling
qgroup_rescan_init() at btrfs_read_qgroup_config().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02 08:49:45 +01:00
Johannes Thumshirn
6ea14731ac btrfs: don't access possibly stale fs_info data for printing duplicate device
commit 0697d9a610998b8bdee6b2390836cb2391d8fd1a upstream.

Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().

At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.

  ==================================================================
  BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
  Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

  CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1d6/0x29e lib/dump_stack.c:118
   print_address_description+0x66/0x620 mm/kasan/report.c:383
   __kasan_report mm/kasan/report.c:513 [inline]
   kasan_report+0x132/0x1d0 mm/kasan/report.c:530
   btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
   device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
   btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
   btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x44840a
  RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
  RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
  RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
  R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

  Allocated by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track mm/kasan/common.c:56 [inline]
   __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
   kmalloc_node include/linux/slab.h:577 [inline]
   kvmalloc_node+0x81/0x110 mm/util.c:574
   kvmalloc include/linux/mm.h:757 [inline]
   kvzalloc include/linux/mm.h:765 [inline]
   btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
   kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
   __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
   __cache_free mm/slab.c:3418 [inline]
   kfree+0x113/0x200 mm/slab.c:3756
   deactivate_locked_super+0xa7/0xf0 fs/super.c:335
   btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff8880878e0000
   which belongs to the cache kmalloc-16k of size 16384
  The buggy address is located 1704 bytes inside of
   16384-byte region [ffff8880878e0000, ffff8880878e4000)
  The buggy address belongs to the page:
  page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
  head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
  flags: 0xfffe0000010200(slab|head)
  raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
  raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
				    ^
   ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.

Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.

There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02 08:49:45 +01:00
David Sterba
12aedea582 btrfs: tree-checker: add missing returns after data_ref alignment checks
commit 6d06b0ad94d3dd7e3503d8ad39c39c4634884611 upstream.

There are sectorsize alignment checks that are reported but then
check_extent_data_ref continues. This was not intended, wrong alignment
is not a minor problem and we should return with error.

CC: stable@vger.kernel.org # 5.4+
Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02 08:49:45 +01:00
Daniel Xu
0115a26133 btrfs: tree-checker: add missing return after error in root_item
commit 1a49a97df657c63a4e8ffcd1ea9b6ed95581789b upstream.

There's a missing return statement after an error is found in the
root_item, this can cause further problems when a crafted image triggers
the error.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02 08:49:44 +01:00
Matthew Wilcox (Oracle)
33e53f2cac btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch
commit a1fbc6750e212c5675a4e48d7f51d44607eb8756 upstream.

On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.

CC: stable@vger.kernel.org # 4.4+
Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ define the variable instead of repeating the shift ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18 19:20:30 +01:00
Anand Jain
2033dd8852 btrfs: dev-replace: fail mount if we don't have replace item with target device
commit cf89af146b7e62af55470cf5f3ec3c56ec144a5e upstream.

If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image.  Fail the mount as this needs a closer
look what is actually wrong.

As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:

  WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x198/0x1fd lib/dump_stack.c:118
   panic+0x347/0x7c0 kernel/panic.c:231
   __warn.cold+0x20/0x46 kernel/panic.c:600
   report_bug+0x1bd/0x210 lib/bug.c:198
   handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
   exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
   asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
  RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
  RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
  RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
  RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
  RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
  R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
   close_fs_devices fs/btrfs/volumes.c:1193 [inline]
   btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
   open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
   btrfs_fill_super fs/btrfs/super.c:1316 [inline]
   btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).

CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18 19:20:29 +01:00
Josef Bacik
5af9630036 btrfs: fix min reserved size calculation in merge_reloc_root
commit fca3a45d08782a2bb85e048fb8e3128b1388d7b7 upstream.

The minimum reserve size was adjusted to take into account the height of
the tree we are merging, however we can have a root with a level == 0.
What we want is root_level + 1 to get the number of nodes we may have to
cow.  This fixes the enospc_debug warning pops with btrfs/101.

Nikolay: this fixes failures on btrfs/060 btrfs/062 btrfs/063 and
btrfs/195 That I was seeing, the call trace was:

  [ 3680.515564] ------------[ cut here ]------------
  [ 3680.515566] BTRFS: block rsv returned -28
  [ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
  [ 3680.515587] Modules linked in:
  [ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G        W         5.9.0-rc8-default #95
  [ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
  [ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
  [ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
  [ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
  [ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
  [ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
  [ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
  [ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
  [ 3680.515613] FS:  00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
  [ 3680.515615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
  [ 3680.515620] Call Trace:
  [ 3680.515627]  btrfs_alloc_tree_block+0x8b/0x340
  [ 3680.515633]  ? __lock_acquire+0x51a/0xac0
  [ 3680.515646]  alloc_tree_block_no_bg_flush+0x4f/0x60
  [ 3680.515651]  __btrfs_cow_block+0x14e/0x7e0
  [ 3680.515662]  btrfs_cow_block+0x144/0x2c0
  [ 3680.515670]  merge_reloc_root+0x4d4/0x610
  [ 3680.515675]  ? btrfs_lookup_fs_root+0x78/0x90
  [ 3680.515686]  merge_reloc_roots+0xee/0x280
  [ 3680.515695]  relocate_block_group+0x2ce/0x5e0
  [ 3680.515704]  btrfs_relocate_block_group+0x16e/0x310
  [ 3680.515711]  btrfs_relocate_chunk+0x38/0xf0
  [ 3680.515716]  btrfs_shrink_device+0x200/0x560
  [ 3680.515728]  btrfs_rm_device+0x1ae/0x6a6
  [ 3680.515744]  ? _copy_from_user+0x6e/0xb0
  [ 3680.515750]  btrfs_ioctl+0x1afe/0x28c0
  [ 3680.515755]  ? find_held_lock+0x2b/0x80
  [ 3680.515760]  ? do_user_addr_fault+0x1f8/0x418
  [ 3680.515773]  ? __x64_sys_ioctl+0x77/0xb0
  [ 3680.515775]  __x64_sys_ioctl+0x77/0xb0
  [ 3680.515781]  do_syscall_64+0x31/0x70
  [ 3680.515785]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18 19:20:28 +01:00
Dinghao Liu
8266c23124 btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
commit 468600c6ec28613b756193c5f780aac062f1acdf upstream.

There is one error handling path that does not free ref, which may cause
a minor memory leak.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18 19:20:28 +01:00
Johannes Thumshirn
a8ec66026d btrfs: reschedule when cloning lots of extents
[ Upstream commit 6b613cc97f0ace77f92f7bc112b8f6ad3f52baf8 ]

We have several occurrences of a soft lockup from fstest's generic/175
testcase, which look more or less like this one:

  watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack+0x77/0xa0
   panic+0xfa/0x2cb
   watchdog_timer_fn.cold+0x85/0xa5
   ? lockup_detector_update_enable+0x50/0x50
   __hrtimer_run_queues+0x99/0x4c0
   ? recalibrate_cpu_khz+0x10/0x10
   hrtimer_run_queues+0x9f/0xb0
   update_process_times+0x28/0x80
   tick_handle_periodic+0x1b/0x60
   __sysvec_apic_timer_interrupt+0x76/0x210
   asm_call_on_stack+0x12/0x20
   </IRQ>
   sysvec_apic_timer_interrupt+0x7f/0x90
   asm_sysvec_apic_timer_interrupt+0x12/0x20
  RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
  RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
  RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
  RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
  RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
  R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
  R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
   ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
   btrfs_release_path+0x67/0x80 [btrfs]
   btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
   btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
   btrfs_clone+0x9ba/0xbd0 [btrfs]
   btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
   ? file_update_time+0xcd/0x120
   btrfs_remap_file_range+0x322/0x3b0 [btrfs]
   do_clone_file_range+0xb7/0x1e0
   vfs_clone_file_range+0x30/0xa0
   ioctl_file_clone+0x8a/0xc0
   do_vfs_ioctl+0x5b2/0x6f0
   __x64_sys_ioctl+0x37/0xa0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f87977fc247
  RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
  RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
  RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
  R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
  R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---

All of these lockup reports have the call chain btrfs_clone_files() ->
btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
both source and destination extents locked and loops over the source
extent to create the clones.

Conditionally reschedule in the btrfs_clone() loop, to give some time back
to other processes.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18 19:20:16 +01:00
Josef Bacik
0ee771e969 btrfs: sysfs: init devices outside of the chunk_mutex
[ Upstream commit ca10845a56856fff4de3804c85e6424d0f6d0cde ]

While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc3+ #4 Not tainted
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc+0x37/0x270
	 alloc_inode+0x82/0xb0
	 iget_locked+0x10d/0x2c0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x240
	 sysfs_get_tree+0x16/0x40
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (kernfs_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_link+0x63/0xa0
	 sysfs_do_create_link_sd+0x5e/0xd0
	 btrfs_sysfs_add_devices_dir+0x81/0x130
	 btrfs_init_new_device+0x67f/0x1250
	 btrfs_ioctl+0x1ef/0x2e20
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_insert_empty_items+0x64/0xb0
	 btrfs_insert_delayed_items+0x90/0x4f0
	 btrfs_commit_inode_delayed_items+0x93/0x140
	 btrfs_log_inode+0x5de/0x2020
	 btrfs_log_inode_parent+0x429/0xc90
	 btrfs_log_new_name+0x95/0x9b
	 btrfs_rename2+0xbb9/0x1800
	 vfs_rename+0x64f/0x9f0
	 do_renameat2+0x320/0x4e0
	 __x64_sys_rename+0x1f/0x30
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x119c/0x1fc0
	 lock_acquire+0xa7/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x119c/0x1fc0
   lock_acquire+0xa7/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa7/0x3d0
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This happens because we are holding the chunk_mutex at the time of
adding in a new device.  However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices.  Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.

CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18 19:20:16 +01:00
Qu Wenruo
c58fa93b14 btrfs: tracepoints: output proper root owner for trace_find_free_extent()
The current trace event always output result like this:

 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)

T's saying we're allocating data extent for EXTENT tree, which is not
even possible.

It's because we always use EXTENT tree as the owner for
trace_find_free_extent() without using the @root from
btrfs_reserve_extent().

This patch will change the parameter to use proper @root for
trace_find_free_extent():

Now it looks much better:

 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)

Reported-by: Hans van Kranenburg <hans@knorrie.org>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-18 19:20:16 +01:00
Filipe Manana
c964d386e8 btrfs: fix readahead hang and use-after-free after removing a device
commit 66d204a16c94f24ad08290a7663ab67e7fc04e82 upstream.

Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

  [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
  [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
  [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
  [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
  [162513.514751] Call Trace:
  [162513.514761]  __schedule+0x5ce/0xd00
  [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.514771]  schedule+0x46/0xf0
  [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.514850]  ? finish_wait+0x90/0x90
  [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
  [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
  [162513.514894]  kthread+0x153/0x170
  [162513.514897]  ? kthread_stop+0x2c0/0x2c0
  [162513.514902]  ret_from_fork+0x22/0x30
  [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
  [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
  [162513.515682] Call Trace:
  [162513.515688]  __schedule+0x5ce/0xd00
  [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.515697]  schedule+0x46/0xf0
  [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.515716]  ? finish_wait+0x90/0x90
  [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.515761]  iterate_supers+0x87/0xf0
  [162513.515765]  ksys_sync+0x60/0xb0
  [162513.515768]  __do_sys_sync+0xa/0x10
  [162513.515771]  do_syscall_64+0x33/0x80
  [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.515781] RIP: 0033:0x7f5238f50bd7
  [162513.515782] Code: Bad RIP value.
  [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
  [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
  [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
  [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
  [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
  [162513.516620] Call Trace:
  [162513.516625]  __schedule+0x5ce/0xd00
  [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.516634]  schedule+0x46/0xf0
  [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.516650]  ? finish_wait+0x90/0x90
  [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
  [162513.516686]  __vfs_setxattr+0x66/0x80
  [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
  [162513.516697]  vfs_setxattr+0x6b/0x120
  [162513.516703]  setxattr+0x125/0x240
  [162513.516709]  ? lock_acquire+0xb1/0x480
  [162513.516712]  ? mnt_want_write+0x20/0x50
  [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
  [162513.516723]  ? preempt_count_add+0x49/0xa0
  [162513.516725]  ? __sb_start_write+0x19b/0x290
  [162513.516727]  ? preempt_count_add+0x49/0xa0
  [162513.516732]  path_setxattr+0xba/0xd0
  [162513.516739]  __x64_sys_setxattr+0x27/0x30
  [162513.516741]  do_syscall_64+0x33/0x80
  [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.516745] RIP: 0033:0x7f5238f56d5a
  [162513.516746] Code: Bad RIP value.
  [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
  [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
  [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
  [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
  [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
  [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
  [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
  [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
  [162513.517780] Call Trace:
  [162513.517786]  __schedule+0x5ce/0xd00
  [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.517796]  schedule+0x46/0xf0
  [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.517814]  ? finish_wait+0x90/0x90
  [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.517865]  iterate_supers+0x87/0xf0
  [162513.517869]  ksys_sync+0x60/0xb0
  [162513.517872]  __do_sys_sync+0xa/0x10
  [162513.517875]  do_syscall_64+0x33/0x80
  [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.517881] RIP: 0033:0x7f5238f50bd7
  [162513.517883] Code: Bad RIP value.
  [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
  [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
  [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
  [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
  [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
  [162513.519160] Call Trace:
  [162513.519165]  __schedule+0x5ce/0xd00
  [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.519174]  schedule+0x46/0xf0
  [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.519193]  ? finish_wait+0x90/0x90
  [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
  [162513.519230]  lookup_open+0x522/0x650
  [162513.519246]  path_openat+0x2b8/0xa50
  [162513.519270]  do_filp_open+0x91/0x100
  [162513.519275]  ? find_held_lock+0x32/0x90
  [162513.519280]  ? lock_acquired+0x33b/0x470
  [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
  [162513.519287]  ? _raw_spin_unlock+0x29/0x40
  [162513.519295]  do_sys_openat2+0x20d/0x2d0
  [162513.519300]  do_sys_open+0x44/0x80
  [162513.519304]  do_syscall_64+0x33/0x80
  [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.519309] RIP: 0033:0x7f5238f4a903
  [162513.519310] Code: Bad RIP value.
  [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
  [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
  [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
  [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
  [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
  [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
  [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
  [162513.520511] Call Trace:
  [162513.520516]  __schedule+0x5ce/0xd00
  [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.520525]  schedule+0x46/0xf0
  [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
  [162513.520548]  ? finish_wait+0x90/0x90
  [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
  [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
  [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
  [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
  [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
  [162513.520643]  ? do_sigaction+0xf3/0x240
  [162513.520645]  ? find_held_lock+0x32/0x90
  [162513.520648]  ? do_sigaction+0xf3/0x240
  [162513.520651]  ? lock_acquired+0x33b/0x470
  [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
  [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
  [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
  [162513.520662]  ? do_sigaction+0xf3/0x240
  [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
  [162513.520672]  __x64_sys_ioctl+0x83/0xb0
  [162513.520677]  do_syscall_64+0x33/0x80
  [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.520681] RIP: 0033:0x7fc3cd307d87
  [162513.520682] Code: Bad RIP value.
  [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
  [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
  [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
  [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
  [162513.520703]
		  Showing all locks held in the system:
  [162513.520712] 1 lock held by khungtaskd/54:
  [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
  [162513.520728] 1 lock held by in:imklog/596:
  [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
  [162513.520782] 1 lock held by btrfs-transacti/1356167:
  [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
  [162513.520798] 1 lock held by btrfs/1356190:
  [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
  [162513.520805] 1 lock held by fsstress/1356184:
  [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520811] 3 locks held by fsstress/1356185:
  [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
  [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520833] 1 lock held by fsstress/1356196:
  [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520838] 3 locks held by fsstress/1356197:
  [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
  [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520858] 2 locks held by btrfs/1356211:
  [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
  [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

  >>> t = find_task(prog, 1356190)
  >>> prog.stack_trace(t)
  #0  __schedule+0x5ce/0xcfc
  #1  schedule+0x46/0xe4
  #2  schedule_timeout+0x1df/0x475
  #3  btrfs_reada_wait+0xda/0x132
  #4  scrub_stripe+0x2a8/0x112f
  #5  scrub_chunk+0xcd/0x134
  #6  scrub_enumerate_chunks+0x29e/0x5ee
  #7  btrfs_scrub_dev+0x2d5/0x91b
  #8  btrfs_ioctl+0x7f5/0x36e7
  #9  __x64_sys_ioctl+0x83/0xb0
  #10 do_syscall_64+0x33/0x77
  #11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
        if (!atomic_read(&fs_info->reada_works_cnt))
            reada_start_machine(fs_info);
        wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                          (HZ + 9) / 10);
    }
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

  $ cat dump_reada.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  def dump_re(re):
      nzones = re.nzones.value_()
      print(f're at {hex(re.value_())}')
      print(f'\t logical {re.logical.value_()}')
      print(f'\t refcnt {re.refcnt.value_()}')
      print(f'\t nzones {nzones}')
      for i in range(nzones):
          dev = re.zones[i].device
          name = dev.name.str.string_()
          print(f'\t\t dev id {dev.devid.value_()} name {name}')
      print()

  for _, e in radix_tree_for_each(fs_info.reada_tree):
      re = cast('struct reada_extent *', e)
      dump_re(re)

  $ drgn dump_reada.py
  re at 0xffff8f3da9d25ad8
          logical 38928384
          refcnt 1
          nzones 1
                 dev id 0 name b'/dev/sdd'
  $

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

  $ cat dump_block_groups.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  BTRFS_BLOCK_GROUP_DATA = (1 << 0)
  BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
  BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
  BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
  BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
  BTRFS_BLOCK_GROUP_DUP = (1 << 5)
  BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
  BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
  BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
  BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
  BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

  def bg_flags_string(bg):
      flags = bg.flags.value_()
      ret = ''
      if flags & BTRFS_BLOCK_GROUP_DATA:
          ret = 'data'
      if flags & BTRFS_BLOCK_GROUP_METADATA:
          if len(ret) > 0:
              ret += '|'
          ret += 'meta'
      if flags & BTRFS_BLOCK_GROUP_SYSTEM:
          if len(ret) > 0:
              ret += '|'
          ret += 'system'
      if flags & BTRFS_BLOCK_GROUP_RAID0:
          ret += ' raid0'
      elif flags & BTRFS_BLOCK_GROUP_RAID1:
          ret += ' raid1'
      elif flags & BTRFS_BLOCK_GROUP_DUP:
          ret += ' dup'
      elif flags & BTRFS_BLOCK_GROUP_RAID10:
          ret += ' raid10'
      elif flags & BTRFS_BLOCK_GROUP_RAID5:
          ret += ' raid5'
      elif flags & BTRFS_BLOCK_GROUP_RAID6:
          ret += ' raid6'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
          ret += ' raid1c3'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
          ret += ' raid1c4'
      else:
          ret += ' single'

      return ret

  def dump_bg(bg):
      print()
      print(f'block group at {hex(bg.value_())}')
      print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
      print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

  bg_root = fs_info.block_group_cache_tree.address_of_()
  for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
      dump_bg(bg)

  $ drgn dump_block_groups.py

  block group at 0xffff8f3d673b0400
         start 22020096 length 16777216
         flags 258 - system raid6

  block group at 0xffff8f3d53ddb400
         start 38797312 length 536870912
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4d9c00
         start 575668224 length 2147483648
         flags 257 - data raid6

  block group at 0xffff8f3d08189000
         start 2723151872 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3db70ff000
         start 2790260736 length 1073741824
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4dd800
         start 3864002560 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3d67037000
         start 3931111424 length 2147483648
         flags 257 - data raid6
  $

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
   device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
   starting to scrub a stripe from block group X. This call to
   btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
   calls reada_add_block(), it passes the logical address of the extent
   tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
   It finds there isn't any existing readahead extent for the logical
   address 38928384, so it proceeds to the path of creating a new one.

   It calls btrfs_map_block() to find out which stripes exist for the block
   group X. On the first iteration of the for loop that iterates over the
   stripes, it finds the stripe for device /dev/sdd, so it creates one
   zone for that device and adds it to the readahead extent. Before getting
   into the second iteration of the loop, the cleanup kthread deletes block
   group X because it was empty. So in the iterations for the remaining
   stripes it does not add more zones to the readahead extent, because the
   calls to reada_find_zone() returned NULL because they couldn't find
   block group X anymore.

   As a result the new readahead extent has a single zone, corresponding to
   the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
   for the readahead work queue, task B finishes the device replace and at
   btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
   device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
   "->elems" of the reada_control structure allocated at btrfs_reada_add().

   Then it returns back to btrfs_reada_add() and calls
   reada_start_machine(). This queues a job in the readahead work queue to
   run the function reada_start_machine_worker(), which calls
   __reada_start_machine().

   At __reada_start_machine() we take the device list mutex and for each
   device found in the current device list, we call
   reada_start_machine_dev() to start the readahead work. However at this
   point the device /dev/sdd was already freed and is not in the device
   list anymore.

   This means the corresponding readahead for the extent at 38928384 is
   never started, and therefore the "->elems" counter of the reada_control
   structure allocated at btrfs_reada_add() never goes down to 0, causing
   the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
   and the corresponding btrfs_device structure is freed. However the
   readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
   current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
   device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
   __readahead_hook() which ends up dropping the last reference on the
   readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
   and attempts to delete an element from the device's 'reada_extents'
   radix tree, resulting in a use-after-free, as the device pointer of the
   zone for /dev/sdd is now stale. We can also access the device after
   dropping the last reference of a zone, through reada_zone_release(),
   also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Filipe Manana
dfda50e882 btrfs: fix use-after-free on readahead extent after failure to create it
commit 83bc1560e02e25c6439341352024ebe8488f4fbd upstream.

If we fail to find suitable zones for a new readahead extent, we end up
leaving a stale pointer in the global readahead extents radix tree
(fs_info->reada_tree), which can trigger the following trace later on:

  [13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
  [13367.696802] #PF: supervisor read access in kernel mode
  [13367.697249] #PF: error_code(0x0000) - not-present page
  [13367.697721] PGD 0 P4D 0
  [13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G        W         5.9.0-rc6-btrfs-next-69 #1
  [13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
  [13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
  [13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
  [13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
  [13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
  [13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
  [13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
  [13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
  [13367.704880] FS:  00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
  [13367.705438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
  [13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [13367.707686] Call Trace:
  [13367.708246]  ? ___slab_alloc+0x395/0x740
  [13367.708820]  ? reada_add_block+0xae/0xee0 [btrfs]
  [13367.709383]  lock_acquire+0xb1/0x480
  [13367.709955]  ? reada_add_block+0xe0/0xee0 [btrfs]
  [13367.710537]  ? reada_add_block+0xae/0xee0 [btrfs]
  [13367.711097]  ? rcu_read_lock_sched_held+0x5d/0x90
  [13367.711659]  ? kmem_cache_alloc_trace+0x8d2/0x990
  [13367.712221]  ? lock_acquired+0x33b/0x470
  [13367.712784]  _raw_spin_lock+0x34/0x80
  [13367.713356]  ? reada_add_block+0xe0/0xee0 [btrfs]
  [13367.713966]  reada_add_block+0xe0/0xee0 [btrfs]
  [13367.714529]  ? btrfs_root_node+0x15/0x1f0 [btrfs]
  [13367.715077]  btrfs_reada_add+0x117/0x170 [btrfs]
  [13367.715620]  scrub_stripe+0x21e/0x10d0 [btrfs]
  [13367.716141]  ? kvm_sched_clock_read+0x5/0x10
  [13367.716657]  ? __lock_acquire+0x41e/0x3970
  [13367.717184]  ? scrub_chunk+0x60/0x140 [btrfs]
  [13367.717697]  ? find_held_lock+0x32/0x90
  [13367.718254]  ? scrub_chunk+0x60/0x140 [btrfs]
  [13367.718773]  ? lock_acquired+0x33b/0x470
  [13367.719278]  ? scrub_chunk+0xcd/0x140 [btrfs]
  [13367.719786]  scrub_chunk+0xcd/0x140 [btrfs]
  [13367.720291]  scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
  [13367.720787]  ? finish_wait+0x90/0x90
  [13367.721281]  btrfs_scrub_dev+0x1ee/0x620 [btrfs]
  [13367.721762]  ? rcu_read_lock_any_held+0x8e/0xb0
  [13367.722235]  ? preempt_count_add+0x49/0xa0
  [13367.722710]  ? __sb_start_write+0x19b/0x290
  [13367.723192]  btrfs_ioctl+0x7f5/0x36f0 [btrfs]
  [13367.723660]  ? __fget_files+0x101/0x1d0
  [13367.724118]  ? find_held_lock+0x32/0x90
  [13367.724559]  ? __fget_files+0x101/0x1d0
  [13367.724982]  ? __x64_sys_ioctl+0x83/0xb0
  [13367.725399]  __x64_sys_ioctl+0x83/0xb0
  [13367.725802]  do_syscall_64+0x33/0x80
  [13367.726188]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [13367.726574] RIP: 0033:0x7fe1df7add87
  [13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
  [13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
  [13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
  [13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
  [13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
  [13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
  [13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
  [13367.732638] CR2: 00000000000000b0
  [13367.733166] ---[ end trace d298b6805556acd9 ]---

What happens is the following:

1) At reada_find_extent() we don't find any existing readahead extent for
   the metadata extent starting at logical address X;

2) So we proceed to create a new one. We then call btrfs_map_block() to get
   information about which stripes contain extent X;

3) After that we iterate over the stripes and create only one zone for the
   readahead extent - only one because reada_find_zone() returned NULL for
   all iterations except for one, either because a memory allocation failed
   or it couldn't find the block group of the extent (it may have just been
   deleted);

4) We then add the new readahead extent to the readahead extents radix
   tree at fs_info->reada_tree;

5) Then we iterate over each zone of the new readahead extent, and find
   that the device used for that zone no longer exists, because it was
   removed or it was the source device of a device replace operation.
   Since this left 'have_zone' set to 0, after finishing the loop we jump
   to the 'error' label, call kfree() on the new readahead extent and
   return without removing it from the radix tree at fs_info->reada_tree;

6) Any future call to reada_find_extent() for the logical address X will
   find the stale pointer in the readahead extents radix tree, increment
   its reference counter, which can trigger the use-after-free right
   away or return it to the caller reada_add_block() that results in the
   use-after-free of the example trace above.

So fix this by making sure we delete the readahead extent from the radix
tree if we fail to setup zones for it (when 'have_zone = 0').

Fixes: 319450211842ba ("btrfs: reada: bypass adding extent when all zone failed")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Daniel Xu
834a61b212 btrfs: tree-checker: validate number of chunk stripes and parity
commit 85d07fbe09efd1c529ff3e025e2f0d2c6c96a1b7 upstream.

If there's no parity and num_stripes < ncopies, a crafted image can
trigger a division by zero in calc_stripe_length().

The image was generated through fuzzing.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209587
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Josef Bacik
1cedc54ad3 btrfs: cleanup cow block on error
commit 572c83acdcdafeb04e70aa46be1fa539310be20c upstream.

In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:

  $ cat /proc/2143497/stack
  [<0>] __btrfs_tree_lock+0x108/0x250
  [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
  [<0>] btree_write_cache_pages+0x15a/0x3b0
  [<0>] do_writepages+0x28/0xb0
  [<0>] __writeback_single_inode+0x54/0x5c0
  [<0>] writeback_sb_inodes+0x1e8/0x510
  [<0>] wb_writeback+0xcc/0x440
  [<0>] wb_workfn+0xd7/0x650
  [<0>] process_one_work+0x236/0x560
  [<0>] worker_thread+0x55/0x3c0
  [<0>] kthread+0x13a/0x150
  [<0>] ret_from_fork+0x1f/0x30

This is because we got an error while COWing a block, specifically here

        if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                if (ret) {
                        btrfs_abort_transaction(trans, ret);
                        return ret;
                }
        }

  [16402.241552] BTRFS: Transaction aborted (error -2)
  [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
  [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
  [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
  [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
  [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
  [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
  [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
  [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
  [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
  [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
  [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
  [16402.257867] Call Trace:
  [16402.258072]  btrfs_cow_block+0x109/0x230
  [16402.258356]  btrfs_search_slot+0x530/0x9d0
  [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
  [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
  [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
  [16402.259949]  btrfs_replace_file_extents+0x190/0x820
  [16402.260873]  btrfs_clone+0x9ae/0xc00
  [16402.261139]  btrfs_extent_same_range+0x66/0x90
  [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
  [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
  [16402.262821]  vfs_dedupe_file_range+0x189/0x220
  [16402.263150]  do_vfs_ioctl+0x552/0x700
  [16402.263662]  __x64_sys_ioctl+0x62/0xb0
  [16402.264023]  do_syscall_64+0x33/0x40
  [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [16402.264862] RIP: 0033:0x7f542c7d15cb
  [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
  [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
  [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
  [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
  [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
  [16402.271498] irq event stamp: 0
  [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
  [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
  [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [16402.274338] ---[ end trace 737874a5a41a8236 ]---
  [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.276179] BTRFS info (device dm-9): forced readonly
  [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
  [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30

The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode.  This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.

Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Qu Wenruo
d3ce2d0fb8 btrfs: tree-checker: fix false alert caused by legacy btrfs root item
commit 1465af12e254a68706e110846f59cf0f09683184 upstream.

Commit 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
introduced btrfs root item size check, however btrfs root item has two
versions, the legacy one which just ends before generation_v2 member, is
smaller than current btrfs root item size.

This caused btrfs kernel to reject valid but old tree root leaves.

Fix this problem by also allowing legacy root item, since kernel can
already handle them pretty well and upgrade to newer root item format
when needed.

Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Denis Efremov
4b82b8aba0 btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send()
commit 8eb2fd00153a3a96a19c62ac9c6d48c2efebe5e8 upstream.

btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.

[1] https://patchwork.kernel.org/patch/9757891/#20529627

Fixes: 818e010bf9d0 ("btrfs: replace opencoded kvzalloc with the helper")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Filipe Manana
6ec4b82fc3 btrfs: send, recompute reference path after orphanization of a directory
commit 9c2b4e0347067396ceb3ae929d6888c81d610259 upstream.

During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.

Example reproducer:

  $ cat reproducer.sh
  #!/bin/bash

  mkfs.btrfs -f /dev/sdi >/dev/null
  mount /dev/sdi /mnt/sdi

  touch /mnt/sdi/f1
  touch /mnt/sdi/f2
  mkdir /mnt/sdi/d1
  mkdir /mnt/sdi/d1/d2

  # Filesystem looks like:
  #
  # .                           (ino 256)
  # |----- f1                   (ino 257)
  # |----- f2                   (ino 258)
  # |----- d1/                  (ino 259)
  #        |----- d2/           (ino 260)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
  btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

  # Now do a series of changes such that:
  #
  # *) inode 258 has one new hardlink and the previous name changed
  #
  # *) both names conflict with the old names of two other inodes:
  #
  #    1) the new name "d1" conflicts with the old name of inode 259,
  #       under directory inode 256 (root)
  #
  #    2) the new name "d2" conflicts with the old name of inode 260
  #       under directory inode 259
  #
  # *) inodes 259 and 260 now have the old names of inode 258
  #
  # *) inode 257 is now located under inode 260 - an inode with a number
  #    smaller than the inode (258) for which we created a second hard
  #    link and swapped its names with inodes 259 and 260
  #
  ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
  mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1

  # Swap d1 and f2.
  mv /mnt/sdi/d1 /mnt/sdi/tmp
  mv /mnt/sdi/f2 /mnt/sdi/d1
  mv /mnt/sdi/tmp /mnt/sdi/f2

  # Swap d2 and f2_link
  mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
  mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
  mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link

  # Filesystem now looks like:
  #
  # .                                (ino 256)
  # |----- d1                        (ino 258)
  # |----- f2/                       (ino 259)
  #        |----- f2_link/           (ino 260)
  #        |       |----- f1         (ino 257)
  #        |
  #        |----- d2                 (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
  btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

  mkfs.btrfs -f /dev/sdj >/dev/null
  mount /dev/sdj /mnt/sdj

  btrfs receive -f /tmp/snap1.send /mnt/sdj
  btrfs receive -f /tmp/snap2.send /mnt/sdj

  umount /mnt/sdi
  umount /mnt/sdj

When executed the receive of the incremental stream fails:

  $ ./reproducer.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory

This happens because:

1) When processing inode 257 we end up computing the name for inode 259
   because it is an ancestor in the send snapshot, and at that point it
   still has its old name, "d1", from the parent snapshot because inode
   259 was not yet processed. We then cache that name, which is valid
   until we start processing inode 259 (or set the progress to 260 after
   processing its references);

2) Later we start processing inode 258 and collecting all its new
   references into the list sctx->new_refs. The first reference in the
   list happens to be the reference for name "d1" while the reference for
   name "d2" is next (the last element of the list).
   We compute the full path "d1/d2" for this second reference and store
   it in the reference (its ->full_path member). The path used for the
   new parent directory was "d1" and not "f2" because inode 259, the
   new parent, was not yet processed;

3) When we start processing the new references at process_recorded_refs()
   we start with the first reference in the list, for the new name "d1".
   Because there is a conflicting inode that was not yet processed, which
   is directory inode 259, we orphanize it, renaming it from "d1" to
   "o259-6-0";

4) Then we start processing the new reference for name "d2", and we
   realize it conflicts with the reference of inode 260 in the parent
   snapshot. So we issue an orphanization operation for inode 260 by
   emitting a rename operation with a destination path of "o260-6-0"
   and a source path of "d1/d2" - this source path is the value we
   stored in the reference earlier at step 2), corresponding to the
   ->full_path member of the reference, however that path is no longer
   valid due to the orphanization of the directory inode 259 in step 3).
   This makes the receiver fail since the path does not exists, it should
   have been "o259-6-0/d2".

Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Filipe Manana
c2dcc9b03b btrfs: send, orphanize first all conflicting inodes when processing references
commit 98272bb77bf4cc20ed1ffca89832d713e70ebf09 upstream.

When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.

The following reproducer triggers such scenario:

  $ cat reproducer.sh
  #!/bin/bash

  mkfs.btrfs -f /dev/sdi >/dev/null
  mount /dev/sdi /mnt/sdi

  touch /mnt/sdi/a
  touch /mnt/sdi/b
  mkdir /mnt/sdi/testdir
  # We want "a" to have a lower inode number then "testdir" (257 vs 259).
  mv /mnt/sdi/a /mnt/sdi/testdir/a

  # Filesystem looks like:
  #
  # .                           (ino 256)
  # |----- testdir/             (ino 259)
  # |          |----- a         (ino 257)
  # |
  # |----- b                    (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
  btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

  # Now rename 259 to "testdir_2", then change the name of 257 to
  # "testdir" and make it a direct descendant of the root inode (256).
  # Also create a new link for inode 257 with the old name of inode 258.
  # By swapping the names and location of several inodes and create a
  # nasty dependency chain of rename and link operations.
  mv /mnt/sdi/testdir/a /mnt/sdi/a2
  touch /mnt/sdi/testdir/a
  mv /mnt/sdi/b /mnt/sdi/b2
  ln /mnt/sdi/a2 /mnt/sdi/b
  mv /mnt/sdi/testdir /mnt/sdi/testdir_2
  mv /mnt/sdi/a2 /mnt/sdi/testdir

  # Filesystem now looks like:
  #
  # .                            (ino 256)
  # |----- testdir_2/            (ino 259)
  # |          |----- a          (ino 260)
  # |
  # |----- testdir               (ino 257)
  # |----- b                     (ino 257)
  # |----- b2                    (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
  btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

  mkfs.btrfs -f /dev/sdj >/dev/null
  mount /dev/sdj /mnt/sdj

  btrfs receive -f /tmp/snap1.send /mnt/sdj
  btrfs receive -f /tmp/snap2.send /mnt/sdj

  umount /mnt/sdi
  umount /mnt/sdj

When running the reproducer, the receive of the incremental send stream
fails:

  $ ./reproducer.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: link b -> o259-6-0/a failed: No such file or directory

The problem happens because of the following:

1) Before we start iterating the list of new references for inode 257,
   we generate its current path and store it at @valid_path, done at
   the very beginning of process_recorded_refs(). The generated path
   is "o259-6-0/a", containing the orphanized name for inode 259;

2) Then we iterate over the list of new references, which has the
   references "b" and "testdir" in that specific order;

3) We process reference "b" first, because it is in the list before
   reference "testdir". We then issue a link operation to create
   the new reference "b" using a target path corresponding to the
   content at @valid_path, which corresponds to "o259-6-0/a".
   However we haven't yet orphanized inode 259, its name is still
   "testdir", and not "o259-6-0". The orphanization of 259 did not
   happen yet because we will process the reference named "testdir"
   for inode 257 only in the next iteration of the loop that goes
   over the list of new references.

Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Filipe Manana
e1cf034899 btrfs: reschedule if necessary when logging directory items
commit bb56f02f26fe23798edb1b2175707419b28c752a upstream.

Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.

Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.

So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.

The stack trace produced when the soft lockup happens is the following:

[10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
[10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
[10480.284915] irq event stamp: 29646366
[10480.285987] hardirqs last  enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
[10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
[10480.290856] softirqs last  enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
[10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
[10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
[10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
[10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
[10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
[10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
[10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
[10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
[10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
[10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
[10480.321104] FS:  00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
[10480.323559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
[10480.327259] Call Trace:
[10480.328286]  ? overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.329784]  __kmalloc+0x831/0xa20
[10480.331009]  ? btrfs_get_32+0xb0/0x1d0 [btrfs]
[10480.332464]  overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.333948]  log_dir_items+0x2ee/0x570 [btrfs]
[10480.335413]  log_directory_changes+0x82/0xd0 [btrfs]
[10480.336926]  btrfs_log_inode+0xc9b/0xda0 [btrfs]
[10480.338374]  ? init_once+0x20/0x20 [btrfs]
[10480.339711]  btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
[10480.341257]  ? dget_parent+0x97/0x2e0
[10480.342480]  btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
[10480.343977]  btrfs_sync_file+0x24b/0x5e0 [btrfs]
[10480.345381]  do_fsync+0x38/0x70
[10480.346483]  __x64_sys_fsync+0x10/0x20
[10480.347703]  do_syscall_64+0x2d/0x70
[10480.348891]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10480.350444] RIP: 0033:0x7fa1dc80970b
[10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
[10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
[10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
[10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
[10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
[10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60

Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
CC: stable@vger.kernel.org # 4.4+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Anand Jain
223b462744 btrfs: improve device scanning messages
commit 79dae17d8d44b2d15779e332180080af45df5352 upstream.

Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.

The current message is too long, so drop the unnecessary UUID and add
process name and PID.

While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent

CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Qu Wenruo
c5f2a50912 btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode
commit b4c5d8fdfff3e2b6c4fa4a5043e8946dff500f8c upstream.

For delayed inode facility, qgroup metadata is reserved for it, and
later freed.

However we're freeing more bytes than we reserved.
In btrfs_delayed_inode_reserve_metadata():

	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
	...
		ret = btrfs_qgroup_reserve_meta_prealloc(root,
				fs_info->nodesize, true);
		...
		if (!ret) {
			node->bytes_reserved = num_bytes;

But in btrfs_delayed_inode_release_metadata():

	if (qgroup_free)
		btrfs_qgroup_free_meta_prealloc(node->root,
				node->bytes_reserved);
	else
		btrfs_qgroup_convert_reserved_meta(node->root,
				node->bytes_reserved);

This means, we're always releasing more qgroup metadata rsv than we have
reserved.

This won't trigger selftest warning, as btrfs qgroup metadata rsv has
extra protection against cases like quota enabled half-way.

But we still need to fix this problem any way.

This patch will use the same num_bytes for qgroup metadata rsv so we
could handle it correctly.

Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Anand Jain
4ebdad0512 btrfs: fix replace of seed device
[ Upstream commit c6a5d954950c5031444173ad2195efc163afcac9 ]

If you replace a seed device in a sprouted fs, it appears to have
successfully replaced the seed device, but if you look closely, it
didn't.  Here is an example.

  $ mkfs.btrfs /dev/sda
  $ btrfstune -S1 /dev/sda
  $ mount /dev/sda /btrfs
  $ btrfs device add /dev/sdb /btrfs
  $ umount /btrfs
  $ btrfs device scan --forget
  $ mount -o device=/dev/sda /dev/sdb /btrfs
  $ btrfs replace start -f /dev/sda /dev/sdc /btrfs
  $ echo $?
  0

  BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
  BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished

  $ btrfs fi show
  Label: none  uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
	  Total devices 2 FS bytes used 256.00KiB
	  devid    1 size 3.00GiB used 520.00MiB path /dev/sdc
	  devid    2 size 3.00GiB used 896.00MiB path /dev/sdb

  Label: none  uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
	  Total devices 1 FS bytes used 128.00KiB
	  devid    1 size 3.00GiB used 536.00MiB path /dev/sda

So as per the replace start command and kernel log replace was successful.
Now let's try to clean mount.

  $ umount /btrfs
  $ btrfs device scan --forget

  $ mount -o device=/dev/sdc /dev/sdb /btrfs
  mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.

  [  636.157517] BTRFS error (device sdc): failed to read chunk tree: -2
  [  636.180177] BTRFS error (device sdc): open_ctree failed

That's because per dev items it is still looking for the original seed
device.

 $ btrfs inspect-internal dump-tree -d /dev/sdb

	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
		devid 1 total_bytes 3221225472 bytes_used 545259520
		io_align 4096 io_width 4096 sector_size 4096 type 0
		generation 6 start_offset 0 dev_group 0
		seek_speed 0 bandwidth 0
		uuid 59368f50-9af2-4b17-91da-8a783cc418d4  <--- seed uuid
		fsid 10bd3202-0415-43af-96a8-d5409f310a7e  <--- seed fsid
	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
		devid 2 total_bytes 3221225472 bytes_used 939524096
		io_align 4096 io_width 4096 sector_size 4096 type 0
		generation 0 start_offset 0 dev_group 0
		seek_speed 0 bandwidth 0
		uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a  <- sprout uuid
		fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid

But the replaced target has the following uuid+fsid in its superblock
which doesn't match with the expected uuid+fsid in its devitem.

  $ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid'
  generation	20
  dev_item.uuid	59368f50-9af2-4b17-91da-8a783cc418d4
  dev_item.fsid	ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match]
  dev_item.devid	1

So if you provide the original seed device the mount shall be
successful.  Which so long happening in the test case btrfs/163.

  $ btrfs device scan --forget
  $ mount -o device=/dev/sda /dev/sdb /btrfs

Fix in this patch:
If a seed is not sprouted then there is no replacement of it, because of
its read-only filesystem with a read-only device. Similarly, in the case
of a sprouted filesystem, the seed device is still read only. So, mark
it as you can't replace a seed device, you can only add a new device and
then delete the seed device. If replace is attempted then returns
-EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-05 11:43:19 +01:00
Josef Bacik
1bf467fdfe btrfs: take overcommit into account in inc_block_group_ro
commit a30a3d2067536cbcce26c055e70cc3a6ae4fd45c upstream

inc_block_group_ro does a calculation to see if we have enough room left
over if we mark this block group as read only in order to see if it's ok
to mark the block group as read only.

The problem is this calculation _only_ works for data, where our used is
always less than our total.  For metadata we will overcommit, so this
will almost always fail for metadata.

Fix this by exporting btrfs_can_overcommit, and then see if we have
enough space to remove the remaining free space in the block group we
are trying to mark read only.  If we do then we can mark this block
group as read only.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-17 10:11:21 +02:00
Josef Bacik
39c5eb1482 btrfs: don't pass system_chunk into can_overcommit
commit 9f246926b4d5db4c5e8c78e4897757de26c95be6 upstream

We have the space_info, we can just check its flags to see if it's the
system chunk space info.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-17 10:11:21 +02:00
Qu Wenruo
c0f3c53869 btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation
commit 6d4572a9d71d5fc2affee0258d8582d39859188c upstream.

[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.

The following script can reproduce it pretty easily:
  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  umount $dev &> /dev/null
  umount $mnt &> /dev/null

  mkfs.btrfs -f $dev -b 1G
  mount -o nospace_cache $dev $mnt
  touch $mnt/foobar
  chattr +C $mnt/foobar

  xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
  xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
  sync

  xfs_io -c "fpunch 0 2k" $mnt/foobar
  umount $mnt

Currently this will fail at the fpunch part.

[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.

Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().

[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.

Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.

This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.

Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 10:33:00 +02:00
Filipe Manana
e531fd7f8b btrfs: fix RWF_NOWAIT write not failling when we need to cow
commit 260a63395f90f67d6ab89e4266af9e3dc34a77e9 upstream.

If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/sdj/bar
  $ chattr +C /mnt/sdj/bar

  $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
  wrote 262144/262144 bytes at offset 0
  256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)

  $ xfs_io -c "fpunch 64K 64K" /mnt/bar
  $ sync

  $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
  wrote 131072/131072 bytes at offset 0
  128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)

This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.

Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.

Fixes: edf064e7c6fec3 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 10:33:00 +02:00
Qu Wenruo
1f90600e25 btrfs: Ensure we trim ranges across block group boundary
commit 6b7faadd985c990324b5b5bd18cc4ba5c395eb65 upstream.

[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type:		bbio->map_type, in above case, it's SINGLE DATA.
start:		Logical address of this trim
len:		Logical length of this trim
trimmed:	Physically trimmed bytes
ratio:		trimmed / len

Thus leaving some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:

      |<---    BG1    --->|<---      BG2     --->|
      |//////|<--   Range to discard   --->|/////|

To discard that range, we have the following calls:

  btrfs_discard_extent()
  |- btrfs_map_block()
  |  Returned bbio will end at BG1's end. As btrfs_map_block()
  |  never returns result across block group boundary.
  |- btrfs_issuse_discard()
     Issue discard for each stripe.

So we will only discard the range in BG1, not the remaining part in BG2.

Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
  btrfs_map_block() uses its @length paramter to notify the caller how
  many bytes are mapped in current call.
  With __btrfs_map_block_for_discard() also modifing the @length,
  btrfs_discard_extent() now understands when to do extra trim.

- Call btrfs_map_block() in a loop until we hit the range end Since we
  now know how many bytes are mapped each time, we can iterate through
  each block group boundary and issue correct trim for each range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 10:33:00 +02:00