Commit Graph

145 Commits

Author SHA1 Message Date
c875ac0a43 stable/btrfs: fix backport bug in d819d97ea0 ("btrfs: honor path->skip_locking in backref code")
Upstream commit 38e3eebff6 ("btrfs: honor path->skip_locking in
backref code") was incorrectly backported to 4.14.y . It misses removal
of two lines from original commit, what cause deadlock.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203993
Reported-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr>
Fixes: d819d97ea0 ("btrfs: honor path->skip_locking in backref code")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-10 09:54:43 +02:00
d819d97ea0 btrfs: honor path->skip_locking in backref code
commit 38e3eebff6 upstream.

Qgroups will do the old roots lookup at delayed ref time, which could be
while walking down the extent root while running a delayed ref.  This
should be fine, except we specifically lock eb's in the backref walking
code irrespective of path->skip_locking, which deadlocks the system.
Fix up the backref code to honor path->skip_locking, nobody will be
modifying the commit_root when we're searching so it's completely safe
to do.

This happens since fb235dc06f ("btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans"), kernel may lockup with quota
enabled.

There is one backref trace triggered by snapshot dropping along with
write operation in the source subvolume.  The example can be reliably
reproduced:

  btrfs-cleaner   D    0  4062      2 0x80000000
  Call Trace:
   schedule+0x32/0x90
   btrfs_tree_read_lock+0x93/0x130 [btrfs]
   find_parent_nodes+0x29b/0x1170 [btrfs]
   btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
   btrfs_find_all_roots+0x57/0x70 [btrfs]
   btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
   btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
   btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
   do_walk_down+0x541/0x5e3 [btrfs]
   walk_down_tree+0xab/0xe7 [btrfs]
   btrfs_drop_snapshot+0x356/0x71a [btrfs]
   btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
   cleaner_kthread+0x12b/0x160 [btrfs]
   kthread+0x112/0x130
   ret_from_fork+0x27/0x50

When dropping snapshots with qgroup enabled, we will trigger backref
walk.

However such backref walk at that timing is pretty dangerous, as if one
of the parent nodes get WRITE locked by other thread, we could cause a
dead lock.

For example:

           FS 260     FS 261 (Dropped)
            node A        node B
           /      \      /      \
       node C      node D      node E
      /   \         /  \        /     \
  leaf F|leaf G|leaf H|leaf I|leaf J|leaf K

The lock sequence would be:

      Thread A (cleaner)             |       Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B)                        |
write_lock(D)                        |
^^^ called by walk_down_tree()       |
                                     |       write_lock(A)
                                     |       write_lock(D) << Stall
read_lock(H) << for backref walk     |
read_lock(D) << lock owner is        |
                the same thread A    |
                so read lock is OK   |
read_lock(A) << Stall                |

So thread A hold write lock D, and needs read lock A to unlock.
While thread B holds write lock A, while needs lock D to unlock.

This will cause a deadlock.

This is not only limited to snapshot dropping case.  As the backref
walk, even only happens on commit trees, is breaking the normal top-down
locking order, makes it deadlock prone.

Fixes: fb235dc06f ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: David Sterba <dsterba@suse.com>
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ rebase to latest branch and fix lock assert bug in btrfs/007 ]
[ backport to linux-4.19.y branch, solve minor conflicts ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy logs and deadlock analysis from Qu's patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31 06:47:12 -07:00
f53282d72f Btrfs: do not start a transaction at iterate_extent_inodes()
commit bfc61c3626 upstream.

When finding out which inodes have references on a particular extent, done
by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
v1 and v2) ioctl and from scrub we use the transaction join API to grab a
reference on the currently running transaction, since in order to give
accurate results we need to inspect the delayed references of the currently
running transaction.

However, if there is currently no running transaction, the join operation
will create a new transaction. This is inefficient as the transaction will
eventually be committed, doing unnecessary IO and introducing a potential
point of failure that will lead to a transaction abort due to -ENOSPC, as
recently reported [1].

That's because the join, creates the transaction but does not reserve any
space, so when attempting to update the root item of the root passed to
btrfs_join_transaction(), during the transaction commit, we can end up
failling with -ENOSPC. Users of a join operation are supposed to actually
do some filesystem changes and reserve space by some means, which is not
the case of iterate_extent_inodes(), it is a read-only operation for all
contextes from which it is called.

The reported [1] -ENOSPC failure stack trace is the following:

 heisenberg kernel: ------------[ cut here ]------------
 heisenberg kernel: BTRFS: Transaction aborted (error -28)
 heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
 heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
 heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
 heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
 heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 heisenberg kernel: Call Trace:
 heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
 heisenberg kernel:  ? _cond_resched+0x15/0x30
 heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
 heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
 heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
 heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
 heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
 heisenberg kernel:  kthread+0x112/0x130
 heisenberg kernel:  ? kthread_bind+0x30/0x30
 heisenberg kernel:  ret_from_fork+0x35/0x40
 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---

So fix that by using the attach API, which does not create a transaction
when there is currently no running transaction.

[1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 18:50:18 +02:00
91f70f3831 Btrfs: do not start a transaction during fiemap
commit 03628cdbc6 upstream.

During fiemap, for regular extents (non inline) we need to check if they
are shared and if they are, set the shared bit. Checking if an extent is
shared requires checking the delayed references of the currently running
transaction, since some reference might have not yet hit the extent tree
and be only in the in-memory delayed references.

However we were using a transaction join for this, which creates a new
transaction when there is no transaction currently running. That means
that two more potential failures can happen: creating the transaction and
committing it. Further, if no write activity is currently happening in the
system, and fiemap calls keep being done, we end up creating and
committing transactions that do nothing.

In some extreme cases this can result in the commit of the transaction
created by fiemap to fail with ENOSPC when updating the root item of a
subvolume tree because a join does not reserve any space, leading to a
trace like the following:

 heisenberg kernel: ------------[ cut here ]------------
 heisenberg kernel: BTRFS: Transaction aborted (error -28)
 heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
 heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
 heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
 heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
 heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 heisenberg kernel: Call Trace:
 heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
 heisenberg kernel:  ? _cond_resched+0x15/0x30
 heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
 heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
 heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
 heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
 heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
 heisenberg kernel:  kthread+0x112/0x130
 heisenberg kernel:  ? kthread_bind+0x30/0x30
 heisenberg kernel:  ret_from_fork+0x35/0x40
 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---

Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
a transaction join to avoid the overhead of creating a new transaction (if
there is currently no running transaction) and introducing a potential
point of failure when the new transaction gets committed, instead use a
transaction attach to grab a handle for the currently running transaction
if any.

Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Fixes: afce772e87 ("btrfs: fix check_shared for fiemap ioctl")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 18:50:18 +02:00
d35115930d btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
commit c8195a7b1a upstream.

Until v4.14, this warning was very infrequent:

	WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
	Modules linked in: [...]
	CPU: 3 PID: 18172 Comm: bees Tainted: G      D W    L  4.11.9-zb64+ #1
	Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101    12/02/2014
	Call Trace:
	 dump_stack+0x85/0xc2
	 __warn+0xd1/0xf0
	 warn_slowpath_null+0x1d/0x20
	 find_parent_nodes+0xc41/0x14e0
	 __btrfs_find_all_roots+0xad/0x120
	 ? extent_same_check_offsets+0x70/0x70
	 iterate_extent_inodes+0x168/0x300
	 iterate_inodes_from_logical+0x87/0xb0
	 ? iterate_inodes_from_logical+0x87/0xb0
	 ? extent_same_check_offsets+0x70/0x70
	 btrfs_ioctl+0x8ac/0x2820
	 ? lock_acquire+0xc2/0x200
	 do_vfs_ioctl+0x91/0x700
	 ? __fget+0x112/0x200
	 SyS_ioctl+0x79/0x90
	 entry_SYSCALL_64_fastpath+0x23/0xc6
	 ? trace_hardirqs_off_caller+0x1f/0x140

Starting with v4.14 (specifically 86d5f99442 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.

Replace the WARN_ON() with a comment rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadolski@suse.de> on the linux-btrfs mailing list.

Fixes: 8da6d5815c ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21 12:06:44 +01:00
7e7fbff126 btrfs: add missing initialization in btrfs_check_shared
commit 18bf591ba9 upstream.

This patch addresses an issue that causes fiemap to falsely
report a shared extent.  The test case is as follows:

xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io  -c "fiemap -v" /media/scratch/file5

which gives the resulting output:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128 0x2001
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.

With proper re-initialization the test result is as follows:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

which corrects the regression.

Fixes: 3ec4d3238a ("btrfs: allow backref search checks for shared extents")
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
[ add text from cover letter to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21 12:06:44 +01:00
3de28d579e Btrfs: convert to use btrfs_get_extent_inline_ref_type
Since we have a helper which can do sanity check, this converts all
btrfs_extent_inline_ref_type to it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
01747e92a9 btrfs: clean up extraneous computations in add_delayed_refs
Repeating the same computation in multiple places is not
necessary.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
3ec4d3238a btrfs: allow backref search checks for shared extents
When called with a struct share_check, find_parent_nodes()
will detect a shared extent and immediately return with
BACKREF_SHARED_FOUND.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
9dd14fd696 btrfs: add cond_resched() calls when resolving backrefs
Since backref resolution is CPU-intensive, the cond_resched calls
should help alleviate soft lockup occurences.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
00142756e1 btrfs: backref, add tracepoints for prelim_ref insertion and merging
This patch adds a tracepoint event for prelim_ref insertion and
merging.  For each, the ref being inserted or merged and the count
of tree nodes is issued.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
6c336b212b btrfs: add a node counter to each of the rbtrees
This patch adds counters to each of the rbtrees so that we can tell
how large they are growing for a given workload.  These counters
will be exported by tracepoints in the next patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
86d5f99442 btrfs: convert prelimary reference tracking to use rbtrees
It's been known for a while that the use of multiple lists
that are periodically merged was an algorithmic problem within
btrfs.  There are several workloads that don't complete in any
reasonable amount of time (e.g. btrfs/130) and others that cause
soft lockups.

The solution is to use a set of rbtrees that do insertion merging
for both indirect and direct refs, with the former converting
refs into the latter.  The result is a btrfs/130 workload that
used to take several hours now takes about half of that. This
runtime still isn't acceptable and a future patch will address that
by moving the rbtrees higher in the stack so the lookups can be
shared across multiple calls to find_parent_nodes.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:11:55 +02:00
f6954245d9 btrfs: remove ref_tree implementation from backref.c
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
the ref_tree code in backref.c to reduce backref searching for
shared extents under the FIEMAP ioctl. This code will not be
compatible with the upcoming rbtree changes for improved backref
searching, so this patch removes the ref_tree code.  The rbtree
changes will provide the equivalent functionality for FIEMAP.

The above commit also introduced transaction semantics around calls to
btrfs_check_shared() in order to accurately account for delayed refs.
This functionality needs to be retained, so a complete revert of the
above commit is not desirable. This patch therefore removes the
ref_tree portion of the commit as above, however it does not remove
the transaction portion.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
bb739cf08e btrfs: btrfs_check_shared should manage its own transaction
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
e0c476b128 btrfs: backref, cleanup __ namespace abuse
We typically use __ to indicate a helper routine that shouldn't be
called directly without understanding the proper context required
to do so.  We use static functions to indicate that a function is
private to a particular C file.  The backref code uses static
function and __ prefixes on nearly everything, which makes the code
difficult to read and establishes a pattern for future code that
shouldn't be followed.  This patch drops all the unnecessary prefixes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
4dae077a83 btrfs: backref, add unode_aux_to_inode_list helper
Replacing the double cast and ternary conditional with a helper makes
the code easier on the eyes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
73980becae btrfs: backref, constify some arguments
This constifies a few buffers used in the backref code.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
f54de068dd btrfs: use GFP_KERNEL in init_ipath
Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_path and init_data_container.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
de47c9d3ff btrfs: replace hardcoded value with SEQ_LAST macro
Define the SEQ_LAST macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
f58d88b336 btrfs: provide enumeration for __merge_refs mode argument
Replace hardcoded numeric values for __merge_refs 'mode' argument
with descriptive constants.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
6df8cdf5bd btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
eeac44cb49 btrfs: remove unused parameter from __add_inline_refs
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
f72ad18e99 Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head
All we need is @delayed_refs, all callers have get it ahead of calling
btrfs_find_delayed_ref_head since lock needs to be acquired firstly,
there is no reason to deference it again inside the function.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
3a45bb207e btrfs: remove root parameter from transaction commit/end routines
Now we only use the root parameter to print the root objectid in
a tracepoint.  We can use the root parameter from the transaction
handle for that.  It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
2ff7e61e0d btrfs: take an fs_info directly when the root is not used otherwise
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer.  Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
da17066c40 btrfs: pull node/sector/stripe sizes out of root and into fs_info
We track the node sizes per-root, but they never vary from the values
in the superblock.  This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
ab8d0fc48d btrfs: convert pr_* to btrfs_* where possible
For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:04 +02:00
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
afce772e87 btrfs: fix check_shared for fiemap ioctl
Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
d8422ba334 btrfs: backref: Fix soft lockup in __merge_refs function
When over 1000 file extents refers to one extent, find_parent_nodes()
will be obviously slow, due to the O(n^2)~O(n^3) loops inside
__merge_refs().

The following ftrace shows the cubic growth of execution time:

256 refs
 5) + 91.768 us   |  __add_keyed_refs.isra.12 [btrfs]();
 5)   1.447 us    |  __add_missing_keys.isra.13 [btrfs]();
 5) ! 114.544 us  |  __merge_refs [btrfs]();
 5) ! 136.399 us  |  __merge_refs [btrfs]();

512 refs
 6) ! 279.859 us  |  __add_keyed_refs.isra.12 [btrfs]();
 6)   3.164 us    |  __add_missing_keys.isra.13 [btrfs]();
 6) ! 442.498 us  |  __merge_refs [btrfs]();
 6) # 2091.073 us |  __merge_refs [btrfs]();

and 1024 refs
 7) ! 368.683 us  |  __add_keyed_refs.isra.12 [btrfs]();
 7)   4.810 us    |  __add_missing_keys.isra.13 [btrfs]();
 7) # 2043.428 us |  __merge_refs [btrfs]();
 7) * 18964.23 us |  __merge_refs [btrfs]();

And sort them into the following char:
(Unit: us)
------------------------------------------------------------------------
 Trace function        | 256 ref        | 512 refs      | 1024 refs    |
------------------------------------------------------------------------
 __add_keyed_refs      | 91             | 249           | 368          |
 __add_missing_keys    | 1              | 3             | 4            |
 __merge_refs 1st call | 114            | 442           | 2043         |
 __merge_refs 2nd call | 136            | 2091          | 18964        |
------------------------------------------------------------------------

We can see the that __add_keyed_refs() grows almost in linear behavior.
And __add_missing_keys() in this case doesn't change much or takes much
time.

While for the 1st __merge_refs() it's square growth
for the 2nd __merge_refs() call it's cubic growth.

It's no doubt that merge_refs() will take a long long time to execute if
the number of refs continues its grows.

So add a cond_resced() into the loop of __merge_refs().

Although this will solve the problem of soft lockup, we need to use the
new rb_tree based structure introduced by Lu Fengqi to really solve the
long execution time.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:16 -07:00
f5ee5c9ac5 btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
Now that we have a dummy fs_info associated with each test that
uses a root, we don't need the DUMMY_ROOT bit anymore.  This lets
us make choices without needing an actual root like in e.g.
btrfs_find_create_tree_block.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:19 +02:00
fba4b69771 btrfs: Fix slab accounting flags
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.

To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
42f31734eb Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
0132761017 btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 22:35:14 +02:00
72928f2476 Btrfs: fix fspath error deallocation
Make sure to deallocate fspath with vfree() in case of error in
init_ipath().

fspath is allocated with vmalloc() in init_data_container() since
commit 425d17a290 ("Btrfs: use larger limit for translation of logical to
inode").

Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 16:22:26 +02:00
f004fae0cf Merge branch 'cleanups-4.6' into for-chris-4.6 2016-02-26 15:38:33 +01:00
5598e9005a btrfs: drop null testing before destroy functions
Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
8f682f6955 btrfs: remove open-coded swap() in backref.c:__merge_refs
The kernel provides a swap() that does the same thing as this code.

Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:45:55 +01:00
0c0fe3b0fa Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
While doing some tests I ran into an hang on an extent buffer's rwlock
that produced the following trace:

[39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
[39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
[39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800016] irq event stamp: 0
[39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
[39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
[39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
[39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
[39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
[39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
[39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
[39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
[39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
[39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800016] Stack:
[39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
[39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
[39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
[39389.800016] Call Trace:
[39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
[39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
[39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
[39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
[39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
[39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
[39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
[39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
[39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800012] irq event stamp: 0
[39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
[39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
[39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
[39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
[39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
[39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
[39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
[39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
[39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
[39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
[39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800012] Stack:
[39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
[39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
[39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
[39389.800012] Call Trace:
[39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
[39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
[39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
[39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
[39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
[39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
[39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
[39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
[39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
[39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
[39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
[39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
[39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
[39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
[39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
[39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
[39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00

This happens because in the code path executed by the inode_paths ioctl we
end up nesting two calls to read lock a leaf's rwlock when after the first
call to read_lock() and before the second call to read_lock(), another
task (running the delayed items as part of a transaction commit) has
already called write_lock() against the leaf's rwlock. This situation is
illustrated by the following diagram:

         Task A                       Task B

  btrfs_ref_to_path()               btrfs_commit_transaction()
    read_lock(&eb->lock);

                                      btrfs_run_delayed_items()
                                        __btrfs_commit_inode_delayed_items()
                                          __btrfs_update_delayed_inode()
                                            btrfs_lookup_inode()

                                              write_lock(&eb->lock);
                                                --> task waits for lock

    read_lock(&eb->lock);
    --> makes this task hang
        forever (and task B too
	of course)

So fix this by avoiding doing the nested read lock, which is easily
avoidable. This issue does not happen if task B calls write_lock() after
task A does the second call to read_lock(), however there does not seem
to exist anything in the documentation that mentions what is the expected
behaviour for recursive locking of rwlocks (leaving the idea that doing
so is not a good usage of rwlocks).

Also, as a side effect necessary for this fix, make sure we do not
needlessly read lock extent buffers when the input path has skip_locking
set (used when called from send).

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-05 02:26:25 +00:00
8e217858ee btrfs: fix iterator with update error in backref.c
Fix the following error:

fs/btrfs/backref.c:565:1-20: iterator with update on line 577

Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:27:18 +01:00
a7ca42256d btrfs: use list_for_each_entry* in backref.c
Use list_for_each_entry*() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:42:46 +01:00
2d9e977610 Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0.  This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots.  So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:22:08 -08:00
2849a85422 btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:28 -07:00
d9ee522ba3 Btrfs: fix qgroup sanity tests
With my changes to allow us to find old roots when resolving indirect refs I
introduced a regression to the sanity tests.  Since we don't really care to go
down into the fs roots we just need to have the old behavior of returning ENOENT
for dummy roots for the sanity tests.  In the future if we want to get fancy we
can populate the test fs trees with the references as well.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:41 -07:00
acdf898de8 Btrfs: fix warning in backref walking
When we do backref walking, we search firstly in queued delayed refs
and then the on-disk backrefs, but we parse differently for shared
references, for delayed refs we also add 'ref->root' while for on-disk
backrefs we don't, this can prevent us from merging refs indexed
by the same bytenr and cause find_parent_nodes() to throw a warning at
'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
that happens.

For shared references, no matter if it's delayed or on-disk, ref->root is
not at all used, instead it's ref->parent that really matters, so this has
delayed refs handled as the same way as on-disk refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:33:50 -07:00
d6589101b6 Btrfs: teach backref walking about backrefs with underflowed offset values
When cloning/deduplicating file extents (through the clone and extent_same
ioctls) we can get data back references with offset values that are a
result of an unsigned integer arithmetic underflow, that is, values that
are much larger then they could be otherwise.

This is not a problem when decrementing or dropping the back references
(happens when we overwrite the extents or punch a hole for example, through
__btrfs_drop_extents()), since we compute the same too large offset value,
but it is a problem for the backref walking code, used by an incremental
send and the ioctls that are used by the btrfs tool "inspect-internal"
commands, as it makes it miss the corresponding file extent items because
the search key is set for an extent item that starts at an offset matching
the exceptionally large offset value of the data back reference. For an
incremental send this causes the send ioctl to fail with -EIO.

So teach the backref walking code to deal with these cases by setting the
search key's offset to 0 if the backref's offset value is larger than
LLONG_MAX (the largest possible file offset). This makes sure the backref
walking code finds the corresponding file extent items at the expense of
scanning more items and leafs in the btree.

Fixing the clone/dedup ioctls to not produce such underflowed results would
require major changes breaking backward compatibility, updating user space
tools, etc.

Simple reproducer case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K starting at file
  # offset 128K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1

  # Now clone parts of the original extent into lower offsets of the file.
  #
  # The first clone operation adds a file extent item to file offset 0
  # that points to our initial extent with a data offset of 16K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709535232, which is the result of file_offset - data_offset
  # = 0 - 16K.
  #
  # The second clone operation adds a file extent item to file offset 16K
  # that points to our initial extent with a data offset of 48K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709518848, which is the result of file_offset - data_offset
  # = 16K - 48K.
  #
  # Those large back reference offsets (result of unsigned arithmetic
  # underflow) confused the back reference walking code (used by an
  # incremental send and the multiple inspect-internal ioctls) and made it
  # miss the back references, which for the case of an incremental send it
  # made it fail with -EIO and print a message like the following to
  # dmesg:
  #
  # "BTRFS error (device sdc): did not find backref in send_root. \
  #  inode=257, offset=0, disk_byte=12845056 found extent=12845056"
  #
  $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo
  $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
      -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $send_files_dir/2.snap

  echo "File digest in the original filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  status=0
  exit

The test's expected golden output is:

  wrote 65536/65536 bytes at offset 131072
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File digest in the original filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
  File digest in the new filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo

But it failed with:

    (...)
    @@ -1,7 +1,5 @@
     QA output created by 097
     wrote 65536/65536 bytes at offset 131072
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -File digest in the original filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    -File digest in the new filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    ...

  $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
  (...)
  ERROR: send ioctl failed with -5: Input/output error

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:00 -07:00
21633fc603 btrfs: backref: Add special time_seq == (u64)-1 case for
btrfs_find_all_roots().

Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree
lock to do tree search.

This is important for later qgroup implement which will call
find_all_roots() after fs trees are committed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:45 -07:00
c6fc245499 btrfs: delayed-ref: Use list to replace the ref_root in ref_head.
This patch replace the rbtree used in ref_head to list.
This has the following advantage:
1) Easier merge logic.
With the new list implement, we only need to care merging the tail
ref_node with the new ref_node.
And this can be done quite easy at insert time, no need to do a
indicated merge at run_delayed_refs().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:03 -07:00
00db646d3f btrfs: backref: Don't merge refs which are not for same block.
Old __merge_refs() in backref.c will even merge refs whose root_id are
different, which makes qgroup gives wrong result.

Fix it by checking ref_for_same_block() before any mode specific works.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:24:59 -07:00