35608 Commits

Author SHA1 Message Date
Justin Maggard
2c6a92b009 btrfs: wake up transaction thread upon remount
Now that we can adjust the commit interval with a remount, we need
to wake up the transaction thread or else he will continue to sleep
until the previous transaction interval has elapsed before waking
up.  So, if we go from a large commit interval to something smaller,
the transaction thread will not wake up until the large interval has
expired.  This also causes the cleaner thread to stay sleeping, since
it gets woken up by the transaction thread.

Fix it by simply waking up the transaction thread during a remount.

Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:45 -04:00
Miao Xie
50471a388c Btrfs: stop joining the log transaction if sync log fails
If the log sync fails, there is something wrong in the log tree, we
should not continue to join the log transaction and log the metadata.
What we should do is to do a full commit.

This patch fixes this problem by setting ->last_trans_log_full_commit
to the current transaction id, it will tell the tasks not to join
the log transaction, and do a full commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:44 -04:00
Miao Xie
d1433debe7 Btrfs: just wait or commit our own log sub-transaction
We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.

This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Miao Xie
8b050d350c Btrfs: fix skipped error handle when log sync failed
It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.

This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Miao Xie
bb14a59b61 Btrfs: use signed integer instead of unsigned long integer for log transid
The log trans id is initialized to be 0 every time we create a log tree,
and the log tree need be re-created after a new transaction is started,
it means the log trans id is unlikely to be a huge number, so we can use
signed integer instead of unsigned long integer to save a bit space.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:42 -04:00
Miao Xie
7483e1a446 Btrfs: remove unnecessary memory barrier in btrfs_sync_log()
Mutex unlock implies certain memory barriers to make sure all the memory
operation completes before the unlock, and the next mutex lock implies memory
barriers to make sure the all the memory happens after the lock. So it is
a full memory barrier(smp_mb), we needn't add memory barriers. Remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:41 -04:00
Miao Xie
e87ac13687 Btrfs: don't start the log transaction if the log tree init fails
The old code would start the log transaction even the log tree init
failed, it was unnecessary. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:40 -04:00
Miao Xie
48cab2e071 Btrfs: fix the skipped transaction commit during the file sync
We may abort the wait earlier if ->last_trans_log_full_commit was set to
the current transaction id, at this case, we need commit the current
transaction instead of the log sub-transaction. But the current code
didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:40 -04:00
Miao Xie
5c902ba622 Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit
->last_trans_log_full_commit may be changed by the other tasks without lock,
so we need prevent the compiler from the optimize access just like
	tmp = fs_info->last_trans_log_full_commit
	if (tmp == ...)
		...

	<do something>

	if (tmp == ...)
		...

In fact, we need get the new value of ->last_trans_log_full_commit during
the second access. Fix it by ACCESS_ONCE().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:39 -04:00
Liu Bo
7813b3db0a Btrfs: avoid warning bomb of btrfs_invalidate_inodes
So after transaction is aborted, we need to cleanup inode resources by
calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
roots' refs to be zero in old times and sets a WARN_ON(), however, this
is not always true within cleaning up transaction, so we get to detect
transaction abortion and not warn at all.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:38 -04:00
Liu Bo
2a85d9cac1 Btrfs: fix possible deadlock in btrfs_cleanup_transaction
[13654.480669] ======================================================
[13654.480905] [ INFO: possible circular locking dependency detected ]
[13654.481003] 3.12.0+ #4 Tainted: G        W  O
[13654.481060] -------------------------------------------------------
[13654.481060] btrfs-transacti/9347 is trying to acquire lock:
[13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060] but task is already holding lock:
[13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
[13654.481060] which lock already depends on the new lock.

[13654.481060] the existing dependency chain (in reverse order) is:
[13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
[13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
[13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
[13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
[13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
[13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
[13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
[13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
[13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
[13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
[13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
[13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
[13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
[13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
[13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
[13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
[13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
[13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
[13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
[13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
[13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
[13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
[13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
[13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
[13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
[13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[13654.481060] other info that might help us debug this:

[13654.481060]  Possible unsafe locking scenario:

[13654.481060]        CPU0                    CPU1
[13654.481060]        ----                    ----
[13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]
 *** DEADLOCK ***
[...]

======================================================

btrfs_destroy_all_ordered_extents()
gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
while btrfs_[add,remove]_ordered_extent()
acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.

This patch fixes the above problem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:37 -04:00
Filipe David Borba Manana
d5f375270a Btrfs: faster/more efficient insertion of file extent items
This is an extension to my previous commit titled:

  "Btrfs: faster file extent item replace operations"
  (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

Instead of inserting the new file extent item if we deleted existing
file extent items covering our target file range, also allow to insert
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0, since in this case our caller would do another
tree search to insert the new file extent item anyway, therefore just
combine the two tree searches into a single one, saving cpu time, reducing
lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:37 -04:00
Stanislaw Gruszka
51b98effa4 btrfs: always choose work from prio_head first
In case we do not refill, we can overwrite cur pointer from prio_head
by one from not prioritized head, what looks as something that was
not intended.

This change make we always take works from prio_head first until it's
not empty.

Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:36 -04:00
Wang Shilong
dcfd5ad2fc Revert "Btrfs: remove transaction from btrfs send"
This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
Previously i was thinking we can use readonly root's commit root
safely while it is not true, readonly root may be cowed with the
following cases.

1.snapshot send root will cow source root.
2.balance,device operations will also cow readonly send root
to relocate.

So i have two ideas to make us safe to use commit root.

-->approach 1:
make it protected by transaction and end transaction properly and we research
next item from root node(see btrfs_search_slot_for_read()).

-->approach 2:
add another counter to local root structure to sync snapshot with send.
and add a global counter to sync send with exclusive device operations.

So with approach 2, send can use commit root safely, because we make sure
send root can not be cowed during send. Unfortunately, it make codes *ugly*
and more complex to maintain.

To make snapshot and send exclusively, device operations and send operation
exclusively with each other is a little confusing for common users.

So why not drop into previous way.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:35 -04:00
Wang Shilong
bcbba5e659 Btrfs: skip readonly root for snapshot-aware defragment
Btrfs send is assuming readonly root won't change, let's skip readonly root.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:34 -04:00
Wang Shilong
850a8cdffe Btrfs: switch to btrfs_previous_extent_item()
Since we have introduced btrfs_previous_extent_item() to search previous
extent item, just switch into it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:54 -04:00
Hidetoshi Seto
f88ba6a2a4 Btrfs: skip submitting barrier for missing device
I got an error on v3.13:
 BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)

how to reproduce:
  > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
  > wipefs -a /dev/sdf2
  > mount -o degraded /dev/sdf1 /mnt
  > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt

The reason of the error is that barrier_all_devices() failed to submit
barrier to the missing device.  However it is clear that we cannot do
anything on missing device, and also it is not necessary to care chunks
on the missing device.

This patch stops sending/waiting barrier if device is missing.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:53 -04:00
Josef Bacik
29bce2f399 Btrfs: unlock extent and pages on error in cow_file_range
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
made it so we just return an error instead of unlocking all of our various
stuff.  This is a mistake and causes us to hang when we run into this.  This
patch fixes this problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:53 -04:00
Josef Bacik
c581afc8db Btrfs: balance delayed inode updates
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files.  So balance delayed inode
updates everytime we create a new inode.  With this patch we no longer use up
all of our ram with delayed inode updates.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:52 -04:00
David Sterba
1bae30982b btrfs: add simple debugfs interface
Help during debugging to export various interesting infromation and
tunables without the need of extra mount options or ioctls.

Usage:
* declare your variable in sysfs.h, and include where you need it
* define the variable in sysfs.c and make it visible via
  debugfs_create_TYPE

Depends on CONFIG_DEBUG_FS.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:51 -04:00
David Sterba
ace0105076 btrfs: send: lower memory requirements in common case
The fs_path structure uses an inline buffer and falls back to a chain of
allocations, but vmalloc is not necessary because PATH_MAX fits into
PAGE_SIZE.

The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
usually 4k. Experimental measurements show that most paths on a single
filesystem do not exceed 200 bytes, and these get stored into the inline
buffer directly, which is now 230 bytes. Longer paths are kmalloced when
needed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:50 -04:00
Filipe David Borba Manana
dff6d0adbe Btrfs: make some tree searches in send.c more efficient
We have this pattern where we do search for a contiguous group of
items in a tree and everytime we find an item, we process it, then
we release our path, increment the offset of the search key, do
another full tree search and repeat these steps until a tree search
can't find more items we're interested in.

Instead of doing these full tree searches after processing each item,
just process the next item/slot in our leaf and don't release the path.
Since all these trees are read only and we always use the commit root
for a search and skip node/leaf locks, we're not affecting concurrency
on the trees.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:49 -04:00
Filipe David Borba Manana
a0859c0998 Btrfs: use right extent item position in send when finding extent clones
This was a leftover from the commit:

   74dd17fbe3d65829e75d84f00a9525b2ace93998
   (Btrfs: fix btrfs send for inline items and compression)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:48 -04:00
David Sterba
57fb8910c2 btrfs: send: remove BUG_ON from name_cache_delete
If cleaning the name cache fails, we could try to proceed at the cost of
some memory leak. This is not expected to happen often.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:48 -04:00
David Sterba
4d1a63b21b btrfs: send: remove BUG from process_all_refs
There are only 2 static callers, the BUG would normally be never
reached, but let's be nice.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:47 -04:00
David Sterba
1f5a7ff999 btrfs: send: squeeze bitfilelds in fs_path
We know that buf_len is at most PATH_MAX, 4k, and can merge it with the
reversed member. This saves 3 bytes in favor of inline_buf.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:46 -04:00
David Sterba
e25a812206 btrfs: send: remove virtual_mem member from fs_path
We don't need to keep track of that, it's available via is_vmalloc_addr.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:45 -04:00
David Sterba
b23ab57d48 btrfs: send: remove prepared member from fs_path
The member is used only to return value back from
fs_path_prepare_for_add, we can do it locally and save 8 bytes for the
inline_buf path.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:44 -04:00
David Sterba
64792f2535 btrfs: send: replace check with an assert in gen_unique_name
The buffer passed to snprintf can hold the fully expanded format string,
64 = 3x largest ULL + 3x char + trailing null.  I don't think that removing the
check entirely is a good idea, hence the ASSERT.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:44 -04:00
Filipe David Borba Manana
5ed7f9ff15 Btrfs: more send support for parent/child dir relationship inversion
The commit titled "Btrfs: fix infinite path build loops in incremental send"
didn't cover a particular case where the parent-child relationship inversion
of directories doesn't imply a rename of the new parent directory. This was
due to a simple logic mistake, a logical and instead of a logical or.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb3
  $ mount /dev/sdb3 /mnt/btrfs
  $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
  $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44
  $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44
  $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3
  $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
  $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send

A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:43 -04:00
Filipe David Borba Manana
03cb4fb9d8 Btrfs: fix send dealing with file renames and directory moves
This fixes a case that the commit titled:

   Btrfs: fix infinite path build loops in incremental send

didn't cover. If the parent-child relationship between 2 directories
is inverted, both get renamed, and the former parent has a file that
got renamed too (but remains a child of that directory), the incremental
send operation would use the file's old path after sending an unlink
operation for that old path, causing receive to fail on future operations
like changing owner, permissions or utimes of the corresponding inode.

This is not a regression from the commit mentioned before, as without
that commit we would fall into the issues that commit fixed, so it's
just one case that wasn't covered before.

Simple steps to reproduce this issue are:

      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt/btrfs
      $ mkdir -p /mnt/btrfs/a/b/c/d
      $ touch /mnt/btrfs/a/b/c/d/file
      $ mkdir -p /mnt/btrfs/a/b/x
      $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
      $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2
      $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2
      $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2
      $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
      $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send

A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:42 -04:00
Wang Shilong
98cfee2143 Btrfs: only add roots if necessary in find_parent_nodes()
find_all_leafs() dosen't need add all roots actually, add roots only
if we need, this can avoid unnecessary ulist dance.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:41 -04:00
Hugo Mills
abccd00f8a btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl
The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
and 64-bit systems. This means that it is impossible to use btrfs
receive on a system with a 64-bit kernel and 32-bit userspace, because
the structure size (and hence the ioctl number) is different.

This patch adds a compatibility structure and ioctl to deal with the
above case.

Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:40 -04:00
Filipe David Borba Manana
d86477b303 Btrfs: add missing error check in incremental send
Function wait_for_parent_move() returns negative value if an error
happened, 0 if we don't need to wait for the parent's move, and
1 if the wait is needed.
Before this change an error return value was being treated like the
return value 1, which was not correct.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:40 -04:00
Miao Xie
c404e0dc2c Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
  the mapping tree. And we forgot to replace the source device in those
  mapping of the new chunks.
- We might get the mapping information which including the source device
  before the mapping information update. And then submit the bio which was
  based on that mapping information after we freed the source device.

For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.

For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.

Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:39 -04:00
Miao Xie
391cd9df81 Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace
the alloc list of the filesystem is protected by ->chunk_mutex, we need
get that mutex when we insert the new device into the list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:38 -04:00
Kusanagi Kouichi
23ad5b17dc btrfs: Return EXDEV for cross file system snapshot
EXDEV seems an appropriate error if an operation fails bacause it
crosses file system boundaries.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:37 -04:00
Miao Xie
827463c49f Btrfs: don't mix the ordered extents of all files together during logging the inodes
There was a problem in the old code:
If we failed to log the csum, we would free all the ordered extents in the log list
including those ordered extents that were logged successfully, it would make the
log committer not to wait for the completion of the ordered extents.

This patch doesn't insert the ordered extents that is about to be logged into
a global list, instead, we insert them into a local list. If we log the ordered
extents successfully, we splice them with the global list, or we will throw them
away, then do full sync. It can also reduce the lock contention and the traverse
time of list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:36 -04:00
Thomas Gleixner
f9a8a0abc3 Merge branch 'fortglx/3.15/time' of git://git.linaro.org/people/john.stultz/linux into timers/core
- support CLOCK_BOOTTIME clock in timerfd
 - Add missing header file

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-10 19:53:09 +01:00
Al Viro
bd2a31d522 get rid of fget_light()
instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:42 -04:00
Linus Torvalds
9c225f2655 vfs: atomic f_pos accesses as per POSIX
Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
Al Viro
1b56e98990 ocfs2 syncs the wrong range...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:43:32 -04:00
Linus Torvalds
fe9ea91cde NFS client bugfixes for Linux 3.14
Highlights include:
 
 - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER
 - Fix an Oopsable delegation callback race
 - Fix another bad stateid infinite loop
 - Fail the data server I/O is the stateid represents a lost lock
 - Fix an Oopsable sunrpc trace event
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTHJSVAAoJEGcL54qWCgDyVRkP/2t43gjMF6P+Yc7VUW2e5uTv
 rHhPGFLuDVs9oS3WUYegzvThZMs//ovTaYgUSDNpOYztEB6P8bDRm41q/VgUIixY
 zWFoEplDgAZZE7gP2EJuXJv3bEdhJqXuCG2KUysqMsaIGlahrlQdHmqGTz6Y931o
 WROyMWVvnL4IoEtQHVR7DwyqkvSmifPJ8MZZv3Liy82wuw1fCsh8uy8mkYYSbdvN
 OK4JmHqdJ+CbAZ0WmE4Xe3Itqy/aIMBL9Jyrq4Zl1QX0p7ez3Xpy4XwmtlZXn2KP
 bKMfK2vP9RggagIpjUL+dhCqxlsyjlF6EzTnQRe7jXqlJ/vJ9pQF8X294jwRysfp
 80jDqsTSND4JQiZuBISID23N1nL0TzrP2tWqipR9zx5JJMRVzYZWTzEq4w2uAHgg
 aW2vTdRNRLZWydlfFNQ8FiuEPIFoQaJFmOCQisec2LtfffLZZBz7JPofjNH9CgU8
 mcbPhv75m2imXDOylydiVoD4x/myCGheYw2hpqhb1ZeuQxdN9lnwa0JzjPiP1h38
 XIYwzM7TE8WayrdkMDCeIem1dz/VexknfKmXmFXlMfn3GRKxowCSrggxKG92k0eP
 L35cJj91a9AoxMz/ej0erv0iI1flLeoYP9aJzIRtZf+SB1BZkKhmWlFRQKqnlIOA
 BzjYui4mUoEQEa5Sk7Th
 =JfQx
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER
   - Fix an Oopsable delegation callback race
   - Fix another bad stateid infinite loop
   - Fail the data server I/O is the stateid represents a lost lock
   - Fix an Oopsable sunrpc trace event"

* tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Fix oops when trace sunrpc_task events in nfs client
  NFSv4: Fail the truncate() if the lock/open stateid is invalid
  NFSv4.1 Fail data server I/O if stateid represents a lost lock
  NFSv4: Fix the return value of nfs4_select_rw_stateid
  NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid
  NFS: Fix a delegation callback race
  NFSv4: Fix another nfs4_sequence corruptor
2014-03-09 19:17:39 -07:00
Tejun Heo
b7ce40cff0 kernfs: cache atomic_write_len in kernfs_open_file
While implementing atomic_write_len, 4d3773c4bb41 ("kernfs: implement
kernfs_ops->atomic_write_len") moved data copy from userland inside
kernfs_get_active() and kernfs_open_file->mutex so that
kernfs_ops->atomic_write_len can be accessed before copying buffer
from userland; unfortunately, this could lead to locking order
inversion involving mmap_sem if copy_from_user() takes a page fault.

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G        W
  -------------------------------------------------------
  trinity-c236/10658 is trying to acquire lock:
   (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120

  but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
	 [<mm/memory.c:4188>] might_fault+0x7e/0xb0
	 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190
	 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0
	 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0
	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

 -> #0 (&of->mutex#2){+.+.+.}:
	 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
	 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
	 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
	 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
	 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
	 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
	 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
	 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&mm->mmap_sem);
				 lock(&of->mutex#2);
				 lock(&mm->mmap_sem);
    lock(&of->mutex#2);

   *** DEADLOCK ***

  1 lock held by trinity-c236/10658:
   #0:  (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

  stack backtrace:
  CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G        W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
   0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
   0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
   ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
  Call Trace:
   [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f
   [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160
   [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
   [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550
   [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
   [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
   [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0
   [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
   [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
   [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
   [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0
   [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
   [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0
   [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0
   [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
   [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
   [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

Fix it by caching atomic_write_len in kernfs_open_file during open so
that it can be determined without accessing kernfs_ops in
kernfs_fop_write().  This restores the structure of kernfs_fop_write()
before 4d3773c4bb41 with updated @len determination logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-08 22:08:29 -08:00
Richard Cochran
88391d49ab kernfs: fix off by one error.
The hash values 0 and 1 are reserved for magic directory entries, but
the code only prevents names hashing to 0. This patch fixes the test
to also prevent hash value 1.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-08 22:08:29 -08:00
Theodore Ts'o
0bfea8118d jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
It's not needed until we start trying to modifying fields in the
journal_head which are protected by j_list_lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-09 00:56:58 -05:00
Theodore Ts'o
6e4862a5bb jbd2: minimize region locked by j_list_lock in journal_get_create_access()
It's not needed until we start trying to modifying fields in the
journal_head which are protected by j_list_lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-09 00:46:23 -05:00
Theodore Ts'o
d2eb0b9989 jbd2: check jh->b_transaction without taking j_list_lock
jh->b_transaction is adequately protected for reading by the
jbd_lock_bh_state(bh), so we don't need to take j_list_lock in
__journal_try_to_free_buffer().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-09 00:07:19 -05:00
Theodore Ts'o
d4e839d4a9 jbd2: add transaction to checkpoint list earlier
We don't otherwise need j_list_lock during the rest of commit phase
#7, so add the transaction to the checkpoint list at the very end of
commit phase #6.  This allows us to drop j_list_lock earlier, which is
a good thing since it is a super hot lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08 22:34:10 -05:00
Theodore Ts'o
42cf3452d5 jbd2: calculate statistics without holding j_state_lock and j_list_lock
The two hottest locks, and thus the biggest scalability bottlenecks,
in the jbd2 layer, are the j_list_lock and j_state_lock.  This has
inspired some people to do some truly unnatural things[1].

[1] https://www.usenix.org/system/files/conference/fast14/fast14-paper_kang.pdf

We don't need to be holding both j_state_lock and j_list_lock while
calculating the journal statistics, so move those calculations to the
very end of jbd2_journal_commit_transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08 19:51:16 -05:00