5319 Commits

Author SHA1 Message Date
Dave Jones
8f682f6955 btrfs: remove open-coded swap() in backref.c:__merge_refs
The kernel provides a swap() that does the same thing as this code.

Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:45:55 +01:00
Byongho Lee
ac1407ba24 btrfs: remove redundant error check
While running btrfs_mksubvol(), d_really_is_positive() is called twice.
First in btrfs_mksubvol() and second inside btrfs_may_create().  So I
remove the first one.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:35:27 +01:00
Byongho Lee
0138b6fe8f btrfs: simplify expression in btrfs_calc_trans_metadata_size()
Simplify expression in btrfs_calc_trans_metadata_size().

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:33:17 +01:00
Josef Bacik
baee879064 Btrfs: check reserved when deciding to background flush
We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space.  We don't want to do this however when we are actually using
this space as it causes unneeded thrashing.  We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.

My tracing tool will keep count of the number of times we kick off the async
flusher, the following are counts for the entire run of generic/027

		No Patch	Patch
avg: 		5385		5009
median:		5500		4916

We skewed lower than the average with my patch and higher than the average with
the patch, overall it cuts the flushing from anywhere from 5-10%, which in the
case of actual ENOSPC is quite helpful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:29:43 +01:00
Josef Bacik
88d3a5aaf6 Btrfs: add transaction space reservation tracepoints
There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point.  With these added my tool no longer sees transaction
leaks.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:41 +01:00
Josef Bacik
dc95f7bfc5 Btrfs: fix truncate_space_check
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count.  We need a tracepoint here
so that we have the matching reserve for the release that will come later.  Also
add a comment to make clear what the intent of truncate_space_check is.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:24 +01:00
Josef Bacik
fb4b10e5d5 Btrfs: change how we update the global block rsv
I'm writing a tool to visualize the enospc system in order to help debug enospc
bugs and I found weird data and ran it down to when we update the global block
rsv.  We add all of the remaining free space to the block rsv, do a trace event,
then remove the extra and do another trace event.  This makes my visualization
look silly and is unintuitive code as well.  Fix this stuff to only add the
amount we are missing, or free the amount we are missing.  This is less clean to
read but more explicit in what it is doing, as well as only emitting events for
values that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:21:48 +01:00
Zhao Lei
7aff8cf4a6 btrfs: reada: ignore creating reada_extent for a non-existent device
For a non-existent device, old code bypasses adding it in dev's reada
queue.

And to solve problem of unfinished waitting in raid5/6,
commit 5fbc7c59fd22 ("Btrfs: fix unfinished readahead thread for
raid5/6 degraded mounting")
adding an exception for the first stripe, in short, the first
stripe will always be processed whether the device exists or not.

Actually we have a better way for the above request: just bypass
creation of the reada_extent for non-existent device, it will make
code simple and effective.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
4fe7a0e138 btrfs: reada: avoid undone reada extents in btrfs_reada_wait
Reada background works is not designed to finish all jobs
completely, it will break in following case:
1: When a device reaches workload limit (MAX_IN_FLIGHT)
2: Total reads reach max limit (10000)
3: All devices don't have queued more jobs, often happened in DUP case

And if all background works exit with remaining jobs,
btrfs_reada_wait() will wait indefinetelly.

Above problem is rarely happened in old code, because:
1: Every work queues 2x new works
   So many works reduced chances of undone jobs.
2: One work will continue 10000 times loop in case of no-jobs
   It reduced no-thread window time.

But after we fixed above case, the "undone reada extents" frequently
happened.

Fix:
 Check to ensure we have at least one thread if there are undone jobs
 in btrfs_reada_wait().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
2fefd5583f btrfs: reada: limit max works count
Reada creates 2 works for each level of tree recursively.

In case of a tree having many levels, the number of created works
is 2^level_of_tree.
Actually we don't need so many works in parallel, this patch limits
max works to BTRFS_MAX_MIRRORS * 2.

The per-fs works_counter will be also used for btrfs_reada_wait() to
check is there are background workers.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
895a11b868 btrfs: reada: simplify dev->reada_in_flight processing
No need to decrease dev->reada_in_flight in __readahead_hook()'s
internal and reada_extent_put().
reada_extent_put() have no chance to decrease dev->reada_in_flight
in free operation, because reada_extent have additional refcnt when
scheduled to a dev.

We can put inc and dec operation for dev->reada_in_flight to one
place instead to make logic simple and safe, and move useless
reada_extent->scheduled_for to a bool flag instead.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
8afd6841e1 btrfs: reada: Fix a debug code typo
Remove one copy of loop to fix the typo of iterate zones.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
57f16e0826 btrfs: reada: Jump into cleanup in direct way for __readahead_hook()
Current code set nritems to 0 to make for_loop useless to bypass it,
and set generation's value which is not necessary.
Jump into cleanup directly is better choise.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
02873e4325 btrfs: reada: Use fs_info instead of root in __readahead_hook's argument
What __readahead_hook() need exactly is fs_info, no need to convert
fs_info to root in caller and convert back in __readahead_hook()

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
6e39dbe8b9 btrfs: reada: Pass reada_extent into __readahead_hook directly
reada_start_machine_dev() already have reada_extent pointer, pass
it into __readahead_hook() directly instead of search radix_tree
will make code run faster.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
b257cf5006 btrfs: reada: move reada_extent_put to place after __readahead_hook()
We can't release reada_extent earlier than __readahead_hook(), because
__readahead_hook() still need to use it, it is necessary to hode a refcnt
to avoid it be freed.

Actually it is not a problem after my patch named:
  Avoid many times of empty loop
It make reada_extent in above line include at least one reada_extctl,
which keeps additional one refcnt for reada_extent.

But we still need this patch to make the code in pretty logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
1e7970c0f3 btrfs: reada: Remove level argument in severial functions
level is not used in severial functions, remove them from arguments,
and remove relative code for get its value.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
3194502118 btrfs: reada: bypass adding extent when all zone failed
When failed adding all dev_zones for a reada_extent, the extent
will have no chance to be selected to run, and keep in memory
for ever.

We should bypass this extent to avoid above case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
6a159d2ae4 btrfs: reada: add all reachable mirrors into reada device list
If some device is not reachable, we should bypass and continus addingb
next, instead of break on bad device.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:12 +01:00
Zhao Lei
a3f7fde243 btrfs: reada: Move is_need_to_readahead contition earlier
Move is_need_to_readahead contition earlier to avoid useless loop
to get relative data for readahead.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:26:10 +01:00
Ingo Molnar
3a2f2ac9b9 Merge branch 'x86/urgent' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 09:28:03 +01:00
Zhao Lei
97d5f0e63d btrfs: reada: Avoid many times of empty loop
We can see following loop(10000 times) in trace_log:
 [   75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 [   75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 ...(10000 times)

 [  124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [  124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

Reason:
 If more than one user trigger reada in same extent, the first task
 finished setting of reada data struct and call reada_start_machine()
 to start, and the second task only add a ref_count but have not
 add reada_extctl struct completely, the reada_extent can not finished
 all jobs, and will be selected in __reada_start_machine() for 10000
 times(total times in __reada_start_machine()).

Fix:
 For a reada_extent without job, we don't need to run it, just return
 0 to let caller break.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
8e9aa51f54 btrfs: reada: Add missed segment checking in reada_find_zone
In rechecking zone-in-tree, we still need to check zone include
our logical address.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
c37f49c7ef btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone
We can avoid additional locking-acquirment and one pair of
kref_get/put by combine two condition.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Zhao Lei
503785306d btrfs: reada: Fix in-segment calculation for reada
reada_zone->end is end pos of segment:
 end = start + cache->key.offset - 1;

So we need to use "<=" in condition to judge is a pos in the
segment.

The problem happened rearly, because logical pos rarely pointed
to last 4k of a blockgroup, but we need to fix it to make code
right in logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-16 13:21:45 +01:00
Filipe Manana
1636d1d77e Btrfs: fix direct IO requests not reporting IO error to user space
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:

   dio_end_io(dio_bio, bio->bi_error);

It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().

This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.

Cc: stable@vger.kernel.org  # 4.3+
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-16 03:41:26 +00:00
Linus Torvalds
27c9d772e5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes from Filipe, along with a readdir fix from Dave
  that we've been testing for some time"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: properly set the termination value of ctx->pos in readdir
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: remove no longer used function extent_read_full_page_nolock()
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
2016-02-12 09:21:28 -08:00
Qu Wenruo
fed8f166eb btrfs: Introduce new mount option alias for nologreplay
Introduce new mount option alias "norecovery" for nologreplay, to keep
"norecovery" behavior the same with other filesystems.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:49 +01:00
Qu Wenruo
96da09192c btrfs: Introduce new mount option to disable tree log replay
Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:49 +01:00
Qu Wenruo
8dcddfa048 btrfs: Introduce new mount option usebackuproot to replace recovery
Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.

Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.

Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.

This provides the basis for later unified "norecovery" mount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
  docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-12 15:14:14 +01:00
David Sterba
9f07e1d76e btrfs: teach print_leaf about temporary item subtypes
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
585a3d0d23 btrfs: teach print_leaf about permanent item subtypes
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
242e2956e4 btrfs: switch dev stats item to the permanent item key
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
50c2d5abe6 btrfs: introduce key type for persistent permanent items
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.

Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension.  The
victim is the BTRFS_DEV_STATS_KEY (248).

The device stats are an example of a permanent item.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
c479cb4f14 btrfs: switch balance item to the temporary item key
No visible change.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
0bbbccb17f btrfs: introduce key type for persistent temporary items
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension.  The victim is the BTRFS_BALANCE_ITEM_KEY (248).

The nature of the balance status item is a good example of the temporary
item. It exists from beginning of the balance, keeps the status until it
finishes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
bc4ef7592f btrfs: properly set the termination value of ctx->pos in readdir
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like that:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-02-11 07:01:59 -08:00
David Sterba
66722f7c05 btrfs: switch to kcalloc in btrfs_cmp_data_prepare
Kcalloc is functionally equivalent and does overflow checks.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
fd95ef56b1 btrfs: extent same: use GFP_KERNEL for page array allocations
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers. Here we can allocate up to 32k so less pressure to the
allocator could help.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
78f2c9e6db btrfs: device add and remove: use GFP_KERNEL
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
49e350a491 btrfs: readdir: use GFP_KERNEL
Readdir is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
32fc932e30 btrfs: fallocate: use GFP_KERNEL
Fallocate is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
74e4d82757 btrfs: let callers of btrfs_alloc_root pass gfp flags
We don't need to use GFP_NOFS in all contexts, eg. during mount or for
dummy root tree, but we might for the the log tree creation.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
58c4e17384 btrfs: scrub: use GFP_KERNEL on the submission path
Scrub is not on the critical writeback path we don't need to use
GFP_NOFS for all allocations. The failures are handled and stats passed
back to userspace.

Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the
global structures and the IO submission paths.

Functions that do the repair and fixups still use GFP_NOFS as we might
want to skip any other filesystem activity if we encounter an error.
This could turn out to be unnecessary, but requires more review compared
to the easy cases in this patch.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
ed0244faf5 btrfs: reada: use GFP_KERNEL everywhere
The readahead framework is not on the critical writeback path we don't
need to use GFP_NOFS for allocations. All error paths are handled and
the readahead failures are not fatal. The actual users (scrub,
dev-replace) will trigger reads if the blocks are not found in cache.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
David Sterba
e780b0d1c1 btrfs: send: use GFP_KERNEL everywhere
The send operation is not on the critical writeback path we don't need
to use GFP_NOFS for allocations. All error paths are handled and the
whole operation is restartable.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
Filipe Manana
0c0fe3b0fa Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
While doing some tests I ran into an hang on an extent buffer's rwlock
that produced the following trace:

[39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
[39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
[39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800016] irq event stamp: 0
[39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
[39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
[39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
[39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
[39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
[39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
[39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
[39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
[39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
[39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800016] Stack:
[39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
[39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
[39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
[39389.800016] Call Trace:
[39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
[39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
[39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
[39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
[39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
[39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
[39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
[39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
[39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
[39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
[39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[39389.800012] irq event stamp: 0
[39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
[39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
[39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
[39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
[39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
[39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
[39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
[39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
[39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
[39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
[39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
[39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
[39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
[39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
[39389.800012] Stack:
[39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
[39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
[39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
[39389.800012] Call Trace:
[39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
[39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
[39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
[39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
[39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
[39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
[39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
[39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
[39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
[39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
[39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
[39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
[39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
[39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
[39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
[39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
[39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
[39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
[39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00

This happens because in the code path executed by the inode_paths ioctl we
end up nesting two calls to read lock a leaf's rwlock when after the first
call to read_lock() and before the second call to read_lock(), another
task (running the delayed items as part of a transaction commit) has
already called write_lock() against the leaf's rwlock. This situation is
illustrated by the following diagram:

         Task A                       Task B

  btrfs_ref_to_path()               btrfs_commit_transaction()
    read_lock(&eb->lock);

                                      btrfs_run_delayed_items()
                                        __btrfs_commit_inode_delayed_items()
                                          __btrfs_update_delayed_inode()
                                            btrfs_lookup_inode()

                                              write_lock(&eb->lock);
                                                --> task waits for lock

    read_lock(&eb->lock);
    --> makes this task hang
        forever (and task B too
	of course)

So fix this by avoiding doing the nested read lock, which is easily
avoidable. This issue does not happen if task B calls write_lock() after
task A does the second call to read_lock(), however there does not seem
to exist anything in the documentation that mentions what is the expected
behaviour for recursive locking of rwlocks (leaving the idea that doing
so is not a good usage of rwlocks).

Also, as a side effect necessary for this fix, make sure we do not
needlessly read lock extent buffers when the input path has skip_locking
set (used when called from send).

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-05 02:26:25 +00:00
Filipe Manana
7f042a8370 Btrfs: remove no longer used function extent_read_full_page_nolock()
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:10 +00:00
Filipe Manana
3131400230 Btrfs: fix page reading in extent_same ioctl leading to csum errors
In the extent_same ioctl, we were grabbing the pages (locked) and
attempting to read them without bothering about any concurrent IO
against them. That is, we were not checking for any ongoing ordered
extents nor waiting for them to complete, which leads to a race where
the extent_same() code gets a checksum verification error when it
reads the pages, producing a message like the following in dmesg
and making the operation fail to user space with -ENOMEM:

[18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868

Fix this by using btrfs_readpage() for reading the pages instead of
extent_read_full_page_nolock(), which waits for any concurrent ordered
extents to complete and locks the io range. Also do better error handling
and don't treat all failures as -ENOMEM, as that's clearly misleasing,
becoming identical to the checks and operation of prepare_uptodate_page().

The use of extent_read_full_page_nolock() was required before
commit f441460202cb ("btrfs: fix deadlock with extent-same and readpage"),
as we had the range locked in an inode's io tree before attempting to
read the pages.

Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage")
Cc: stable@vger.kernel.org   # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:10 +00:00
Filipe Manana
e0bd70c67b Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
In the extent_same ioctl we are getting the pages for the source and
target ranges and unlocking them immediately after, which is incorrect
because later we attempt to map them (with kmap_atomic) and access their
contents at btrfs_cmp_data(). When we do such access the pages might have
been relocated or removed from memory, which leads to an invalid memory
access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
which produces a trace like the following:

186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
unloaded: btrfs]
[186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
[186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
[186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
[186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
[186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
[186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
[186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
[186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
[186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
[186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
[186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
[186736.681319] Stack:
[186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
[186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
[186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
[186736.681319] Call Trace:
[186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
[186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0

(gdb) list *(btrfs_ioctl+0x24cb)
0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
2967                    dst_addr = kmap_atomic(dst_page);
2968
2969                    flush_dcache_page(src_page);
2970                    flush_dcache_page(dst_page);
2971
2972                    if (memcmp(addr, dst_addr, cmp_len))
2973                            ret = BTRFS_SAME_DATA_DIFFERS;
2974
2975                    kunmap_atomic(addr);
2976                    kunmap_atomic(dst_addr);

So fix this by making sure we keep the pages locked and respect the same
locking order as everywhere else: get and lock the pages first and then
lock the range in the inode's io tree (like for example at
__btrfs_buffered_write() and extent_readpages()). If an ordered extent
is found after locking the range in the io tree, unlock the range,
unlock the pages, wait for the ordered extent to complete and repeat the
entire locking process until no overlapping ordered extents are found.

Cc: stable@vger.kernel.org   # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-03 19:27:09 +00:00