Commit Graph

46334 Commits

Author SHA1 Message Date
Eric Biggers
cc91542ac8 ext4: do not unnecessarily null-terminate encrypted symlink data
Null-terminating the fscrypt_symlink_data on read is unnecessary because
it is not string data --- it contains binary ciphertext.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:44:17 -04:00
gmail
e81d44778d ext4: release bh in make_indexed_dir
The commit 6050d47adc: "ext4: bail out from make_indexed_dir() on
first error" could end up leaking bh2 in the error path.

[ Also avoid renaming bh2 to bh, which just confuses things --tytso ]

Cc: stable@vger.kernel.org
Signed-off-by: yangsheng <yngsion@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:33:37 -04:00
Jan Kara
16c5468859 ext4: Allow parallel DIO reads
We can easily support parallel direct IO reads. We only have to make
sure we cannot expose uninitialized data by reading allocated block to
which data was not written yet, or which was already truncated. That is
easily achieved by holding inode_lock in shared mode - that excludes all
writes, truncates, hole punches. We also have to guard against page
writeback allocating blocks for delay-allocated pages - that race is
handled by the fact that we writeback all the pages in the affected
range and the lock protects us from new pages being created there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:03:17 -04:00
Martin Brandenburg
b78b11985a Merge branch 'misc' into for-next
Pull in an OrangeFS branch containing miscellaneous improvements.

- clean up debugfs globals
- remove dead code in sysfs
- reorganize duplicated sysfs attribute structs
- consolidate sysfs show and store functions
- remove duplicated sysfs_ops structures
- describe organization of sysfs
- make devreq_mutex static
- g_orangefs_stats -> orangefs_stats for consistency
- rename most remaining global variables
2016-09-28 14:50:46 -04:00
Al Viro
dbbab32574 cifs: get rid of unused arguments of CIFSSMBWrite()
they used to be used, but...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:54:53 -04:00
Andreas Gruenbacher
2211d5ba5c posix_acl: xattr representation cleanups
Remove the unnecessary typedefs and the zero-length a_entries array in
struct posix_acl_xattr_header.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:52:00 -04:00
Rasmus Villemoes
de04e76935 fs/aio.c: eliminate redundant loads in put_aio_ring_file
Using a local variable we can prevent gcc from reloading
aio_ring_file->f_inode->i_mapping twice, eliminating 2x2 dependent
loads.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:45:46 -04:00
Rasmus Villemoes
be218aa2e3 fs/internal.h: add const to ns_dentry_operations declaration
The actual definition in fs/nsfs.c is already const.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:45:46 -04:00
Arnd Bergmann
9dcfcda576 compat: remove compat_printk()
After 7e8e385aaf ("x86/compat: Remove sys32_vm86_warning"), this
function has become unused, so we can remove it as well.

Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-09-27 21:20:53 -04:00
Deepa Dinamani
c2050a454c fs: Replace current_fs_time() with current_time()
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:22 -04:00
Deepa Dinamani
02027d42c3 fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
CURRENT_TIME_SEC is not y2038 safe. current_time() will
be transitioned to use 64 bit time along with vfs in a
separate patch.
There is no plan to transistion CURRENT_TIME_SEC to use
y2038 safe time interfaces.

current_time() will also be extended to use superblock
range checking parameters when range checking is introduced.

This works because alloc_super() fills in the the s_time_gran
in super block to NSEC_PER_SEC.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:22 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Deepa Dinamani
2554c72edb fs: proc: Delete inode time initializations in proc_alloc_inode()
proc uses new_inode_pseudo() to allocate a new inode.
This in turn calls the proc_inode_alloc() callback.
But, at this point, inode is still not initialized
with the super_block pointer which only happens just
before alloc_inode() returns after the call to
inode_init_always().

Also, the inode times are initialized again after the
call to new_inode_pseudo() in proc_inode_alloc().
The assignemet in proc_alloc_inode() is redundant and
also doesn't work after the current_time() api is
changed to take struct inode* instead of
struct *super_block.

This bug was reported after current_time() was used to
assign times in proc_alloc_inode().

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:20 -04:00
Deepa Dinamani
3cd886666f vfs: Add current_time() api
current_fs_time() is used for inode timestamps.

Change the signature of the function to take inode pointer
instead of superblock as per Linus's suggestion.

Also, move the api under vfs as per the discussion on the
thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
suggestion on the thread, changing the function name.

current_fs_time() will be deleted after all the references
to it are replaced by current_time().

There was a bug reported by kbuild test bot with the change
as some of the calls to current_time() were made before the
super_block was initialized. Catch these accidental assignments
as timespec_trunc() does for wrong granularities. This allows
for the function to work right even in these circumstances.
But, adds a warning to make the user aware of the bug.

A coccinelle script was used to identify all the current
.alloc_inode super_block callbacks that updated inode timestamps.
proc filesystem was the only one that was modifying inode times
as part of this callback. The series includes a patch to fix that.

Note that timespec_trunc() will also be moved to fs/inode.c
in a separate patch when this will need to be revamped for
bounds checking purposes.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:20 -04:00
Eric Biggers
0026ba4008 fs/buffer.c: make __getblk_slow() static
__getblk_slow() was exported to modules in commit 3b5e6454aa
("fs/buffer.c: support buffer cache allocations with gfp modifiers").
This seems to have been a mistake, as no users were introduced nor was
the function declared in a header.  Change it back to 'static'.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
Alexey Dobriyan
771187d61b proc: unsigned file descriptors
Make struct proc_inode::fd unsigned.

This allows better code generation on x86_64 (less sign extensions).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
Alexey Dobriyan
9b80a184ea fs/file: more unsigned file descriptors
Propagate unsignedness for grand total of 149 bytes:

	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
	add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
	function                                     old     new   delta
	set_close_on_exec                             99      98      -1
	put_files_struct                             201     200      -1
	get_close_on_exec                             59      58      -1
	do_prlimit                                   498     497      -1
	do_execveat_common.isra                     1662    1661      -1
	__close_fd                                   178     173      -5
	do_dup2                                      219     204     -15
	seq_show                                     685     660     -25
	__alloc_fd                                   384     357     -27
	dup_fd                                       718     646     -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
Shawn Lin
85e7340f21 fs: compat: remove redundant check of nr_segs
nr_segs should never be less than zero as its type
is unsigned long, so let's remove this check.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
David Howells
a818101d7b cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
An NULL-pointer dereference happens in cachefiles_mark_object_inactive()
when it tries to read i_blocks so that it can tell the cachefilesd daemon
how much space it's making available.

The problem is that cachefiles_drop_object() calls
cachefiles_mark_object_inactive() after calling cachefiles_delete_object()
because the object being marked active staves off attempts to (re-)use the
file at that filename until after it has been deleted.  This means that
d_inode is NULL by the time we come to try to access it.

To fix the problem, have the caller of cachefiles_mark_object_inactive()
supply the number of blocks freed up.

Without this, the following oops may occur:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
...
CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G          I    ------------   3.10.0-470.el7.x86_64 #1
Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000
RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
RSP: 0018:ffff8800b77c3d70  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8
RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000
R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600
R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498
FS:  0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0
 ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658
 ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600
Call Trace:
 [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles]
 [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache]
 [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache]
 [<ffffffff810a605b>] process_one_work+0x17b/0x470
 [<ffffffff810a6e96>] worker_thread+0x126/0x410
 [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460
 [<ffffffff810ae64f>] kthread+0xcf/0xe0
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff81695418>] ret_from_fork+0x58/0x90
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140

The oopsing code shows:

	callq  0xffffffff810af6a0 <wake_up_bit>
	mov    0xf8(%r12),%rax
	mov    0x30(%rax),%rax
	mov    0x98(%rax),%rax   <---- oops here
	lock add %rax,0x130(%rbx)

where this is:

	d_backing_inode(object->dentry)->i_blocks

Fixes: a5b3a80b89 (CacheFiles: Provide read-and-reset release counters for cachefilesd)
Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:31:29 -04:00
Al Viro
fc56b9838a cifs: don't use memcpy() to copy struct iov_iter
it's not 70s anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:13:04 -04:00
Al Viro
4bce9f6ee8 get rid of separate multipage fault-in primitives
* the only remaining callers of "short" fault-ins are just as happy with generic
variants (both in lib/iov_iter.c); switch them to multipage variants, kill the
"short" ones
* rename the multipage variants to now available plain ones.
* get rid of compat macro defining iov_iter_fault_in_multipage_readable by
expanding it in its only user.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:12:24 -04:00
Jan Kara
225c5161b1 ext2: Unmap metadata when zeroing blocks
When zeroing blocks for DAX allocations, we also have to unmap aliases
in the block device mappings. Otherwise writeback can overwrite zeros
with stale data from block device page cache.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-27 18:16:55 +02:00
Eric Engestrom
a1a9e5d298 debugfs: propagate release() call result
The result was being ignored and 0 was always returned.
Return the actual result instead.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-27 12:45:57 +02:00
Johannes Thumshirn
78618d395b sysfs print name of undiscoverable attribute group
Print the name of an undiscoverable attribute group and not the
pointer's address.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-27 12:24:29 +02:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Miklos Szeredi
18fc84dafa vfs: remove unused i_op->rename
No in-tree uses remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Miklos Szeredi
1cd66c93ba fs: make remaining filesystems use .rename2
This is trivial to do:

 - add flags argument to foo_rename()
 - check if flags is zero
 - assign foo_rename() to .rename2 instead of .rename

This doesn't mean it's impossible to support RENAME_NOREPLACE for these
filesystems, but it is not trivial, like for local filesystems.
RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
for a file to be created on one host while it is overwritten by rename on
another host).

Filesystems converted:

9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

After this, we can get rid of the duplicate interfaces for rename.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Howells <dhowells@redhat.com> [AFS]
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Mark Fasheh <mfasheh@suse.com>
2016-09-27 11:03:58 +02:00
Miklos Szeredi
e0e0be8a83 libfs: support RENAME_NOREPLACE in simple_rename()
This is trivial to do:

 - add flags argument to simple_rename()
 - check if flags doesn't have any other than RENAME_NOREPLACE
 - assign simple_rename() to .rename2 instead of .rename

Filesystems converted:

hugetlbfs, ramfs, bpf.

Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access.  For this case pass zero flags to simple_rename().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
2016-09-27 11:03:57 +02:00
Miklos Szeredi
f03b8ad8d3 fs: support RENAME_NOREPLACE for local filesystems
This is trivial to do:

 - add flags argument to foo_rename()
 - check if flags doesn't have any other than RENAME_NOREPLACE
 - assign foo_rename() to .rename2 instead of .rename

Filesystems converted:

affs, bfs, exofs, ext2, hfs, hfsplus, jffs2, jfs, logfs, minix, msdos,
nilfs2, omfs, reiserfs, sysvfs, ubifs, udf, ufs, vfat.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Acked-by: Richard Weinberger <richard@nod.at>
Acked-by: Bob Copeland <me@bobcopeland.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
2016-09-27 11:03:57 +02:00
Miklos Szeredi
9a232de499 ncpfs: fix unused variable warning
Without CONFIG_NCPFS_NLS the following warning is seen:

fs/ncpfs/dir.c: In function 'ncp_hash_dentry':
fs/ncpfs/dir.c:136:23: warning: unused variable 'sb' [-Wunused-variable]
   struct super_block *sb = dentry->d_sb;

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:57 +02:00
Andreas Gruenbacher
332f51d7db gfs2: Initialize atime of I_NEW inodes
Fix for commit 719ee344: initialize atime of I_NEW inodes to 0 so that
the timestamps read from disk will always be more recent than the
initial timestamp, and the atime in the I_NEW inode will be set correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-09-26 13:24:34 -05:00
Andreas Gruenbacher
d7c436cd60 gfs2: Update file times after grabbing glock
In gfs2_page_mkwrite, grab the inode glock in EX mode before calling
file_update_time: grabbing the lock may result in a call to
gfs2_dinode_in, which will reset the file times to their on-disk state.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-09-26 13:20:19 -05:00
Liu Bo
196e02490c Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
When we're not able to get enough space through splitting leaf,
we'd create a new sibling leaf instead, and it's possible that we return
 a zero-nritem sibling leaf and mark it dirty before it's in a consistent
state.  With CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, the integrity check of
check_leaf will report panic due to this zero-nritem non-root leaf.

This removes the unnecessary btrfs_mark_buffer_dirty.

Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:50:44 +02:00
Josef Bacik
4867268c57 Btrfs: don't BUG() during drop snapshot
Really there's lots of things that can go wrong here, kill all the
BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO
instead.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ switched to btrfs_err, errors go to common label ]
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Arnd Bergmann
2fd57fcb16 btrfs: fix btrfs_no_printk stub helper
The addition of btrfs_no_printk() caused a build failure when
CONFIG_PRINTK is disabled:

fs/btrfs/send.c: In function 'send_rename':
fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration]

This moves the helper outside of that #ifdef so it is always
defined, and changes the existing #ifdef to refer to that
helper as well for consistency.

Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Liu Bo
851cd173f0 Btrfs: memset to avoid stale content in btree leaf
This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".

This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
0f5053eb90 btrfs: parent_start initialization cleanup
Code cleanup. parent_start is initialized multiple times when it is
not necessary to do so.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
6cea66e544 btrfs: Remove already completed TODO comment
Fixes: 7cf5b97650 ("btrfs: qgroup: Cleanup old inaccurate facilities")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
dd12d5b804 btrfs: Do not reassign count in btrfs_run_delayed_refs
Code cleanup. count is already (unsgined long)-1. That is the reason
run_all was set. Do not reassign it (unsigned long)-1.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Anand Jain
0ccd05285e btrfs: fix a possible umount deadlock
btrfs_show_devname() is using the device_list_mutex, sometimes
a call to blkdev_put() leads vfs calling into this func. So
call blkdev_put() outside of device_list_mutex, as of now.

[  983.284212] ======================================================
[  983.290401] [ INFO: possible circular locking dependency detected ]
[  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
[  983.302081] -------------------------------------------------------
[  983.308357] umount/21720 is trying to acquire lock:
[  983.313243]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.321264]
[  983.321264] but task is already holding lock:
[  983.327101]  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.337839]
[  983.337839] which lock already depends on the new lock.
[  983.337839]
[  983.346024]
[  983.346024] the existing dependency chain (in reverse order) is:
[  983.353512]
-> #4 (&fs_devs->device_list_mutex){+.+...}:
[  983.359096]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.365143]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.371521]        [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs]
[  983.378710]        [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150
[  983.384593]        [<ffffffff9126ffc7>] m_show+0x17/0x20
[  983.389957]        [<ffffffff91276405>] seq_read+0x2b5/0x3b0
[  983.395669]        [<ffffffff9124c808>] __vfs_read+0x28/0x100
[  983.401464]        [<ffffffff9124eb3b>] vfs_read+0xab/0x150
[  983.407080]        [<ffffffff9124ec32>] SyS_read+0x52/0xb0
[  983.412609]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.419617]
-> #3 (namespace_sem){++++++}:
[  983.424024]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.430074]        [<ffffffff918239e9>] down_write+0x49/0x80
[  983.435785]        [<ffffffff91272457>] lock_mount+0x67/0x1c0
[  983.441582]        [<ffffffff91272ab2>] do_add_mount+0x32/0xf0
[  983.447458]        [<ffffffff9127363a>] finish_automount+0x5a/0xc0
[  983.453682]        [<ffffffff91259513>] follow_managed+0x1b3/0x2a0
[  983.459912]        [<ffffffff9125b750>] lookup_fast+0x300/0x350
[  983.465875]        [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0
[  983.471846]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
[  983.477731]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
[  983.483702]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
[  983.489240]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.496254]
-> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}:
[  983.501798]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.507855]        [<ffffffff918239e9>] down_write+0x49/0x80
[  983.513558]        [<ffffffff91366237>] start_creating+0x87/0x100
[  983.519703]        [<ffffffff91366647>] debugfs_create_dir+0x17/0x100
[  983.526195]        [<ffffffff911df153>] bdi_register+0x93/0x210
[  983.532165]        [<ffffffff911df313>] bdi_register_owner+0x43/0x70
[  983.538570]        [<ffffffff914080fb>] device_add_disk+0x1fb/0x450
[  983.544888]        [<ffffffff91580226>] loop_add+0x1e6/0x290
[  983.550596]        [<ffffffff91fec358>] loop_init+0x10b/0x14f
[  983.556394]        [<ffffffff91002207>] do_one_initcall+0xa7/0x180
[  983.562618]        [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266
[  983.569370]        [<ffffffff918174be>] kernel_init+0xe/0x100
[  983.575166]        [<ffffffff9182620f>] ret_from_fork+0x1f/0x40
[  983.581131]
-> #1 (loop_index_mutex){+.+.+.}:
[  983.585801]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.591858]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.598256]        [<ffffffff9157ed3f>] lo_open+0x1f/0x60
[  983.603704]        [<ffffffff9128eec3>] __blkdev_get+0x123/0x400
[  983.609757]        [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350
[  983.615639]        [<ffffffff9128f554>] blkdev_open+0x64/0x80
[  983.621428]        [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0
[  983.627651]        [<ffffffff9124c029>] vfs_open+0x69/0x80
[  983.633181]        [<ffffffff9125db74>] path_openat+0x834/0xaa0
[  983.639152]        [<ffffffff9125ef75>] do_filp_open+0x85/0xe0
[  983.645035]        [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0
[  983.650999]        [<ffffffff9124c4de>] SyS_open+0x1e/0x20
[  983.656535]        [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.663541]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[  983.668107]        [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
[  983.674510]        [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.680561]        [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.686967]        [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.692761]        [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.699699]        [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.707178]        [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  983.714380]        [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
[  983.721061]        [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
[  983.727908]        [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
[  983.734744]        [<ffffffff91250f56>] kill_anon_super+0x16/0x30
[  983.740888]        [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
[  983.747909]        [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
[  983.754745]        [<ffffffff912515fd>] deactivate_super+0x5d/0x70
[  983.760977]        [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
[  983.766773]        [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
[  983.772738]        [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
[  983.778708]        [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
[  983.785373]        [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
[  983.792212]        [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1
[  983.799225]
[  983.799225] other info that might help us debug this:
[  983.799225]
[  983.807291] Chain exists of:
  &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex

[  983.816521]  Possible unsafe locking scenario:
[  983.816521]
[  983.822489]        CPU0                    CPU1
[  983.827043]        ----                    ----
[  983.831599]   lock(&fs_devs->device_list_mutex);
[  983.836289]                                lock(namespace_sem);
[  983.842268]                                lock(&fs_devs->device_list_mutex);
[  983.849478]   lock(&bdev->bd_mutex);
[  983.853127]
[  983.853127]  *** DEADLOCK ***
[  983.853127]
[  983.859113] 3 locks held by umount/21720:
[  983.863145]  #0:  (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70
[  983.872713]  #1:  (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs]
[  983.882206]  #2:  (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.893422]
[  983.893422] stack backtrace:
[  983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1
[  983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
[  983.913492]  0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0
[  983.921018]  ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050
[  983.928542]  ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0
[  983.936072] Call Trace:
[  983.938545]  [<ffffffff91429521>] dump_stack+0x85/0xc4
[  983.943715]  [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c
[  983.949748]  [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0
[  983.955613]  [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0
[  983.961123]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
[  983.966550]  [<ffffffff91823125>] mutex_lock_nested+0x65/0x350
[  983.972407]  [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150
[  983.977832]  [<ffffffff9128ec51>] blkdev_put+0x31/0x150
[  983.983101]  [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.989500]  [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.996415]  [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  984.003068]  [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs]
[  984.009189]  [<ffffffff9126cc5e>] ? evict_inodes+0x15e/0x170
[  984.014881]  [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs]
[  984.021176]  [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100
[  984.027476]  [<ffffffff91250f56>] kill_anon_super+0x16/0x30
[  984.033082]  [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs]
[  984.039548]  [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80
[  984.045839]  [<ffffffff912515fd>] deactivate_super+0x5d/0x70
[  984.051525]  [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80
[  984.056774]  [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20
[  984.062201]  [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0
[  984.067625]  [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4
[  984.073747]  [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0
[  984.080038]  [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Liu Bo
a958eab0ed Btrfs: fix memory leak in do_walk_down
The extent buffer 'next' needs to be free'd conditionally.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
c01f5f96f5 btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
We can hit unused variable warnings when btrfs_debug and friends are
just aliases for no_printk.  This is due to the fs_info not getting
consumed by the function call, which can happen if convenenience
variables are used.  This patch adds a new btrfs_no_printk static inline
that consumes the convenience variable and does nothing else.  It
silences the unused variable warning and has no impact on the generated
code:

$ size fs/btrfs/extent_io.o*
   text	   data	    bss	    dec	    hex	filename
  44072	    152	     32	  44256	   ace0	fs/btrfs/extent_io.o.btrfs_no_printk
  44072	    152	     32	  44256	   ace0	fs/btrfs/extent_io.o.no_printk

Fixes: 27a0dd61a5 (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
04ab956ee6 btrfs: convert send's verbose_printk to btrfs_debug
This was basically an open-coded, less flexible dynamic printk.  We can
just use btrfs_debug instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
ab8d0fc48d btrfs: convert pr_* to btrfs_* where possible
For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:04 +02:00
Jeff Mahoney
62e855771d btrfs: convert printk(KERN_* to use pr_* calls
This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
cea67ab92d btrfs: clean the old superblocks before freeing the device
btrfs_rm_device frees the block device but then re-opens it using
the saved device name.  A race exists between the close and the
re-open that allows the block size to be changed.  The result
is getting stuck forever in the reclaim loop in __getblk_slow.

This patch moves the superblock cleanup before closing the block
device, which is also consistent with other callers.  We also don't
need a private copy of dev_name as the whole routine operates under
the uuid_mutex.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Liu Bo
02794222c4 Btrfs: kill BUG_ON in run_delayed_tree_ref
In a corrupted btrfs image, we can come across this BUG_ON and
get an unreponsive system, but if we return errors instead,
its caller can handle everything gracefully by aborting the current
transaction.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Josef Bacik
6bdf131fac Btrfs: don't leak reloc root nodes on error
We don't track the reloc roots in any sort of normal way, so the only way the
root/commit_root nodes get free'd is if the relocation finishes successfully and
the reloc root is deleted.  Fix this by free'ing them in free_reloc_roots.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Masahiro Yamada
e2c8990734 btrfs: squash lines for simple wrapper functions
Remove unneeded variables and assignments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:38 +02:00
Liu Bo
6b722c1747 Btrfs: improve check_node to avoid reading corrupted nodes
We need to check items in a node to make sure that we're reading
a valid one, otherwise we could get various crashes while processing
delayed_refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:05:28 +02:00
Liu Bo
a42cbec9c6 Btrfs: add error handling for extent buffer in print tree
Somehow we missed btrfs_print_tree when last time we
updated error handling for read_extent_block().

This keeps us from getting a NULL pointer panic when
btrfs_print_tree's read_extent_block() fails.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:04:01 +02:00
Liu Bo
a43f7f8206 Btrfs: remove BUG_ON in start_transaction
Since we could get errors from the concurrent aborted transaction,
the check of this BUG_ON in start_transaction is not true any more.

Say, while flushing free space cache inode's dirty pages,
btrfs_finish_ordered_io
 -> btrfs_join_transaction_nolock
      (the transaction has been aborted.)
      -> BUG_ON(type == TRANS_JOIN_NOLOCK);

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:04:01 +02:00
Liu Bo
3eb548ee3a Btrfs: memset to avoid stale content in btree node block
During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.

So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node.  One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.

This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:03:47 +02:00
Liu Bo
3561b9db70 Btrfs: return gracefully from balance if fs tree is corrupted
When relocating tree blocks, we firstly get block information from
back references in the extent tree, we then search fs tree to try to
find all parents of a block.

However, if fs tree is corrupted, eg. if there're some missing
items, we could come across these WARN_ONs and BUG_ONs.

This makes us print some error messages and return gracefully
from balance.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
9c8e63db1d Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written
No reason to bug on in here, fs corruption could easily cause these things to
happen.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
8436ea91a1 Btrfs: kill the start argument to read_extent_buffer_pages
Nobody uses this, it makes no sense to do partial reads of extent buffers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Josef Bacik
afcdd129e0 Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Qu Wenruo
ba8b04c1d4 btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.

This should reduce conflict of both patchset and the effort to rebase
them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Jeff Mahoney
897a41b116 btrfs: add dynamic debug support
We can re-use the dynamic debugging descriptor to make use of the dynamic
debugging mechanism but still use our own printk interface.

Defining the DEBUG macro works as it did before.  When it's defined,
all of the messages default to print.  We can also enable all debug
messages at boot or module-load time using the 'dyndbg' and
'btrfs.dyndbg' options.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Luis Henriques
2309e79650 btrfs: Fix warning "variable ‘gen’ set but not used"
Variable 'gen' in reada_for_search() is not used since commit 58dc4ce432
("btrfs: remove unused parameter from readahead_tree_block").  This patch
simply removes this variable.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Luis Henriques
1f079fa2f8 btrfs: Fix warning "variable ‘blocksize’ set but not used"
Variable 'blocksize' in reada_walk_down() is not used since commit
d3e46fea1b ("btrfs: sink blocksize parameter to readahead_tree_block").
This patch simply removes this variable.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Naohiro Aota
5d8eb6fe51 btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs
Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But
the work can be done by btrfs_delete_unused_bgs() (and it's better since it
trim the BG). Let's dedupe the code.

While btrfs_delete_unused_bgs() is already hitting the relocated BG, it
skip the BG since the BG has "ro" flag set (to keep balancing BG intact).
On the other hand, btrfs cannot drop "ro" flag here to prevent additional
writes. So this patch make use of "removed" flag.
btrfs_delete_unused_bgs() now detect the flag to distinguish whether a
read-only BG is relocating or not.

Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
49303381f1 Btrfs: bail out if block group has different mixed flag
Currently we allow inconsistence about mixed flag
 (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA).

We'd get ENOSPC if block group has mixed flag and btrfs doesn't.
If that happens, we have one space_info with mixed flag and another
space_info only with BTRFS_BLOCK_GROUP_METADATA, and
global_block_rsv.space_info points to the latter one, but all bytes
from block_group contributes to the mixed space_info, thus all the
allocation will fail with ENOSPC.

This adds a check for the above case.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ updated message ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
2571e73967 Btrfs: fix memory leak in reading btree blocks
So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
   completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
   and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
   pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
   3) doesn't count them for eb->io_pages, but they are later
   cleared by 2) so we has to readpage on the page, we get
   the wrong eb->io_pages which results in a memory leak of
   this block.

This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.

   t1(readahead)                              t2(readahead endio)                                       t3(the following read)
read_extent_buffer_pages                    end_bio_extent_readpage
  for pg in eb:                                for page 0,1,2 in eb:
      if pg is uptodate:                           btree_readpage_end_io_hook(pg)
          num_reads++                              if uptodate:
  eb->io_pages = num_reads                             SetPageUptodate(pg)              _______________
  for pg in eb:                                for page 3 in eb:                                     read_extent_buffer_pages
       if pg is NOT uptodate:                      btree_readpage_end_io_hook(pg)                       for pg in eb:
           __extent_read_full_page(pg)                 sanity check reports something wrong                 if pg is uptodate:
                                                       clear_extent_buffer_uptodate(eb)                         num_reads++
                                                           for pg in eb:                                eb->io_pages = num_reads
                                                               ClearPageUptodate(page)  _______________
                                                                                                        for pg in eb:
                                                                                                            if pg is NOT uptodate:
                                                                                                                __extent_read_full_page(pg)

So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
e46a28ca3d Btrfs: remove BUG() in raid56
This BUG() has been triggered by a fuzz testing image, which contains
an invalid chunk type, ie. a single stripe chunk has the raid6 type.

Btrfs can handle this gracefully by returning -EIO, so besides using
btrfs_warn to give us more debugging information rather than a single
BUG(), we can return error properly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Lu Fengqi
afce772e87 btrfs: fix check_shared for fiemap ioctl
Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.

First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
David Sterba
b0de6c4c81 btrfs: create example debugfs file only in debugging build
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Eric Sandeen
07f6a48043 btrfs: fix perms on demonstration debugfs interface
btrfs provides a helpful demonstration of how to export
a global variable via debugfs; however, it is unique among
other debugfs files in that it is world-writable, which causes
some concern to people who are not familiar with its purpose.

Fix it so that it is only user-writable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
c79a175175 Btrfs: fix memory leak of block group cache
While processing delayed refs, we may update block group's statistics
and attach it to cur_trans->dirty_bgs, and later writing dirty block
groups will process the list, which happens during
btrfs_commit_transaction().

For whatever reason, the transaction is aborted and dirty_bgs
is not processed in cleanup_transaction(), we end up with memory leak
of these dirty block group cache.

Since btrfs_start_dirty_block_groups() doesn't make it go to the commit
critical section, this also adds the cleanup work inside it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Brian Foster
5cd9cee98b xfs: log recovery tracepoints to track current lsn and buffer submission
Log recovery has particular rules around buffer submission along with
tricky corner cases where independent transactions can share an LSN. As
such, it can be difficult to follow when/why buffers are submitted
during recovery.

Add a couple tracepoints to post the current LSN of a record when a new
record is being processed and when a buffer is being skipped due to LSN
ordering. Also, update the recover item class to include the LSN of the
current transaction for the item being processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:34:52 +10:00
Brian Foster
60a4a22251 xfs: update metadata LSN in buffers during log recovery
Log recovery is currently broken for v5 superblocks in that it never
updates the metadata LSN of buffers written out during recovery. The
metadata LSN is recorded in various bits of metadata to provide recovery
ordering criteria that prevents transient corruption states reported by
buffer write verifiers. Without such ordering logic, buffer updates can
be replayed out of order and lead to false positive transient corruption
states. This is generally not a corruption vector on its own, but
corruption detection shuts down the filesystem and ultimately prevents a
mount if it occurs during log recovery. This requires an xfs_repair run
that clears the log and potentially loses filesystem updates.

This problem is avoided in most cases as metadata writes during normal
filesystem operation update the metadata LSN appropriately. The problem
with log recovery not updating metadata LSNs manifests if the system
happens to crash shortly after log recovery itself. In this scenario, it
is possible for log recovery to complete all metadata I/O such that the
filesystem is consistent. If a crash occurs after that point but before
the log tail is pushed forward by subsequent operations, however, the
next mount performs the same log recovery over again. If a buffer is
updated multiple times in the dirty range of the log, an earlier update
in the log might not be valid based on the current state of the
associated buffer after all of the updates in the log had been replayed
(before the previous crash). If a verifier happens to detect such a
problem, the filesystem claims corruption and immediately shuts down.

This commonly manifests in practice as directory block verifier failures
such as the following, likely due to directory verifiers being
particularly detailed in their checks as compared to most others:

  ...
  Mounting V5 Filesystem
  XFS (dm-0): Starting recovery (logdev: internal)
  XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
    file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
  ...

Update log recovery to update the metadata LSN of recovered buffers.
Since metadata LSNs are already updated by write verifer functions via
attached log items, attach a dummy log item to the buffer during
validation and explicitly set the LSN of the current transaction. This
ensures that the metadata LSN of a buffer is updated based on whether
the recovery I/O actually completes, and if so, that subsequent recovery
attempts identify that the buffer is already up to date with respect to
the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:34:27 +10:00
Brian Foster
040c52c0aa xfs: don't warn on buffers not being recovered due to LSN
The log recovery buffer validation function is invoked in cases where a
buffer update may be skipped due to LSN ordering. If the validation
function happens to come across directory conversion situations (e.g., a
dir3 block to data conversion), it may warn about seeing a buffer log
format of one type and a buffer with a magic number of another.

This warning is not valid as the buffer update is ultimately skipped.
This is indicated by a current_lsn of NULLCOMMITLSN provided by the
caller. As such, update xlog_recover_validate_buf_type() to only warn in
such cases when a buffer update is expected.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:32:50 +10:00
Brian Foster
22db9af248 xfs: pass current lsn to log recovery buffer validation
The current LSN must be available to the buffer validation function to
provide the ability to update the metadata LSN of the buffer. Pass the
current_lsn value down to xlog_recover_validate_buf_type() in
preparation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:32:07 +10:00
Brian Foster
12818d24db xfs: rework log recovery to submit buffers on LSN boundaries
The fix to log recovery to update the metadata LSN in recovered buffers
introduces the requirement that a buffer is submitted only once per
current LSN. Log recovery currently submits buffers on transaction
boundaries. This is not sufficient as the abstraction between log
records and transactions allows for various scenarios where multiple
transactions can share the same current LSN. If independent transactions
share an LSN and both modify the same buffer, log recovery can
incorrectly skip updates and leave the filesystem in an inconsisent
state.

In preparation for proper metadata LSN updates during log recovery,
update log recovery to submit buffers for write on LSN change boundaries
rather than transaction boundaries. Explicitly track the current LSN in
a new struct xlog field to handle the various corner cases of when the
current LSN may or may not change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:22:16 +10:00
Dave Chinner
ddeb14f4fb xfs: quiesce the filesystem after recovery on readonly mount
Recently we've had a number of reports where log recovery on a v5
filesystem has reported corruptions that looked to be caused by
recovery being re-run over the top of an already-recovered
metadata. This has uncovered a bug in recovery (fixed elsewhere)
but the vector that caused this was largely unknown.

A kdump test started tripping over this problem - the system
would be crashed, the kdump kernel and environment would boot and
dump the kernel core image, and then the system would reboot. After
reboot, the root filesystem was triggering log recovery and
corruptions were being detected. The metadumps indicated the above
log recovery issue.

What is happening is that the kdump kernel and environment is
mounting the root device read-only to find the binaries needed to do
it's work. The result of this is that it is running log recovery.
However, because there were unlinked files and EFIs to be processed
by recovery, the completion of phase 1 of log recovery could not
mark the log clean. And because it's a read-only mount, the unmount
process does not write records to the log to mark it clean, either.
Hence on the next mount of the filesystem, log recovery was run
again across all the metadata that had already been recovered and
this is what triggered corruption warnings.

To avoid this problem, we need to ensure that a read-only mount
always updates the log when it completes the second phase of
recovery. We already handle this sort of issue with rw->ro remount
transitions, so the solution is as simple as quiescing the
filesystem at the appropriate time during the mount process. This
results in the log being marked clean so the mount behaviour
recorded in the logs on repeated RO mounts will change (i.e. log
recovery will no longer be run on every mount until a RW mount is
done). This is a user visible change in behaviour, but it is
harmless.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:21:44 +10:00
Dave Chinner
292378edcb xfs: remote attribute blocks aren't really userdata
When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.

Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.

To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.

As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.

The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.

It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...

Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:21:28 +10:00
Linus Torvalds
b22734a550 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Josef fixed a problem when quotas are enabled with his latest ENOSPC
  rework, and Jeff added more checks into the subvol ioctls to avoid
  tripping up lookup_one_len"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: ensure that file descriptor used with subvol ioctls is a dir
  Btrfs: handle quota reserve failure properly
2016-09-23 13:39:37 -07:00
Linus Torvalds
e47f2e50ea One more trivial fix for the binary attribute code from Phil Turnbull.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX5KV7AAoJEA+eU2VSBFGD6hEQAINlrv/sIX2mQcxaETodsvPq
 kKt6ESgogl0ZTq3lpNhaOwhiozrvgCPJibQZarq4Qr2q2Sz+AkQzYSLCcVO+CmJB
 94w4jy2m+M+diEFKpjexJpD+LfEoJPjhfrjs9wI6CKUL2F0FS+LUUOU44gCzSKdh
 wupkVgPvC3csUZG/9QwTRxZH9Zh/DpsN2JC7MkM3YSc5ELw+YaFWWiEMNjyNMll2
 ex2l2+fhfbdHW8WGl5rCjaCfjagi1h2VMtOkbwr4LWX89IMVgAdKbtkquAcme41t
 o6oHAqN+8EZwxaWdKTR247u5dg5p7W2MeOQyJmlFzUa52fv8APrKONlUfmco/aYC
 fBvt4s0Hsg/i57dpl+ZdFIfEXzpDgQZpWCEoUvGzfNayghUBk7vF+CcTl+lzcnqA
 qEiKu9NLMpVmMb1XWCAJzWDTVhY/JJrfx/ndsHiyWlXuiI+yDvQvIIN3fVbkzzHR
 4Q52n8zVa2MaVcACb5vf0OKVaETNsemD3oMN5irGcA/RMylxnO7iKghemDYDXMfZ
 Cnm5pyIm6ZF2a9UapetKEfQawdo7UkS1wXkKMPwLhB6aoK4gbk5pxK0oUxmiQyyp
 T5o9nZ3Vmj4XoZwaaq2mlIOlj/USSIa8DChXMb43NH8agiMwFzIm8nbAHhr9TEtd
 JpaLYUe+BvqcZvTwBRxS
 =+uba
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs

Pull configfs fix from Christoph Hellwig:
 "One more trivial fix for the binary attribute code from Phil Turnbull"

* tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs:
  configfs: Return -EFBIG from configfs_write_bin_file.
2016-09-23 09:45:15 -07:00
David S. Miller
d6989d4bbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-23 06:46:57 -04:00
Eric W. Biederman
e98d413703 devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
In 99.99% of the cases only root in a user namespace can mount /dev/pts
and in those cases the owner of /dev/pts/ptmx will remain root.root

In the oddball case where someone else has CAP_SYS_ADMIN this code
modifies the /dev/pts mount code to use current_fsuid and current_fsgid
as the values to use when creating the /dev/ptmx inode.  As is done
when any other file is created.

This is a code simplification, and it allows running without a root
user entirely.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
6bd1d8758d devpts: Remove sync_filesystems
devpts does not and never will have anything to sync
so don't bother calling sync_filesystems on remount.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
40b320e1c7 devpts: Make devpts_kill_sb safe if fsi is NULL
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
c1b241f0c1 devpts: Simplify devpts_mount by using mount_nodev
Now that all of the work of setting up a superblock has been moved to
devpts_fill_super simplify devpts_mount by calling mount_nodev instead
of rolling mount_nodev by hand.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
180d904442 devpts: Move the creation of /dev/pts/ptmx into fill_super
The code makes more sense here and things are just clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
dee87d4736 devpts: Move parse_mount_options into fill_super
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 11:31:31 +02:00
Eric W. Biederman
213b067ce3 nsfs: Simplify __ns_get_path
Move mntget from the very beginning of __ns_get_path to
the success path of __ns_get_path, and remove the mntget
calls.

This removes the possibility that there will be a mntget/mntput
pair of __ns_get_path has to retry, and generally simplifies the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 20:06:20 -05:00
Eric W. Biederman
7872559664 Merge branch 'nsfs-ioctls' into HEAD
From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble.  I think this may actually a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22 20:00:36 -05:00
Andrey Vagin
a7306ed8d9 nsfs: add ioctl to get a parent namespace
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.

In a future we will use this interface to dump and restore nested
namespaces.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:41 -05:00
Andrey Vagin
6786741dbf nsfs: add ioctl to get an owning user namespace for ns file descriptor
Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Understending namespaces relationships allows to answer the question:
what capability does process X have to perform operations on a resource
governed by namespace Y?

After a long discussion, Eric W. Biederman proposed to use ioctl-s for
this purpose.

The NS_GET_USERNS ioctl returns a file descriptor to an owning user
namespace.
It returns EPERM if a target namespace is outside of a current user
namespace.

v2: rename parent to relative

v3: Add a missing mntput when returning -EAGAIN --EWB

Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lkml.org/lkml/2016/7/6/158
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:40 -05:00
Andrey Vagin
bcac25a58b kernel: add a helper to get an owning user namespace for a namespace
Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:39 -05:00
Yunlei He
5d4c0af41f f2fs: preallocate blocks for encrypted file
This patch allow preallocates data blocks for buffered aio writes
in encrypted file.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix to avoid BUG_ON]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 11:43:08 -07:00
Chao Yu
5bc994a043 f2fs: show dirty inode number
This patch enables showing dirty inode number in procfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 11:43:07 -07:00
Chao Yu
8b038c70df f2fs: support IO error injection
This patch adds to support IO error injection for testing IO error
tolerance of f2fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 11:43:06 -07:00
Chao Yu
866969668a f2fs: fix to return error number of read_all_xattrs correctly
We treat all error in read_all_xattrs as a no memory error, which covers
the real reason of failure in it. Fix it by return correct errno in order
to reflect the real cause.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 11:43:05 -07:00
Chao Yu
ebfa732217 f2fs: make f2fs_filetype_table static
There is no more user of f2fs_filetype_table outside of dir.c, make it
static.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 11:43:04 -07:00
Eric W. Biederman
93f0a88bd4 devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
In 99.99% of the cases only root in a user namespace can mount /dev/pts
and in those cases the owner of /dev/pts/ptmx will remain root.root

In the oddball case where someone else has CAP_SYS_ADMIN this code
modifies the /dev/pts mount code to use current_fsuid and current_fsgid
as the values to use when creating the /dev/ptmx inode.  As is done
when any other file is created.

This is a code simplification, and it allows running without a root
user entirely.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:32:26 -05:00
Eric W. Biederman
985e5d856c devpts: Remove sync_filesystems
devpts does not and never will have anything to sync
so don't bother calling sync_filesystems on remount.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:32:20 -05:00
Eric W. Biederman
0d126a7ff7 devpts: Make devpts_kill_sb safe if fsi is NULL
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:32:16 -05:00
Eric W. Biederman
ec0a9ba6f2 devpts: Simplify devpts_mount by using mount_nodev
Now that all of the work of setting up a superblock has been moved to
devpts_fill_super simplify devpts_mount by calling mount_nodev instead
of rolling mount_nodev by hand.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:32:12 -05:00
Eric W. Biederman
7dd17f7134 devpts: Move the creation of /dev/pts/ptmx into fill_super
The code makes more sense here and things are just clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:32:08 -05:00
Eric W. Biederman
208904793a devpts: Move parse_mount_options into fill_super
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:31:58 -05:00
Eric W. Biederman
df75e7748b userns: When the per user per user namespace limit is reached return ENOSPC
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong.  I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC.  It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:25:56 -05:00
Ross Zwisler
cca32b7eeb ext4: allow DAX writeback for hole punch
Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-22 11:49:38 -04:00
Jan Kara
e03a9976af jbd2: fix lockdep annotation in add_transaction_credits()
Thomas has reported a lockdep splat hitting in
add_transaction_credits(). The problem is that that function calls
jbd2_might_wait_for_commit() while holding j_state_lock which is wrong
(we do not really wait for transaction commit while holding that lock).

Fix the problem by moving jbd2_might_wait_for_commit() into places where
we are ready to wait for transaction commit and thus j_state_lock is
unlocked.

Cc: stable@vger.kernel.org
Fixes: 1eaa566d36
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-22 11:44:06 -04:00
Peter Zijlstra
87709e28dc fs/locks: Use percpu_down_read_preempt_disable()
Avoid spurious preemption.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:54 +02:00
Peter Zijlstra
7c3f654d8e fs/locks: Replace lg_local with a per-cpu spinlock
As Oleg suggested, replace file_lock_list with a structure containing
the hlist head and a spinlock.

This completely removes the lglock from fs/locks.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:53 +02:00
Peter Zijlstra
aba3766073 fs/locks: Replace lg_global with a percpu-rwsem
Replace the global part of the lglock with a percpu-rwsem.

Since fcl_lock is a spinlock and itself nests under i_lock, which too
is a spinlock we cannot acquire sleeping locks at
locks_{insert,remove}_global_locks().

We can however wrap all fcl_lock acquisitions with percpu_down_read
such that all invocations of locks_{insert,remove}_global_locks() have
that read lock held.

This allows us to replace the lg_global part of the lglock with the
write side of the rwsem.

In the absense of writers, percpu_{down,up}_read() are free of atomic
instructions. This further avoids the very long preempt-disable
regions caused by lglock on larger machines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:25:53 +02:00
Jan Kara
030b533c4f fs: Avoid premature clearing of capabilities
Currently, notify_change() clears capabilities or IMA attributes by
calling security_inode_killpriv() before calling into ->setattr. Thus it
happens before any other permission checks in inode_change_ok() and user
is thus allowed to trigger clearing of capabilities or IMA attributes
for any file he can look up e.g. by calling chown for that file. This is
unexpected and can lead to user DoSing a system.

Fix the problem by calling security_inode_killpriv() at the end of
inode_change_ok() instead of from notify_change(). At that moment we are
sure user has permissions to do the requested change.

References: CVE-2015-1350
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
6249033076 fuse: Propagate dentry down to inode_change_ok()
To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. Propagate it down to fuse_do_setattr().

Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
fd5472ed44 ceph: Propagate dentry down to inode_change_ok()
To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. ceph_setattr() has the dentry easily available but
__ceph_setattr() is also called from ceph_set_acl() where dentry is not
easily available. Luckily that call path does not need inode_change_ok()
to be called anyway. So reorganize functions a bit so that
inode_change_ok() is called only from paths where dentry is available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
69bca80744 xfs: Propagate dentry down to inode_change_ok()
To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. Propagate dentry down to functions calling inode_change_ok().
This is rather straightforward except for xfs_set_mode() function which
does not have dentry easily available. Luckily that function does not
call inode_change_ok() anyway so we just have to do a little dance with
function prototypes.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
073931017b posix_acl: Clear SGID bit when setting file permissions
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok().  Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2).  Fix that.

References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2016-09-22 10:55:32 +02:00
Jeff Mahoney
325c50e3ce btrfs: ensure that file descriptor used with subvol ioctls is a dir
If the subvol/snapshot create/destroy ioctls are passed a regular file
with execute permissions set, we'll eventually Oops while trying to do
inode->i_op->lookup via lookup_one_len.

This patch ensures that the file descriptor refers to a directory.

Fixes: cb8e70901d (Btrfs: Fix subvolume creation locking rules)
Fixes: 76dda93c6a (Btrfs: add snapshot/subvolume destroy ioctl)
Cc: <stable@vger.kernel.org> #v2.6.29+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21 17:22:16 -07:00
Josef Bacik
1e5ec2e709 Btrfs: handle quota reserve failure properly
btrfs/022 was spitting a warning for the case that we exceed the quota.  If we
fail to make our quota reservation we need to clean up our data space
reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21 17:22:16 -07:00
Chao Yu
e0d735c1cc gfs2: fix to detect failure of register_shrinker
register_shrinker can fail after commit 1d3d4437ea ("vmscan: per-node
deferred work"), we should detect the failure of it, otherwise we may
fail to register shrinker after gfs2 module was been inited successfully.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-09-21 12:09:40 -05:00
Martin Brandenburg
0c95ad7636 orangefs: bump minimum userspace version
OrangeFS 2.9.6 was released without support for the features op. Thus
OrangeFS 2.9.7 will be required to use it.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2016-09-21 12:37:23 -04:00
Christian Lamparter
86f0e06767 debugfs: introduce a public file_operations accessor
This patch introduces an accessor which can be used
by the users of debugfs (drivers, fs, ...) to get the
original file_operations struct. It also removes the
REAL_FOPS_DEREF macro in file.c and converts the code
to use the public version.

Previously, REAL_FOPS_DEREF was only available within
the file.c of debugfs. But having a public getter
available for debugfs users is important as some
drivers (carl9170 and b43) use the pointer of the
original file_operations in conjunction with container_of()
within their debugfs implementations.

Reviewed-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
Cc: stable <stable@vger.kernel.org> # 4.7+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-21 12:13:31 +02:00
Jiri Olsa
df04abfd18 fs/proc/kcore.c: Add bounce buffer for ktext data
We hit hardened usercopy feature check for kernel text access by reading
kcore file:

  usercopy: kernel memory exposure attempt detected from ffffffff8179a01f (<kernel text>) (4065 bytes)
  kernel BUG at mm/usercopy.c:75!

Bypassing this check for kcore by adding bounce buffer for ktext data.

Reported-by: Steve Best <sbest@redhat.com>
Fixes: f5509cc18d ("mm: Hardened usercopy")
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-20 13:32:49 -07:00
Jiri Olsa
f5beeb1851 fs/proc/kcore.c: Make bounce buffer global for read
Next patch adds bounce buffer for ktext area, so it's
convenient to have single bounce buffer for both
vmalloc/module and ktext cases.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-20 13:32:49 -07:00
Ingo Molnar
41a66072c3 Merge branch 'efi/urgent' into efi/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20 16:58:59 +02:00
Ingo Molnar
b2c16e1efd Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20 08:29:21 +02:00
Junxiao Bi
63b52c4936 Revert "ocfs2: bump up o2cb network protocol version"
This reverts commit 38b52efd21 ("ocfs2: bump up o2cb network protocol
version").

This commit made rolling upgrade fail.  When one node is upgraded to new
version with this commit, the remaining nodes will fail to establish
connections to it, then the application like VMs on the remaining nodes
can't be live migrated to the upgraded one.  This will cause an outage.
Since negotiate hb timeout behavior didn't change without this commit,
so revert it.

Fixes: 38b52efd21 ("ocfs2: bump up o2cb network protocol version")
Link: http://lkml.kernel.org/r/1471396924-10375-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Ashish Samant
d21c353d5e ocfs2: fix start offset to ocfs2_zero_range_for_truncate()
If we punch a hole on a reflink such that following conditions are met:

1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
   (hole range > MAX_CONTIG_BYTES(1MB)),

we dont COW the first cluster starting at the start offset.  But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out.  This will modify the
cluster in place and zero it in the source too.

Fix this by skipping this cluster in such a scenario.

To reproduce:

1. Create a random file of say 10 MB
     xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
2. Reflink  it
     reflink -f 10MBfile reflnktest
3. Punch a hole at starting at cluster boundary  with range greater that
1MB. You can also use a range that will put the end offset in another
extent.
     fallocate -p -o 0 -l 1048615 reflnktest
4. sync
5. Check the  first cluster in the source file. (It will be zeroed out).
    dd if=10MBfile iflag=direct bs=<cluster size> count=1 | hexdump -C

Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reported-by: Saar Maoz <saar.maoz@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Eric Ren <zren@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Joseph Qi
3bb8b653c8 ocfs2: fix double unlock in case retry after free truncate log
If ocfs2_reserve_cluster_bitmap_bits() fails with ENOSPC, it will try to
free truncate log and then retry.  Since ocfs2_try_to_free_truncate_log
will lock/unlock global bitmap inode, we have to unlock it before
calling this function.  But when retry reserve and it fails with no
global bitmap inode lock taken, it will unlock again in error handling
branch and BUG.

This issue also exists if no need retry and then ocfs2_inode_lock fails.
So fix it.

Fixes: 2070ad1aeb ("ocfs2: retry on ENOSPC if sufficient space in truncate log")
Link: http://lkml.kernel.org/r/57D91939.6030809@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Jan Kara
96d41019e3 fanotify: fix list corruption in fanotify_get_response()
fanotify_get_response() calls fsnotify_remove_event() when it finds that
group is being released from fanotify_release() (bypass_perm is set).

However the event it removes need not be only in the group's notification
queue but it can have already moved to access_list (userspace read the
event before closing the fanotify instance fd) which is protected by a
different lock.  Thus when fsnotify_remove_event() races with
fanotify_release() operating on access_list, the list can get corrupted.

Fix the problem by moving all the logic removing permission events from
the lists to one place - fanotify_release().

Fixes: 5838d4442b ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Jan Kara
12703dbfeb fsnotify: add a way to stop queueing events on group shutdown
Implement a function that can be called when a group is being shutdown
to stop queueing new events to the group.  Fanotify will use this.

Fixes: 5838d4442b ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Junxiao Bi
d5bf141893 ocfs2: fix trans extend while free cached blocks
The root cause of this issue is the same with the one fixed by the last
patch, but this time credits for allocator inode and group descriptor
may not be consumed before trans extend.

The following error was caught:

  WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 0 PID: 2037 Comm: rm Tainted: G        W       4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2]
    ocfs2_run_deallocs+0x70/0x270 [ocfs2]
    ocfs2_commit_truncate+0x474/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlinkat+0x22/0x40
    system_call_fastpath+0x12/0x71
  ---[ end trace a62437cb060baa71 ]---
  JBD2: rm wants too many credits (149 > 128)

Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Junxiao Bi
2b0ad0085a ocfs2: fix trans extend while flush truncate log
Every time, ocfs2_extend_trans() included a credit for truncate log
inode, but as that inode had been managed by jbd2 running transaction
first time, it will not consume that credit until
jbd2_journal_restart().

Since total credits to extend always included the un-consumed ones,
there will be more and more un-consumed credit, at last
jbd2_journal_restart() will fail due to credit number over the half of
max transction credit.

The following error was caught when unlinking a large file with many
extents:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 0 PID: 13626 Comm: unlink Tainted: G        W       4.1.12-37.6.3.el6uek.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_replay_truncate_records+0x93/0x360 [ocfs2]
    __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2]
    ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2]
    ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlink+0x16/0x20
    system_call_fastpath+0x12/0x71
  ---[ end trace 28aa7410e69369cf ]---
  JBD2: unlink wants too many credits (251 > 128)

Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Kirill A. Shutemov
31b4beb473 ipc/shm: fix crash if CONFIG_SHMEM is not set
Commit c01d5b3007 ("shmem: get_unmapped_area align huge page") makes
use of shm_get_unmapped_area() in shm_file_operations() unconditional to
CONFIG_MMU.

As Tony Battersby pointed this can lead NULL-pointer dereference on
machine with CONFIG_MMU=y and CONFIG_SHMEM=n.  In this case ipc/shm is
backed by ramfs which doesn't provide f_op->get_unmapped_area for
configurations with MMU.

The solution is to provide dummy f_op->get_unmapped_area for ramfs when
CONFIG_MMU=y, which just call current->mm->get_unmapped_area().

Fixes: c01d5b3007 ("shmem: get_unmapped_area align huge page")
Link: http://lkml.kernel.org/r/20160912102704.140442-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Tested-by: Tony Battersby <tonyb@cybernetics.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>	[4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Ian Kent
7cbdb4a286 autofs: use dentry flags to block walks during expire
Somewhere along the way the autofs expire operation has changed to hold
a spin lock over expired dentry selection.  The autofs indirect mount
expired dentry selection is complicated and quite lengthy so it isn't
appropriate to hold a spin lock over the operation.

Commit 47be61845c ("fs/dcache.c: avoid soft-lockup in dput()") added a
might_sleep() to dput() causing a WARN_ONCE() about this usage to be
issued.

But the spin lock doesn't need to be held over this check, the autofs
dentry info.  flags are enough to block walks into dentrys during the
expire.

I've left the direct mount expire as it is (for now) because it is much
simpler and quicker than the indirect mount expire and adding spin lock
release and re-aquires would do nothing more than add overhead.

Fixes: 47be61845c ("fs/dcache.c: avoid soft-lockup in dput()")
Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Reported-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Joseph Qi
e6f0c6e617 ocfs2/dlm: fix race between convert and migration
Commit ac7cf246df ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not.  This will introduce a race that right after
old master does umount ( means master will change), a new convert
request comes.

In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.

Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery.  So fix it.

Fixes: ac7cf246df ("ocfs2/dlm: fix race between convert and recovery")
Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:16 -07:00
Al Viro
5d3ddd84ea udf: don't bother with full-page write optimisations in adinicb case
... it would get converted to regular if such had been attempted

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-19 10:47:01 +02:00
Christoph Hellwig
25f4e70291 ext2: use iomap to implement DAX
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:30:29 +10:00
Christoph Hellwig
6750ad7198 ext2: stop passing buffer_head to ext2_get_blocks
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:28:39 +10:00
Christoph Hellwig
6c31f495d1 xfs: use iomap to implement DAX
Another users of buffer_heads bytes the dust.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:28:38 +10:00
Christoph Hellwig
e372843a40 xfs: refactor xfs_setfilesize
Rename the current function to __xfs_setfilesize and add a non-static
wrapper that also takes care of creating the transaction.  This new
helper will be used by the new iomap-based DAX path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:26:41 +10:00
Christoph Hellwig
66642c5c1d xfs: take the ilock shared if possible in xfs_file_iomap_begin
We always just read the extent first, and will later lock exlusively
after first dropping the lock in case we actually allocate blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:26:39 +10:00
Christoph Hellwig
17879e8f86 xfs: fix locking for DAX writes
So far DAX writes inherited the locking from direct I/O writes, but
the direct I/O model of using shared locks for writes is actually
wrong for DAX.  For direct I/O we're out of any standards and don't
have to provide the Posix required exclusion between writers, but
for DAX which gets transparently enable on applications without any
knowledge of it we can't simply drop the requirement.  Even worse
this only happens for aligned writes and thus doesn't show up for
many typical use cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:50 +10:00
Christoph Hellwig
a7d73fe6c5 dax: provide an iomap based fault handler
Very similar to the existing dax_fault function, but instead of using
the get_block callback we rely on the iomap_ops vector from iomap.c.
That also avoids having to do two calls into the file system for write
faults.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:50 +10:00
Christoph Hellwig
a254e56812 dax: provide an iomap based dax read/write path
This is a much simpler implementation of the DAX read/write path
that makes use of the iomap infrastructure.  It does not try to
mirror the direct I/O calling conventions and thus doesn't have to
deal with i_dio_count or the end_io handler, but instead leaves
locking and filesystem-specific I/O completion to the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:49 +10:00
Christoph Hellwig
b0d5e82fcf dax: don't pass buffer_head to copy_user_dax
This way we can use this helper for the iomap based DAX implementation
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:49 +10:00
Christoph Hellwig
1aaba0958e dax: don't pass buffer_head to dax_insert_mapping
This way we can use this helper for the iomap based DAX implementation
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:49 +10:00
Christoph Hellwig
befb503ca6 iomap: expose iomap_apply outside iomap.c
This allows the DAX code to use it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:49 +10:00
Christoph Hellwig
ecd50729f7 iomap: add IOMAP_F_NEW flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:24:37 +10:00
Christoph Hellwig
51446f5ba4 xfs: rewrite and optimize the delalloc write path
Currently xfs_iomap_write_delay does up to lookups in the inode
extent tree, which is rather costly especially with the new iomap
based write path and small write sizes.

But it turns out that the low-level xfs_bmap_search_extents gives us
all the information we need in the regular delalloc buffered write
path:

 - it will return us an extent covering the block we are looking up
   if it exists.  In that case we can simply return that extent to
   the caller and are done
 - it will tell us if we are beyoned the last current allocated
   block with an eof return parameter.  In that case we can create a
   delalloc reservation and use the also returned information about
   the last extent in the file as the hint to size our delalloc
   reservation.
 - it can tell us that we are writing into a hole, but that there is
   an extent beyoned this hole.  In this case we can create a
   delalloc reservation that covers the requested size (possible
   capped to the next existing allocation).

All that can be done in one single routine instead of bouncing up
and down a few layers.  This reduced the CPU overhead of the block
mapping routines and also simplified the code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:10:21 +10:00
Christoph Hellwig
85a6e764ff xfs: make xfs_inode_set_eofblocks_tag cheaper for the common case
For long growing file writes we will usually already have the
eofblocks tag set when adding more speculative preallocations.  Add
a flag in the inode to allow us to skip the the fairly expensive
AG-wide spinlocks and multiple radix tree operations in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:09:48 +10:00
Christoph Hellwig
f8e3a82575 xfs: factor our a helper to calculate the EOF alignment
And drop the pointless mp argument to xfs_iomap_eof_align_last_fsb,
while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:09:28 +10:00
Christoph Hellwig
e9c4973638 xfs: move xfs_bmbt_to_iomap up
We'll need it earlier in the file soon, so the unchanged function to
the top of xfs_iomap.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:09:12 +10:00
Darrick J. Wong
3fd129b63f xfs: set up per-AG free space reservations
One unfortunate quirk of the reference count and reverse mapping
btrees -- they can expand in size when blocks are written to *other*
allocation groups if, say, one large extent becomes a lot of tiny
extents.  Since we don't want to start throwing errors in the middle
of CoWing, we need to reserve some blocks to handle future expansion.
The transaction block reservation counters aren't sufficient here
because we have to have a reserve of blocks in every AG, not just
somewhere in the filesystem.

Therefore, create two per-AG block reservation pools.  One feeds the
AGFL so that rmapbt expansion always succeeds, and the other feeds all
other metadata so that refcountbt expansion never fails.

Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG.  Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.

In other words, play some accounting tricks in-core to make sure that
we always have blocks available.  On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:30:52 +10:00
Darrick J. Wong
385d655861 xfs: defer should allow ->finish_item to request a new transaction
When xfs_defer_finish calls ->finish_item, it's possible that
(refcount) won't be able to finish all the work in a single
transaction.  When this happens, the ->finish_item handler should
shorten the log done item's list count, update the work item to
reflect where work should continue, and return -EAGAIN so that
defer_finish knows to retain the pending item on the pending list,
roll the transaction, and restart processing where we left off.

Plumb in the code and document how this mechanism is supposed to work.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-19 10:26:25 +10:00
Darrick J. Wong
c611cc0360 xfs: count the blocks in a btree
Provide a helper method to count the number of blocks in a short form
btree.  The refcount and rmap btrees need to know the number of blocks
already in use to set up their per-AG block reservations during mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:25:20 +10:00
Darrick J. Wong
4ed3f68792 xfs: create a standard btree size calculator code
Create a helper to generate AG btree height calculator functions.
This will be used (much) later when we get to the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:25:03 +10:00
Darrick J. Wong
a1d46cffaf xfs: remove xfs_btree_bigkey
Remove the xfs_btree_bigkey mess and simply make xfs_btree_key big enough
to hold both keys in-core.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:24:36 +10:00
Darrick J. Wong
cd00158ce3 xfs: convert RUI log formats to use variable length arrays
Use variable length array declarations for RUI log items,
and replace the open coded sizeof formulae with a single function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:24:27 +10:00
Darrick J. Wong
e43c460dcd iomap: add a flag to report shared extents
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:13:02 +10:00
Christoph Hellwig
5f4e5752a8 fs: add iomap_file_dirty
Originally-From: Christoph Hellwig <hch@lst.de>

This function uses the iomap infrastructure to re-write all pages
in a given range.  This is useful for doing a copy-up of COW ranges,
and might be useful for scrubbing in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:12:45 +10:00
Linus Torvalds
4d2899d73c Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Small set of cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Move check for prefix path to within cifs_get_root()
  Compare prepaths when comparing superblocks
  Fix memory leaks in cifs_do_mount()
2016-09-16 17:09:48 -07:00
Mike Galbraith
420902c9d0 reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()
If we hold the superblock lock while calling reiserfs_quota_on_mount(), we can
deadlock our own worker - mount blocks kworker/3:2, sleeps forever more.

crash> ps|grep UN
    715      2   3  ffff880220734d30  UN   0.0       0      0  [kworker/3:2]
   9369   9341   2  ffff88021ffb7560  UN   1.3  493404 123184  Xorg
   9665   9664   3  ffff880225b92ab0  UN   0.0   47368    812  udisks-daemon
  10635  10403   3  ffff880222f22c70  UN   0.0   14904    936  mount
crash> bt ffff880220734d30
PID: 715    TASK: ffff880220734d30  CPU: 3   COMMAND: "kworker/3:2"
 #0 [ffff8802244c3c20] schedule at ffffffff8144584b
 #1 [ffff8802244c3cc8] __rt_mutex_slowlock at ffffffff814472b3
 #2 [ffff8802244c3d28] rt_mutex_slowlock at ffffffff814473f5
 #3 [ffff8802244c3dc8] reiserfs_write_lock at ffffffffa05f28fd [reiserfs]
 #4 [ffff8802244c3de8] flush_async_commits at ffffffffa05ec91d [reiserfs]
 #5 [ffff8802244c3e08] process_one_work at ffffffff81073726
 #6 [ffff8802244c3e68] worker_thread at ffffffff81073eba
 #7 [ffff8802244c3ec8] kthread at ffffffff810782e0
 #8 [ffff8802244c3f48] kernel_thread_helper at ffffffff81450064
crash> rd ffff8802244c3cc8 10
ffff8802244c3cc8:  ffffffff814472b3 ffff880222f23250   .rD.....P2."....
ffff8802244c3cd8:  0000000000000000 0000000000000286   ................
ffff8802244c3ce8:  ffff8802244c3d30 ffff880220734d80   0=L$.....Ms ....
ffff8802244c3cf8:  ffff880222e8f628 0000000000000000   (.."............
ffff8802244c3d08:  0000000000000000 0000000000000002   ................
crash> struct rt_mutex ffff880222e8f628
struct rt_mutex {
  wait_lock = {
    raw_lock = {
      slock = 65537
    }
  },
  wait_list = {
    node_list = {
      next = 0xffff8802244c3d48,
      prev = 0xffff8802244c3d48
    }
  },
  owner = 0xffff880222f22c71,
  save_state = 0
}
crash> bt 0xffff880222f22c70
PID: 10635  TASK: ffff880222f22c70  CPU: 3   COMMAND: "mount"
 #0 [ffff8802216a9868] schedule at ffffffff8144584b
 #1 [ffff8802216a9910] schedule_timeout at ffffffff81446865
 #2 [ffff8802216a99a0] wait_for_common at ffffffff81445f74
 #3 [ffff8802216a9a30] flush_work at ffffffff810712d3
 #4 [ffff8802216a9ab0] schedule_on_each_cpu at ffffffff81074463
 #5 [ffff8802216a9ae0] invalidate_bdev at ffffffff81178aba
 #6 [ffff8802216a9af0] vfs_load_quota_inode at ffffffff811a3632
 #7 [ffff8802216a9b50] dquot_quota_on_mount at ffffffff811a375c
 #8 [ffff8802216a9b80] finish_unfinished at ffffffffa05dd8b0 [reiserfs]
 #9 [ffff8802216a9cc0] reiserfs_fill_super at ffffffffa05de825 [reiserfs]
    RIP: 00007f7b9303997a  RSP: 00007ffff443c7a8  RFLAGS: 00010202
    RAX: 00000000000000a5  RBX: ffffffff8144ef12  RCX: 00007f7b932e9ee0
    RDX: 00007f7b93d9a400  RSI: 00007f7b93d9a3e0  RDI: 00007f7b93d9a3c0
    RBP: 00007f7b93d9a2c0   R8: 00007f7b93d9a550   R9: 0000000000000001
    R10: ffffffffc0ed040e  R11: 0000000000000202  R12: 000000000000040e
    R13: 0000000000000000  R14: 00000000c0ed040e  R15: 00007ffff443ca20
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mike Galbraith <mgalbraith@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-16 17:20:59 +02:00
Phil Turnbull
42857cf512 configfs: Return -EFBIG from configfs_write_bin_file.
The check for writing more than cb_max_size bytes does not 'goto out' so
it is a no-op which allows users to vmalloc an arbitrary amount.

Fixes: 03607ace80 ("configfs: implement binary attributes")
Cc: stable@kernel.org
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-09-16 12:58:28 +02:00
Miklos Szeredi
814184fd40 vfat: don't use ->d_time
Use d_fsdata instead, which is the same size.  Introduce helpers to hide
the typecasts.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
a00be0e31f cifs: don't use ->d_time
Use d_fsdata instead, which is the same size.  Introduce helpers to hide
the typecasts.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Steve French <sfrench@samba.org>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
beaf226b86 posix_acl: don't ignore return value of posix_acl_create_masq()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
280db3c88c f2fs: use filemap_check_errors()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
f031221001 btrfs: use filemap_check_errors()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Chris Mason <clm@fb.com>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
4d0c5ba2ff vfs: do get_write_access() on upper layer of overlayfs
The problem with writecount is: we want consistent handling of it for
underlying filesystems as well as overlayfs.  Making sure i_writecount is
correct on all layers is difficult.  Instead this patch makes sure that
when write access is acquired, it's always done on the underlying writable
layer (called the upper layer).  We must also make sure to look at the
writecount on this layer when checking for conflicting leases.

Open for write already updates the upper layer's writecount.  Leaving only
truncate.

For truncate copy up must happen before get_write_access() so that the
writecount is updated on the upper layer.  Problem with this is if
something fails after that, then copy-up was done needlessly.  E.g. if
break_lease() was interrupted.  Probably not a big deal in practice.

Another interesting case is if there's a denywrite on a lower file that is
then opened for write or truncated.  With this patch these will succeed,
which is somewhat counterintuitive.  But I think it's still acceptable,
considering that the copy-up does actually create a different file, so the
old, denywrite mapping won't be touched.

On non-overlayfs d_real() is an identity function and d_real_inode() is
equivalent to d_inode() so this patch doesn't change behavior in that case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
2016-09-16 12:44:21 +02:00
Miklos Szeredi
c568d68341 locks: fix file locking on overlayfs
This patch allows flock, posix locks, ofd locks and leases to work
correctly on overlayfs.

Instead of using the underlying inode for storing lock context use the
overlay inode.  This allows locks to be persistent across copy-up.

This is done by introducing locks_inode() helper and using it instead of
file_inode() to get the inode in locking code.  For non-overlayfs the two
are equivalent, except for an extra pointer dereference in locks_inode().

Since lock operations are in "struct file_operations" we must also make
sure not to call underlying filesystem's lock operations.  Introcude a
super block flag MS_NOREMOTELOCK to this effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
2016-09-16 12:44:20 +02:00
Miklos Szeredi
598e3c8f72 vfs: update ovl inode before relatime check
On overlayfs relatime_need_update() needs inode times to be correct on
overlay inode.  But i_mtime and i_ctime are updated by filesystem code on
underlying inode only, so they will be out-of-date on the overlay inode.

This patch copies the times from the underlying inode if needed.  This
can't be done if called from RCU lookup (link following) but link m/ctime
are not updated by fs, so this is all right.

This patch doesn't change functionality for anything but overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-16 12:44:20 +02:00
Miklos Szeredi
f2b20f6ee8 vfs: move permission checking into notify_change() for utimes(NULL)
This fixes a bug where the permission was not properly checked in
overlayfs.  The testcase is ltp/utimensat01.

It is also cleaner and safer to do the permission checking in the vfs
helper instead of the caller.

This patch introduces an additional ia_valid flag ATTR_TOUCH (since
touch(1) is the most obvious user of utimes(NULL)) that is passed into
notify_change whenever the conditions for this special permission checking
mode are met.

Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Aihua Zhang <zhangaihua1@huawei.com>
Cc: <stable@vger.kernel.org> # v3.18+
2016-09-16 12:44:20 +02:00
Jann Horn
22f6b4d34f aio: mark AIO pseudo-fs noexec
This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set.  Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.

I have tested the patch on my machine.

To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
            (int)getpid());
        system(cmd);
        return 0;
    }

In the output, "rw-s" is good, "rwxs" is bad.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-15 15:49:28 -07:00
Eric Biggers
ef1eb3aa50 fscrypto: make filename crypto functions return 0 on success
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure.  However, the output
length was redundant with the value written to 'oname->len'.  It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.

Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead.  For consistency also make other callers check for
a nonzero result rather than a negative result.

This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-15 17:25:55 -04:00
Eric Biggers
53fd7550ec fscrypto: rename completion callbacks to reflect usage
fscrypt_complete() was used only for data pages, not for all
encryption/decryption.  Rename it to page_crypt_complete().

dir_crypt_complete() was used for filename encryption/decryption for
both directory entries and symbolic links.  Rename it to
fname_crypt_complete().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 16:51:01 -04:00
Jaegeuk Kim
5905f9afa2 f2fs: handle error in recover_orphan_inode
This patch enhances the error path in recover_orphan_inode.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-15 13:50:24 -07:00
Eric Biggers
d83ae730b6 fscrypto: remove unnecessary includes
This patch removes some #includes that are clearly not needed, such as a
reference to ecryptfs, which is unrelated to the new filesystem
encryption code.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 16:41:09 -04:00
Darrick J. Wong
b71dbf1032 vfs: cap dedupe request structure size at PAGE_SIZE
Kirill A Shutemov reports that the kernel doesn't try to cap dest_count
in any way, and uses the number to allocate kernel memory.  This causes
high order allocation warnings in the kernel log if someone passes in a
big enough value.  We should clamp the allocation at PAGE_SIZE to avoid
stressing the VM.

The two existing users of the dedupe ioctl never send more than 120
requests, so we can safely clamp dest_range at PAGE_SIZE, because with
4k pages we can handle up to 127 dedupe candidates.  Given the max
extent length of 16MB, we can end up doing 2GB of IO which is plenty.

[ Note: the "offsetof()" can't overflow, because 'count' is just a
  16-bit integer.  That's not obvious in the limited context of the
  patch, so I'm noting it here because it made me go look.  - Linus ]

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-15 13:29:52 -07:00
Darrick J. Wong
5297e0f0fe vfs: fix return type of ioctl_file_dedupe_range
All the VFS functions in the dedupe ioctl path return int status, so
the ioctl handler ought to as well.

Found by Coverity, CID 1350952.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-15 13:29:52 -07:00
Eric Biggers
8f39850dff fscrypto: improved validation when loading inode encryption metadata
- Validate fscrypt_context.format and fscrypt_context.flags.  If
  unrecognized values are set, then the kernel may not know how to
  interpret the encrypted file, so it should fail the operation.

- Validate that AES_256_XTS is used for contents and that AES_256_CTS is
  used for filenames.  It was previously possible for the kernel to
  accept these reversed, though it would have taken manual editing of
  the block device.  This was not intended.

- Fail cleanly rather than BUG()-ing if a file has an unexpected type.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 13:32:11 -04:00
Eric Biggers
dcce7a46c6 ext4: fix memory leak when symlink decryption fails
This bug was introduced in v4.8-rc1.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-09-15 13:13:13 -04:00
Geliang Tang
f0c9fd5458 jbd2: move more common code into journal_init_common()
There are some repetitive code in jbd2_journal_init_dev() and
jbd2_journal_init_inode(). So this patch moves the common code into
journal_init_common() helper to simplify the code. And fix the coding
style warnings reported by checkpatch.pl by the way.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-09-15 12:02:32 -04:00
Fabian Frederick
be32197cd6 ext4: remove unused definition for MAX_32_NUM
MAX_32_NUM isn't used in ext4

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:58:47 -04:00
Fabian Frederick
518eaa6387 ext4: create EXT4_MAX_BLOCKS() macro
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:55:01 -04:00
Fabian Frederick
c3fe493ccd ext4: remove unneeded test in ext4_alloc_file_blocks()
ext4_alloc_file_blocks() is called from ext4_zero_range() and
ext4_fallocate() both already testing EXT4_INODE_EXTENTS
We can call ext_depth(inode) unconditionnally.

[ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
  called for a indirect-mapped inode in the future.  -- tytso ]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:52:07 -04:00
Fabian Frederick
edf15aa180 ext4: fix memory leak in ext4_insert_range()
Running xfstests generic/013 with kmemleak gives the following:

unreferenced object 0xffff8801d3d27de0 (size 96):
  comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
    [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
    [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
    [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
    [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
    [<ffffffff81181334>] vfs_fallocate+0x134/0x210
    [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
    [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
    [<ffffffffffffffff>] 0xffffffffffffffff

Problem seems mitigated by dropping refs and freeing path
when there's no path[depth].p_ext

Cc: stable@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:39:52 -04:00
wangguang
4e800c0359 ext4: bugfix for mmaped pages in mpage_release_unused_pages()
Pages clear buffers after ext4 delayed block allocation failed,
However, it does not clean its pte_dirty flag.
if the pages unmap ,in cording to the pte_dirty ,
unmap_page_range may try to call __set_page_dirty,

which may lead to the bugon at 
mpage_prepare_extent_to_map:head = page_buffers(page);.

This patch just call clear_page_dirty_for_io to clean pte_dirty 
at mpage_release_unused_pages for pages mmaped. 

Steps to reproduce the bug:

(1) mmap a file in ext4
	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
	       	            fd, 0);
	memset(addr, 'i', 4096);

(2) return EIO at 

	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent 

which causes this log message to be print:

                ext4_msg(sb, KERN_CRIT,
                        "Delayed block allocation failed for "
                        "inode %lu at logical offset %llu with"
                        " max blocks %u with error %d",
                        inode->i_ino,
                        (unsigned long long)map->m_lblk,
                        (unsigned)map->m_len, -err);

(3)Unmap the addr cause warning at

	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));

(4) wait for a minute,then bugon happen.

Cc: stable@vger.kernel.org
Signed-off-by: wangguang <wangguang03@zte.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:32:46 -04:00
Ingo Molnar
d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Tiezhu Yang
49ed09dd85 f2fs: remove dead code f2fs_check_acl
The macro f2fs_check_acl is defined but never used since
the initial commit, this patch removes the code that has
been dead for several years.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-14 16:52:36 -07:00
Fan Li
d95fd91c1a f2fs: exclude special cases for f2fs_move_file_range
When src and dst is the same file, and the latter part of source region
overlaps with the former part of destination region, current implement
will overwrite data which hasn't been moved yet and truncate data in
overlapped region.
This patch return -EINVAL when such cases occur and return 0 when
source region and destination region is actually the same part of
the same file.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-14 16:52:06 -07:00
Dmitry Safonov
90954e7b94 x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
Killed PR_REG_SIZE and PR_REG_PTR macro as we can get regset size
from regset view.
I wish I could also kill PRSTATUS_SIZE nicely.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-5-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-14 21:28:10 +02:00
Bart Van Assche
4382e33ad3 block, dm-crypt, btrfs: Introduce bio_flags()
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:27 -06:00
Linus Walleij
a441b0d093 block: remove remnant refs to hardsect
commit e1defc4ff0
"block: Do away with the notion of hardsect_size"
removed the notion of "hardware sector size" from
the kernel in favor of logical block size, but
references remain in comments and documentation.

Update the remaining sites mentioning hardsect.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:44:57 -06:00
Christoph Hellwig
2237570168 block_dev: remove DAX leftovers
DAX support for block devices was removed in commits 03cdad
("block: disable block device DAX by default") and 99a01cd
("block: remove BLK_DEV_DAX config option"), but we still kept a call to
dax_do_io and some uneeded i_flags manipulations introduced in commit
bbab37 ("block: Add support for DAX reads/writes to block devices").

Remove those leftovers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:41:59 -06:00
Eric Sandeen
7716981273 xfs: normalize "infinite" retries in error configs
As it stands today, the "fail immediately" vs. "retry forever"
values for max_retries and retry_timeout_seconds in the xfs metadata
error configurations are not consistent.

A retry_timeout_seconds of 0 means "retry forever," but a
max_retries of 0 means "fail immediately."

retry_timeout_seconds < 0 is disallowed, while max_retries == -1
means "retry forever."

Make this consistent across the error configs, such that a value of
0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
and a value of -1 always means "retry forever."

This makes retry_timeout a signed long to accommodate the -1, even
though it stores jiffies.  Given our limit of a 1 day maximum
timeout, this should be sufficient even at much higher HZ values
than we have available today.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14 07:51:30 +10:00
Xie XiuQi
79c350e45e xfs: fix signed integer overflow
Use 1U for unsigned int to avoid a overflow warning from UBSAN.

[   31.910858] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:889:25
[   31.911252] signed integer overflow:
[   31.911478] -2147483648 - 1 cannot be represented in type 'int'
[   31.911846] CPU: 1 PID: 1011 Comm: tuned Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
[   31.911857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
[   31.911866]  1ffff1004069cd3b 0000000076bec3fd ffff8802034e69a0 ffffffff81ee3140
[   31.911883]  ffff8802034e69b8 ffffffff81ee31fd ffffffffa0ad79e0 ffff8802034e6b20
[   31.911898]  ffffffff81ee46e2 0000002d515470c0 0000000000000001 0000000041b58ab3
[   31.911913] Call Trace:
[   31.911932]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
[   31.911947]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
[   31.911964]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
[   31.912083]  [<ffffffff81ee4798>] __ubsan_handle_sub_overflow+0x2a/0x31
[   31.912204]  [<ffffffffa08676fb>] xfs_buf_item_log+0x34b/0x3f0 [xfs]
[   31.912314]  [<ffffffffa0880490>] xfs_trans_log_buf+0x120/0x260 [xfs]
[   31.912402]  [<ffffffffa079a890>] xfs_btree_log_recs+0x80/0xc0 [xfs]
[   31.912490]  [<ffffffffa07a29f8>] xfs_btree_delrec+0x11a8/0x2d50 [xfs]
[   31.913589]  [<ffffffffa07a86f9>] xfs_btree_delete+0xc9/0x260 [xfs]
[   31.913762]  [<ffffffffa075b5cf>] xfs_free_ag_extent+0x63f/0xe20 [xfs]
[   31.914339]  [<ffffffffa075ec0f>] xfs_free_extent+0x2af/0x3e0 [xfs]
[   31.914641]  [<ffffffffa0801b2b>] xfs_bmap_finish+0x32b/0x4b0 [xfs]
[   31.914841]  [<ffffffffa083c2e7>] xfs_itruncate_extents+0x3b7/0x740 [xfs]
[   31.915216]  [<ffffffffa08342fa>] xfs_setattr_size+0x60a/0x860 [xfs]
[   31.915471]  [<ffffffffa08345ea>] xfs_vn_setattr+0x9a/0xe0 [xfs]
[   31.915590]  [<ffffffff8149ad38>] notify_change+0x5c8/0x8a0
[   31.915607]  [<ffffffff81450f22>] do_truncate+0x122/0x1d0
[   31.915640]  [<ffffffff8147beee>] do_last+0x15de/0x2c80
[   31.915707]  [<ffffffff8147d777>] path_openat+0x1e7/0xcc0
[   31.915802]  [<ffffffff81480824>] do_filp_open+0xa4/0x160
[   31.915848]  [<ffffffff81453127>] do_sys_open+0x1b7/0x3f0
[   31.915879]  [<ffffffff81453392>] SyS_open+0x32/0x40
[   31.915897]  [<ffffffff81f08989>] system_call_fastpath+0x16/0x1b

[  240.086809] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:866:34
[  240.086820] signed integer overflow:
[  240.086830] -2147483648 - 1 cannot be represented in type 'int'
[  240.086846] CPU: 1 PID: 12969 Comm: rm Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
[  240.086857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
[  240.086868]  1ffff10040491def 00000000e2ea59c1 ffff88020248ef40 ffffffff81ee3140
[  240.086885]  ffff88020248ef58 ffffffff81ee31fd ffffffffa0ad79e0 ffff88020248f0c0
[  240.086901]  ffffffff81ee46e2 0000002d02488000 0000000000000001 0000000041b58ab3
[  240.086915] Call Trace:
[  240.086938]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
[  240.086953]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
[  240.086971]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
...

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14 07:41:16 +10:00
Artem Savkov
791cc43b36 Make __xfs_xattr_put_listen preperly report errors.
Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent"
changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is
insufficient space in the buffer assuming that setting context->count to -1
would be enough, but all of the ->put_listent callers only check seen_enough.
This results in a failed assertion:
XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
in insufficient buffer size case.

This is only reproducible with at least 2 xattrs and only when the buffer
gets depleted before the last one.

Furthermore if buffersize is such that it is enough to hold the last xattr's
name, but not enough to hold the sum of preceeding xattr names listxattr won't
fail with ERANGE, but will suceed returning last xattr's name without the
first character. The first character end's up overwriting data stored at
(context->alist - 1).

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14 07:40:35 +10:00
Eryu Guan
a27f6ef4e6 xfs: undo block reservation correctly in xfs_trans_reserve()
"blocks" should be added back to fdblocks at undo time, not taken
away, i.e. the minus sign should not be used.

This is a regression introduced by commit 0d485ada40 ("xfs: use
generic percpu counters for free block counter"). And it's found by
code inspection, I didn't it in real world, so there's no
reproducer.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14 07:39:07 +10:00
Jaegeuk Kim
649d7df29c f2fs: fix to set PageUptodate in f2fs_write_end correctly
Previously, f2fs_write_begin sets PageUptodate all the time. But, when user
tries to update the entire page (i.e., len == PAGE_SIZE), we need to consider
that the page is able to be copied partially afterwards. In such the case,
we will lose the remaing region in the page.

This patch fixes this by setting PageUptodate in f2fs_write_end as given copied
result. In the short copy case, it returns zero to let generic_perform_write
retry copying user data again.

As a result, f2fs_write_end() works:
   PageUptodate      len      copied    return   retry
1. no                4096     4096      4096     false  -> return 4096
2. no                4096     1024      0        true   -> goto #1 case
3. yes               2048     2048      2048     false  -> return 2048
4. yes               2048     1024      1024     false  -> return 1024

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-13 13:02:34 -07:00
Fan Li
61e4da1172 f2fs: fix parameters of __exchange_data_block
__exchange_data_block should take block indexes as parameters
instead of offsets in bytes.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-13 13:02:33 -07:00
Jaegeuk Kim
e8ea9b3d7e f2fs: avoid ENOMEM during roll-forward recovery
This patch gives another chances during roll-forward recovery regarding to
-ENOMEM.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-13 13:02:29 -07:00
David S. Miller
b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Linus Torvalds
2c937eb4dd NFS client bugfixes for 4.8
Highlights include:
 
 Stable patches:
 - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state
   accounting
 - Fix the CREATE_SESSION slot number
 
 Bugfixes:
 - sunrpc: fix a UDP memory accounting regression
 - NFS: Fix an error reporting regression in nfs_file_write()
 - pNFS: Fix further layout stateid issues
 - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
 - RPC/rdma: Fix receive buffer accounting
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX1wEwAAoJEGcL54qWCgDysPMP/iEgzv6Peky9DVYG35btxZXC
 QQxZDfvOa3Xxe9cH0JwfyisaDHw2gO5RQqFFCCxA/x0dZsf2s3Nrjt6C9yH8q7qF
 i8c1OQ8oEBMgM+BsByCQniUubSaAvs2jVVpAs7G+eOYPSqxFKzsHJwDqqRp4aZrW
 YDohIumsHFoKl1GYCx9jv44wtmQQJjgIJ0Uq8SJvMkSzzRaGgVIeCbfpRgtqVD3g
 mU8k3XV0C+fnLgtwtlG1dkqbnuNSp1gT72f8joId+SJjtnGgjxqi0eIn48vY5k4N
 SJ5+4N6Uko87k9uQ2zn1UTR2Jrltn7mtMI7RHJVuiLnbZjAsf0lfOIF3sgItAwhS
 G0F/EHzMbt3+vs4P9EsGJgTcViVplgJeXw0hQIqXbJN0IwsXG0/UYGuPUFxtMOHQ
 +ko8BYJaNWcQCVdkFc5rVyt/tM6rKDahLlA3sIn3bCGssL67CYgkfNsBIoOEmjp9
 u4XTYwJYD2hXMpskc8W623voQ2/VDbbWB6bphmZH9EeOvlzRB5TW5OvEB0VE805+
 WYZal32LNnaUE4rpUtr78rYEvzPqn7tb9+OglP/tYa1QB3A0nwC9f74CDQ6s08oR
 K00fVXu9yffkBty8Cm0e4HpUcjT+95BMVdJUJU3lhbUbu+eq74L/32OSjuGmdRWf
 c4S6sHfgCeX6uJPCb2rD
 =j4kB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
     state accounting
   - Fix the CREATE_SESSION slot number

  Bugfixes:
   - sunrpc: fix a UDP memory accounting regression
   - NFS: Fix an error reporting regression in nfs_file_write()
   - pNFS: Fix further layout stateid issues
   - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer
     exhaustion...")
   - RPC/rdma: Fix receive buffer accounting"

* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Fix the CREATE_SESSION slot number accounting
  xprtrdma: Fix receive buffer accounting
  xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
  pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
  pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
  pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
  pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
  NFS: Fix error reporting in nfs_file_write()
  sunrpc: fix UDP memory accounting
2016-09-12 14:13:45 -07:00
Jaegeuk Kim
f4702d61eb f2fs: add common iget in add_fsync_inode
There is no functional change.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12 13:55:11 -07:00
Jaegeuk Kim
7f3037a5ec f2fs: check free_sections for defragmentation
Fix wrong condition check for defragmentation of a file.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12 10:30:41 -07:00
Yunlei He
ed214a1183 f2fs: forbid to do fstrim if fs has some error
This patch skip fstrim if sbi set SBI_NEED_FSCK flag

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12 10:30:40 -07:00
Jaegeuk Kim
34b5d5c22d f2fs: avoid page allocation for truncating partial inline_data
When truncating cached inline_data, we don't need to allocate a new page
all the time. Instead, it must check its page cache only.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12 10:30:39 -07:00
Trond Myklebust
b519d408ea NFSv4.1: Fix the CREATE_SESSION slot number accounting
Ensure that we conform to the algorithm described in RFC5661, section
18.36.4 for when to bump the sequence id. In essence we do it for all
cases except when the RPC call timed out, or in case of the server returning
NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
2016-09-11 14:56:44 -04:00
Linus Torvalds
98ac9a608d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "nvdimm fixes for v4.8, two of them are tagged for -stable:

   - Fix devm_memremap_pages() to use track_pfn_insert().  Otherwise,
     DAX pmd mappings end up with an uncached pgprot, and unusable
     performance for the device-dax interface.  The device-dax interface
     appeared in 4.7 so this is tagged for -stable.

   - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
     understand DAX pmd entries.  This fix is tagged for -stable.

   - Fix a mis-merge of the nfit machine-check handler to flip the
     polarity of an if() to match the final version of the patch that
     Vishal sent for 4.8-rc1.  Without this the nfit machine check
     handler never detects / inserts new 'badblocks' entries which
     applications use to identify lost portions of files.

   - For test purposes, fix the nvdimm_clear_poison() path to operate on
     legacy / simulated nvdimm memory ranges.  Without this fix a test
     can set badblocks, but never clear them on these ranges.

   - Fix the range checking done by dax_dev_pmd_fault().  This is not
     tagged for -stable since this problem is mitigated by specifying
     aligned resources at device-dax setup time.

  These patches have appeared in a next release over the past week.  The
  recent rebase you can see in the timestamps was to drop an invalid fix
  as identified by the updated device-dax unit tests [1].  The -mm
  touches have an ack from Andrew"

[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
   https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: allow legacy (e820) pmem region to clear bad blocks
  nfit, mce: Fix SPA matching logic in MCE handler
  mm: fix cache mode of dax pmd mappings
  mm: fix show_smap() for zone_device-pmd ranges
  dax: fix mapping size check
2016-09-10 09:58:52 -07:00
Linus Torvalds
6905732c80 Fix some brown-paper-bag bugs for fscrypto, including one one which
allows a malicious user to set an encryption policy on an empty
 directory which they do not own.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJX05q4AAoJEPL5WVaVDYGjOywH/AyXoo4d1/5H/XTakNYPxYIW
 vtBOXciHai4ZE9RygL3gdZuiyY9bTx2sc80So3KboNUdiuOJBPnuAkOQMr973UCI
 yGW3eP/RYGA1XQUbtOyFvzJMIZLKXV2ytakFeRz+m1CQF2F5F7/prKQB2j4sWHff
 JigAC67LlZSiz7L8SqtPG4uG1C8K/YEorf14dG6k37fMwE/AaBYXxkyc7MmHIKeW
 Tils0ZZcTK0U0udNSel/jRSS/qENEuLvKhFsMAlIDrCETVMidCvv2OAcT0z0z5Ln
 v+Oq0Xfutd12nfb95LUfROMtTzrtILYC2qNfDChOoFtlU8UyKmY+WT1GfYUiy8g=
 =ahmA
 -----END PGP SIGNATURE-----

Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull fscrypto fixes fromTed Ts'o:
 "Fix some brown-paper-bag bugs for fscrypto, including one one which
  allows a malicious user to set an encryption policy on an empty
  directory which they do not own"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fscrypto: require write access to mount to set encryption policy
  fscrypto: only allow setting encryption policy on directories
  fscrypto: add authorization check for setting encryption policy
2016-09-10 09:18:33 -07:00
Eric Biggers
ba63f23d69 fscrypto: require write access to mount to set encryption policy
Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem.  This was handled correctly by f2fs but not by ext4.  Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-10 01:18:57 -04:00
Sachin Prabhu
348c1bfa84 Move check for prefix path to within cifs_get_root()
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09 23:58:07 -05:00
Sachin Prabhu
c1d8b24d18 Compare prepaths when comparing superblocks
The patch
fs/cifs: make share unaccessible at root level mountable
makes use of prepaths when any component of the underlying path is
inaccessible.

When mounting 2 separate shares having different prepaths but are other
wise similar in other respects, we end up sharing superblocks when we
shouldn't be doing so.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09 23:58:06 -05:00
Sachin Prabhu
4214ebf465 Fix memory leaks in cifs_do_mount()
Fix memory leaks introduced by the patch
fs/cifs: make share unaccessible at root level mountable

Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09 23:58:06 -05:00
Eric Biggers
002ced4be6 fscrypto: only allow setting encryption policy on directories
The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption
policy on nondirectory files.  This was unintentional, and in the case
of nonempty regular files did not behave as expected because existing
data was not actually encrypted by the ioctl.

In the case of ext4, the user could also trigger filesystem errors in
->empty_dir(), e.g. due to mismatched "directory" checksums when the
kernel incorrectly tried to interpret a regular file as a directory.

This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with
kernels v4.6 and later.  It appears that older kernels only permitted
directories and that the check was accidentally lost during the
refactoring to share the file encryption code between ext4 and f2fs.

This patch restores the !S_ISDIR() check that was present in older
kernels.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-09 23:38:12 -04:00
Eric Biggers
163ae1c6ad fscrypto: add authorization check for setting encryption policy
On an ext4 or f2fs filesystem with file encryption supported, a user
could set an encryption policy on any empty directory(*) to which they
had readonly access.  This is obviously problematic, since such a
directory might be owned by another user and the new encryption policy
would prevent that other user from creating files in their own directory
(for example).

Fix this by requiring inode_owner_or_capable() permission to set an
encryption policy.  This means that either the caller must own the file,
or the caller must have the capability CAP_FOWNER.

(*) Or also on any regular file, for f2fs v4.6 and later and ext4
    v4.8-rc1 and later; a separate bug fix is coming for that.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-09 23:37:14 -04:00
Dan Williams
ca120cf688 mm: fix show_smap() for zone_device-pmd ranges
Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:

 kernel BUG at mm/huge_memory.c:1105!
 task: ffff88045f16b140 task.stack: ffff88045be14000
 RIP: 0010:[<ffffffff81268f9b>]  [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340
 [..]
 Call Trace:
  [<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0
  [<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0
  [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
  [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
  [<ffffffff81307656>] show_smap+0xa6/0x2b0

 kernel BUG at fs/proc/task_mmu.c:585!
 RIP: 0010:[<ffffffff81306469>]  [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0
 Call Trace:
  [<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0
  [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
  [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
  [<ffffffff81307696>] show_smap+0xa6/0x2b0

These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09 17:34:45 -07:00
Linus Torvalds
6dc728ccd3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
 "This fixes a deadlock when fuse, direct I/O and loop device are
  combined"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: direct-io: don't dirty ITER_BVEC pages
2016-09-09 13:00:41 -07:00
Linus Torvalds
5c44ad6a35 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fix from Miklos Szeredi:
 "This fixes a regression caused by the last pull request"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix workdir creation
2016-09-09 12:56:28 -07:00
Linus Torvalds
f4a9c169c2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I'm not proud of how long it took me to track down that one liner in
  btrfs_sync_log(), but the good news is the patches I was trying to
  blame for these problems were actually fine (sorry Filipe)"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
  btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
  btrfs: do not decrease bytes_may_use when replaying extents
2016-09-09 12:52:31 -07:00
Matt Fleming
22c2b77f41 fs/efivarfs: Fix double kfree() in error path
Julia reported that we may double free 'name' in efivarfs_callback(),
and that this bug was introduced by commit 0d22f33bc37c ("efi: Don't
use spinlocks for efi vars").

Move one of the kfree()s until after the point at which we know we are
definitely on the success path.

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Sylvain Chouleur <sylvain.chouleur@gmail.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:48 +01:00
Sylvain Chouleur
21b3ddd39f efi: Don't use spinlocks for efi vars
All efivars operations are protected by a spinlock which prevents
interruptions and preemption. This is too restricted, we just need a
lock preventing concurrency.
The idea is to use a semaphore of count 1 and to have two ways of
locking, depending on the context:
- In interrupt context, we call down_trylock(), if it fails we return
  an error
- In normal context, we call down_interruptible()

We don't use a mutex here because the mutex_trylock() function must not
be called from interrupt context, whereas the down_trylock() can.

Signed-off-by: Sylvain Chouleur <sylvain.chouleur@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sylvain Chouleur <sylvain.chouleur@gmail.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:42 +01:00
Geliang Tang
f88baf68eb ramoops: move spin_lock_init after kmalloc error checking
If cxt->pstore.buf allocated failed, no need to initialize
cxt->pstore.buf_lock. So this patch moves spin_lock_init() after the
error checking.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08 15:01:13 -07:00
Andrew Bresticker
d771fdf941 pstore/ram: Use memcpy_fromio() to save old buffer
The ramoops buffer may be mapped as either I/O memory or uncached
memory.  On ARM64, this results in a device-type (strongly-ordered)
mapping.  Since unnaligned accesses to device-type memory will
generate an alignment fault (regardless of whether or not strict
alignment checking is enabled), it is not safe to use memcpy().
memcpy_fromio() is guaranteed to only use aligned accesses, so use
that instead.

Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Signed-off-by: Enric Balletbo Serra <enric.balletbo@collabora.com>
Reviewed-by: Puneet Kumar <puneetster@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2016-09-08 15:01:12 -07:00
Furquan Shaikh
7e75678d23 pstore/ram: Use memcpy_toio instead of memcpy
persistent_ram_update uses vmap / iomap based on whether the buffer is in
memory region or reserved region. However, both map it as non-cacheable
memory. For armv8 specifically, non-cacheable mapping requests use a
memory type that has to be accessed aligned to the request size. memcpy()
doesn't guarantee that.

Signed-off-by: Furquan Shaikh <furquan@google.com>
Signed-off-by: Enric Balletbo Serra <enric.balletbo@collabora.com>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Olof Johansson <olofj@chromium.org>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2016-09-08 15:01:11 -07:00
Mark Salyzyn
5bf6d1b927 pstore/pmsg: drop bounce buffer
Removing a bounce buffer copy operation in the pmsg driver path is
always better. We also gain in overall performance by not requesting
a vmalloc on every write as this can cause precious RT tasks, such
as user facing media operation, to stall while memory is being
reclaimed. Added a write_buf_user to the pstore functions, a backup
platform write_buf_user that uses the small buffer that is part of
the instance, and implemented a ramoops write_buf_user that only
supports PSTORE_TYPE_PMSG.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08 15:01:10 -07:00
Namhyung Kim
79d955af71 pstore/ram: Set pstore flags dynamically
The ramoops can be configured to enable each pstore type by setting
their size.  In that case, it'd be better not to register disabled types
in the first place.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08 15:01:09 -07:00
Namhyung Kim
c950fd6f20 pstore: Split pstore fragile flags
This patch adds new PSTORE_FLAGS for each pstore type so that they can
be enabled separately.  This is a preparation for ongoing virtio-pstore
work to support those types flexibly.

The PSTORE_FLAGS_FRAGILE is changed to PSTORE_FLAGS_DMESG to preserve the
original behavior.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: linux-acpi@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
[kees: retained "FRAGILE" for now to make merges easier]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08 15:01:08 -07:00
Sebastian Andrzej Siewior
d5a9bf0b38 pstore/core: drop cmpxchg based updates
I have here a FPGA behind PCIe which exports SRAM which I use for
pstore. Now it seems that the FPGA no longer supports cmpxchg based
updates and writes back 0xff…ff and returns the same.  This leads to
crash during crash rendering pstore useless.
Since I doubt that there is much benefit from using cmpxchg() here, I am
dropping this atomic access and use the spinlock based version.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Rabin Vincent <rabinv@axis.com>
Tested-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
[kees: remove "_locked" suffix since it's the only option now]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2016-09-08 15:00:47 -07:00
Sebastian Andrzej Siewior
4407de74df pstore/ramoops: fixup driver removal
A basic rmmod ramoops segfaults. Let's see why.

Since commit 34f0ec82e0 ("pstore: Correct the max_dump_cnt clearing of
ramoops") sets ->max_dump_cnt to zero before looping over ->przs but we
didn't use it before that either.

And since commit ee1d267423 ("pstore: add pstore unregister") we free
that memory on rmmod.

But even then, we looped until a NULL pointer or ERR. I don't see where
it is ensured that the last member is NULL. Let's try this instead:
simply error recovery and free. Clean up in error case where resources
were allocated. And then, in the free path, rely on ->max_dump_cnt in
the free path.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # 4.4.x-
2016-09-08 14:58:00 -07:00
David Howells
248f219cb8 rxrpc: Rewrite the data and ack handling code
Rewrite the data and ack handling code such that:

 (1) Parsing of received ACK and ABORT packets and the distribution and the
     filing of DATA packets happens entirely within the data_ready context
     called from the UDP socket.  This allows us to process and discard ACK
     and ABORT packets much more quickly (they're no longer stashed on a
     queue for a background thread to process).

 (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim().  We instead
     keep track of the offset and length of the content of each packet in
     the sk_buff metadata.  This means we don't do any allocation in the
     receive path.

 (3) Jumbo DATA packet parsing is now done in data_ready context.  Rather
     than cloning the packet once for each subpacket and pulling/trimming
     it, we file the packet multiple times with an annotation for each
     indicating which subpacket is there.  From that we can directly
     calculate the offset and length.

 (4) A call's receive queue can be accessed without taking locks (memory
     barriers do have to be used, though).

 (5) Incoming calls are set up from preallocated resources and immediately
     made live.  They can than have packets queued upon them and ACKs
     generated.  If insufficient resources exist, DATA packet #1 is given a
     BUSY reply and other DATA packets are discarded).

 (6) sk_buffs no longer take a ref on their parent call.

To make this work, the following changes are made:

 (1) Each call's receive buffer is now a circular buffer of sk_buff
     pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
     between the call and the socket.  This permits each sk_buff to be in
     the buffer multiple times.  The receive buffer is reused for the
     transmit buffer.

 (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
     to the data buffer.  Transmission phase annotations indicate whether a
     buffered packet has been ACK'd or not and whether it needs
     retransmission.

     Receive phase annotations indicate whether a slot holds a whole packet
     or a jumbo subpacket and, if the latter, which subpacket.  They also
     note whether the packet has been decrypted in place.

 (3) DATA packet window tracking is much simplified.  Each phase has just
     two numbers representing the window (rx_hard_ack/rx_top and
     tx_hard_ack/tx_top).

     The hard_ack number is the sequence number before base of the window,
     representing the last packet the other side says it has consumed.
     hard_ack starts from 0 and the first packet is sequence number 1.

     The top number is the sequence number of the highest-numbered packet
     residing in the buffer.  Packets between hard_ack+1 and top are
     soft-ACK'd to indicate they've been received, but not yet consumed.

     Four macros, before(), before_eq(), after() and after_eq() are added
     to compare sequence numbers within the window.  This allows for the
     top of the window to wrap when the hard-ack sequence number gets close
     to the limit.

     Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
     to indicate when rx_top and tx_top point at the packets with the
     LAST_PACKET bit set, indicating the end of the phase.

 (4) Calls are queued on the socket 'receive queue' rather than packets.
     This means that we don't need have to invent dummy packets to queue to
     indicate abnormal/terminal states and we don't have to keep metadata
     packets (such as ABORTs) around

 (5) The offset and length of a (sub)packet's content are now passed to
     the verify_packet security op.  This is currently expected to decrypt
     the packet in place and validate it.

     However, there's now nowhere to store the revised offset and length of
     the actual data within the decrypted blob (there may be a header and
     padding to skip) because an sk_buff may represent multiple packets, so
     a locate_data security op is added to retrieve these details from the
     sk_buff content when needed.

 (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
     individually secured and needs to be individually decrypted.  The code
     to do this is broken out into rxrpc_recvmsg_data() and shared with the
     kernel API.  It now iterates over the call's receive buffer rather
     than walking the socket receive queue.

Additional changes:

 (1) The timers are condensed to a single timer that is set for the soonest
     of three timeouts (delayed ACK generation, DATA retransmission and
     call lifespan).

 (2) Transmission of ACK and ABORT packets is effected immediately from
     process-context socket ops/kernel API calls that cause them instead of
     them being punted off to a background work item.  The data_ready
     handler still has to defer to the background, though.

 (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
     filesystem can shut down the socket and flush its own work items
     before closing the socket to deal with any in-progress service calls.

Future additional changes that will need to be considered:

 (1) Make sure that a call doesn't hog the front of the queue by receiving
     data from the network as fast as userspace is consuming it to the
     exclusion of other calls.

 (2) Transmit delayed ACKs from within recvmsg() when we've consumed
     sufficiently more packets to avoid the background work item needing to
     run.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells
00e907127e rxrpc: Preallocate peers, conns and calls for incoming service requests
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need.  The idea
is to cut out the background thread usage as much as possible.

[Note that the preallocated structs are not actually used in this patch -
 that will be done in a future patch.]

If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.

To this end:

 (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
     maximum number each of the listen backlog size.  The backlog size is
     limited to a maxmimum of 32.  Only this many of each can be in the
     preallocation buffer.

 (2) For userspace sockets, the preallocation is charged initially by
     listen() and will be recharged by accepting or rejecting pending
     new incoming calls.

 (3) For kernel services {,re,dis}charging of the preallocation buffers is
     handled manually.  Two notifier callbacks have to be provided before
     kernel_listen() is invoked:

     (a) An indication that a new call has been instantiated.  This can be
     	 used to trigger background recharging.

     (b) An indication that a call is being discarded.  This is used when
     	 the socket is being released.

     A function, rxrpc_kernel_charge_accept() is called by the kernel
     service to preallocate a single call.  It should be passed the user ID
     to be used for that call and a callback to associate the rxrpc call
     with the kernel service's side of the ID.

 (4) Discard the preallocation when the socket is closed.

 (5) Temporarily bump the refcount on the call allocated in
     rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
     preallocation ref on service calls unconditionally.  This will no
     longer be necessary once the preallocation is used.

Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.

A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer.  This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
Jaegeuk Kim
68f313935f f2fs: no need to make zeros beyond i_size
We don't need to make zeros beyond i_size, since we already wrote that through
NEW_ADDR case.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 18:53:50 -07:00
Chao Yu
7732c26ac3 f2fs: fix to detect temporary name of multimedia file
Some applications may create multimeida file with temporary name like
'*.jpg.tmp' or '*.mp4.tmp', then rename to '*.jpg' or '*.mp4'.

Now, f2fs can only detect multimedia filename with specified format:
"filename + '.' + extension", so it will make f2fs missing to detect
multimedia file with special temporary name, result in failing to set
cold flag on file.

This patch enhances detection flow for enabling lookup extension in the
middle of temporary filename.

Reported-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 18:53:49 -07:00
Chao Yu
6ab2a3085e f2fs: fix minor typo
Correct typo from 'destory' to 'destroy'.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 18:53:48 -07:00
Jaegeuk Kim
6bf6b267d2 f2fs: set dentry bits on random location in memory
This fixes pointer panic when using inline_dentry, which was triggered when
backporting to 3.10.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 18:53:47 -07:00
Chao Yu
c2a080aefa f2fs: fix to set superblock dirty correctly
tests/generic/251 of fstest suit complains us with below message:

------------[ cut here ]------------
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 7698 Comm: fstrim Tainted: G           O    4.7.0+ #21
task: e9f4e000 task.stack: e7262000
EIP: 0060:[<f89fcefe>] EFLAGS: 00010202 CPU: 2
EIP is at write_checkpoint+0xfde/0x1020 [f2fs]
EAX: f33eb300 EBX: eecac310 ECX: 00000001 EDX: ffff0001
ESI: eecac000 EDI: eecac5f0 EBP: e7263dec ESP: e7263d18
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: b76ab01c CR3: 2eb89de0 CR4: 000406f0
Stack:
 00000001 a220fb7b e9f4e000 00000002 419ff2d3 b3a05151 00000002 e9f4e5d8
 e9f4e000 419ff2d3 b3a05151 eecac310 c10b8154 b3a05151 419ff2d3 c10b78bd
 e9f4e000 e9f4e000 e9f4e5d8 00000001 e9f4e000 ec409000 eecac2cc eecac288
Call Trace:
 [<c10b8154>] ? __lock_acquire+0x3c4/0x760
 [<c10b78bd>] ? mark_held_locks+0x5d/0x80
 [<f8a10632>] f2fs_trim_fs+0x1c2/0x2e0 [f2fs]
 [<f89e9f56>] f2fs_ioctl+0x6b6/0x10b0 [f2fs]
 [<c13d51df>] ? __this_cpu_preempt_check+0xf/0x20
 [<c10b4281>] ? trace_hardirqs_off_caller+0x91/0x120
 [<f89e98a0>] ? __exchange_data_block+0xd30/0xd30 [f2fs]
 [<c120b2e1>] do_vfs_ioctl+0x81/0x7f0
 [<c11d57c5>] ? kmem_cache_free+0x245/0x2e0
 [<c1217840>] ? get_unused_fd_flags+0x40/0x40
 [<c1206eec>] ? putname+0x4c/0x50
 [<c11f631e>] ? do_sys_open+0x16e/0x1d0
 [<c1001990>] ? do_fast_syscall_32+0x30/0x1c0
 [<c13d51df>] ? __this_cpu_preempt_check+0xf/0x20
 [<c120baa8>] SyS_ioctl+0x58/0x80
 [<c1001a01>] do_fast_syscall_32+0xa1/0x1c0
 [<c178cc54>] sysenter_past_esp+0x45/0x74
EIP: [<f89fcefe>] write_checkpoint+0xfde/0x1020 [f2fs] SS:ESP 0068:e7263d18
---[ end trace 4de95d7e6b3aa7c6 ]---

The reason is: with below call stack, we will encounter BUG_ON during
doing fstrim.

Thread A				Thread B
- write_checkpoint
 - do_checkpoint
					- f2fs_write_inode
					 - update_inode_page
					  - update_inode
					   - set_page_dirty
					    - f2fs_set_node_page_dirty
					     - inc_page_count
					      - percpu_counter_inc
					      - set_sbi_flag(SBI_IS_DIRTY)
  - clear_sbi_flag(SBI_IS_DIRTY)

Thread C				Thread D
- f2fs_write_node_page
 - set_node_addr
  - __set_nat_cache_dirty
   - nm_i->dirty_nat_cnt++
					- do_vfs_ioctl
					 - f2fs_ioctl
					  - f2fs_trim_fs
					   - write_checkpoint
					    - f2fs_bug_on(nm_i->dirty_nat_cnt)

Fix it by setting superblock dirty correctly in do_checkpoint and
f2fs_write_node_page.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 18:53:47 -07:00
Shuoran Liu
e7ba108a06 f2fs: add roll-forward recovery process for encrypted dentry
Add roll-forward recovery process for encrypted dentry, so the first fsync
issued to an encrypted file does not need writing checkpoint.

This improves the performance of the following test at thousands of small
files: open -> write -> fsync -> close

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify kernel message to show encrypted names]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:40 -07:00
Jaegeuk Kim
bbf156f7af f2fs: fix lost xattrs of directories
This patch enhances the xattr consistency of dirs from suddern power-cuts.

Possible scenario would be:
1. dir->setxattr used by per-file encryption
2. file->setxattr goes into inline_xattr
3. file->fsync

In that case, we should do checkpoint for #1.
Otherwise we'd lose dir's key information for the file given #2.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:39 -07:00
Chao Yu
275b66b09e f2fs: support async discard
Like most filesystems, f2fs will issue discard command synchronously, so
when user trigger fstrim through ioctl, multiple discard commands will be
issued serially with sync mode, which makes poor performance.

In this patch we try to support async discard, so that all discard
commands can be issued and be waited for endio in batch to improve
performance.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:38 -07:00
Shuoran Liu
167451efb5 f2fs: set encryption name flag in add inline entry path
This patch sets encryption name flag in the add inline entry path
if filename is encrypted.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:37 -07:00
Chao Yu
e06f86e61d f2fs crypto: avoid unneeded memory allocation in ->readdir
When decrypting dirents in ->readdir, fscrypt_fname_disk_to_usr won't
change content of original encrypted dirent, we don't need to allocate
additional buffer for storing mirror of it, so get rid of it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:36 -07:00
Chao Yu
9421d57051 f2fs: fix to do security initialization of encrypted inode with original filename
When creating new inode, security_inode_init_security will be called for
initializing security info related to the inode, and filename is passed to
security module, it helps security module such as SElinux to know which
rule or label could be applied for the inode with specified name.

Previously, if new inode is created as an encrypted one, f2fs will transfer
encrypted filename to security module which may fail the check of security
policy belong to the inode. So in order to this issue, alter to transfer
original unencrypted filename instead.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:35 -07:00
Chao Yu
7ea984b060 f2fs: do in batch synchronously readahead during GC
In order to enhance performance, we try to readahead node page during
GC, but before loading node page we should get block address of node page
which is stored in NAT table, so synchronously read of single NAT page
block our readahead flow.

f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xa1e, oldaddr = 0xa1e, newaddr = 0xa1e, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x35e9, oldaddr = 0x72d7a, newaddr = 0x72d7a, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xc1f, oldaddr = 0xc1f, newaddr = 0xc1f, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x389d, oldaddr = 0x72d7d, newaddr = 0x72d7d, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3a82, oldaddr = 0x72d7f, newaddr = 0x72d7f, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3bfa, oldaddr = 0x72d86, newaddr = 0x72d86, rw = READAHEAD ^H, type = NODE

This patch adds one phase that do readahead NAT pages in batch before
readahead node page for more effeciently.

f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0x1952, oldaddr = 0x1952, newaddr = 0x1952, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc34, oldaddr = 0xc34, newaddr = 0xc34, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa33, oldaddr = 0xa33, newaddr = 0xa33, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc30, oldaddr = 0xc30, newaddr = 0xc30, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc32, oldaddr = 0xc32, newaddr = 0xc32, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc26, oldaddr = 0xc26, newaddr = 0xc26, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa2b, oldaddr = 0xa2b, newaddr = 0xa2b, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc23, oldaddr = 0xc23, newaddr = 0xc23, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc24, oldaddr = 0xc24, newaddr = 0xc24, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa10, oldaddr = 0xa10, newaddr = 0xa10, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc2c, oldaddr = 0xc2c, newaddr = 0xc2c, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db7, oldaddr = 0x6be00, newaddr = 0x6be00, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db9, oldaddr = 0x6be17, newaddr = 0x6be17, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dbc, oldaddr = 0x6be1a, newaddr = 0x6be1a, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc3, oldaddr = 0x6be20, newaddr = 0x6be20, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc7, oldaddr = 0x6be24, newaddr = 0x6be24, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc9, oldaddr = 0x6be25, newaddr = 0x6be25, rw = READAHEAD ^H, type = NODE

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:34 -07:00
Chao Yu
74fa5f3d43 f2fs: schedule in between two continous batch discards
In batch discard approach of fstrim will grab/release gc_mutex lock
repeatly, it makes contention of the lock becoming more intensive.

So after one batch discards were issued in checkpoint and the lock
was released, it's better to do schedule() to increase opportunity
of grabbing gc_mutex lock for other competitors.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:33 -07:00
Chris Mason
b7f3c7d345 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8 2016-09-07 12:55:36 -07:00
David Howells
5a42976d4f rxrpc: Add tracepoint for working out where aborts happen
Add a tracepoint for working out where local aborts happen.  Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.

rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 16:34:40 +01:00
Christophe JAILLET
240c5185c5 jfs: Simplify code
Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to
'list_splice_init'.

This has been spotted with the following coccinelle script:
/////
@@
expression y,z;
@@

-   list_splice(y,z);
-   INIT_LIST_HEAD(y);
+   list_splice_init(y,z);

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2016-09-06 12:17:24 -05:00
Jan Kara
f27792f5b7 udf: Remove useless check in udf_adinicb_write_begin()
As Al properly points out, len is guaranteed to be smaller than
PAGE_SIZE when we reach udf_adinicb_write_begin() as otherwise we would
have converted the file to the normal format.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-06 18:04:40 +02:00
Wang Xiaoguang
ce129655c9 btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.

	ticket = list_first_entry(&space_info->tickets,
				  struct reserve_ticket, list);
	if (last_ticket == ticket) {
		flush_state++;
	} else {
		last_ticket = ticket;
		flush_state = FLUSH_DELAYED_ITEMS_NR;
		if (commit_cycles)
			commit_cycles--;
	}

But indeed it's wrong, we should not rely on local variable's address to
do this check, because addresses may be same. In my test environment, I
dd one 168MB file in a 256MB fs, found that for this file, every time
wait_reserve_ticket() called, local variable ticket's address is same,

For above codes, assume a previous ticket's address is addrA, last_ticket
is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
wake up it, then another ticket is added, but with the same address addrA,
now last_ticket will be same to current ticket, then current ticket's flush
work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
which may result in some enospc issues(I have seen this in my test machine).

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-06 16:31:43 +02:00
Chris Mason
cbd60aa7cd Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
We use a btrfs_log_ctx structure to pass information into the
tree log commit, and get error values out.  It gets added to a per
log-transaction list which we walk when things go bad.

Commit d1433debe added an optimization to skip waiting for the log
commit, but didn't take root_log_ctx out of the list.  This
patch makes sure we remove things before exiting.

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: d1433debe7
cc: stable@vger.kernel.org # 3.15+
2016-09-06 05:57:25 -07:00
Dmitry Monakhov
e22834f024 ext4: improve ext4lazyinit scalability
ext4lazyinit is a global thread. This thread performs itable
initalization under li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because  lazy_thread holds it during full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
block on sb_start_write() so other mount/umount will be stuck forever.

This patch changes logic like follows:
- grab ->s_umount read sem before processing new li_request.
  After that it is safe to drop li_list_mtx because all callers of
  li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:38:36 -04:00