4107 Commits

Author SHA1 Message Date
Peng Tao
97ba375b5d pnfs/flexfiles: report layoutstat regularly
As a simple scheme, report every minute if IO is still going on.
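
A minimal user-space sketch of such a once-a-minute throttle follows. The `layoutstat_timer` struct and the `layoutstat_due()` helper are hypothetical names chosen for illustration, not the kernel's. The sketch also assumes that the first I/O triggers a report, which the commit message does not state.

```c
#include <stdbool.h>
#include <time.h>

/* "every minute" from the commit message */
#define REPORT_INTERVAL_SEC 60

/* Hypothetical per-layout bookkeeping, not the kernel's struct. */
struct layoutstat_timer {
    bool   reported;     /* has any report been sent yet? */
    time_t last_report;  /* time of the most recent report */
};

/*
 * Call on each I/O. Returns true when a report should be sent now,
 * and records the time so the next one is at least a minute away.
 */
bool layoutstat_due(struct layoutstat_timer *t, time_t now)
{
    if (t->reported && now - t->last_report < REPORT_INTERVAL_SEC)
        return false;
    t->reported = true;
    t->last_report = now;
    return true;
}
```

Because the check runs only when I/O happens, an idle file sends no reports in this sketch.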

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:54:23 -04:00
Peng Tao
1bfe3b259f nfs42: serialize LAYOUTSTATS calls of the same file
There is no need to report concurrently.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:11 -04:00
Peng Tao
27c4306443 pnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:11 -04:00
Peng Tao
ad4dc53e64 pnfs/flexfiles: add ff_layout_prepare_layoutstats
It fills in the generic part of the LAYOUTSTATS call. One thing to note
is that we don't really track whether IO is continuous or not, so just
fake it by using the completed bytes.

The flexfiles-specific part is still missing; it will be included in the next patch.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:11 -04:00
Peng Tao
d983803d38 pNFS/flexfiles: track when layout is first used
So that we can report cumulative time since the beginning
of statistics collection of the layout.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:10 -04:00
Trond Myklebust
abcb7bfc9f pNFS/flexfiles: add layoutstats tracking
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:10 -04:00
Trond Myklebust
27797d1bb3 pNFS/flexfiles: Remove unused struct members user_name, group_name
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:17:37 -04:00
Peng Tao
8733408d6e pnfs: add pnfs_report_layoutstat helper function
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:17:37 -04:00
Peng Tao
1b4a4bd82c pNFS: fill in nfs42_layoutstat_ops
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:17:37 -04:00
Trond Myklebust
be3a5d2339 NFSv.2/pnfs Add a LAYOUTSTATS rpc function
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:17:37 -04:00
Al Viro
dc3f4198ea make simple_positive() public
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:02:01 -04:00
Linus Torvalds
052b398a43 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this pile: pathname resolution rewrite.

   - recursion in link_path_walk() is gone.

   - nesting limits on symlinks are gone (the only limit remaining is
     that the total amount of symlinks is no more than 40, no matter how
     nested).

   - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

   - stack footprint (independent of the nesting) is below kilobyte now,
     about on par with what it used to be with one level of nested
     symlinks and ~2.8 times lower than it used to be in the worst case.

   - struct nameidata is entirely private to fs/namei.c now (not even
     opaque pointers are being passed around).

   - ->follow_link() and ->put_link() calling conventions had been
     changed; all in-tree filesystems converted, out-of-tree should be
     able to follow reasonably easily.

     For out-of-tree conversions, see Documentation/filesystems/porting
     for details (and in-tree filesystems for examples of conversion).

  That has sat in -next since mid-May, seems to survive all testing
  without regressions and merges clean with v4.1"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
  turn user_{path_at,path,lpath,path_dir}() into static inlines
  namei: move saved_nd pointer into struct nameidata
  inline user_path_create()
  inline user_path_parent()
  namei: trim do_last() arguments
  namei: stash dfd and name into nameidata
  namei: fold path_cleanup() into terminate_walk()
  namei: saner calling conventions for filename_parentat()
  namei: saner calling conventions for filename_create()
  namei: shift nameidata down into filename_parentat()
  namei: make filename_lookup() reject ERR_PTR() passed as name
  namei: shift nameidata inside filename_lookup()
  namei: move putname() call into filename_lookup()
  namei: pass the struct path to store the result down into path_lookupat()
  namei: uninline set_root{,_rcu}()
  namei: be careful with mountpoint crossings in follow_dotdot_rcu()
  Documentation: remove outdated information from automount-support.txt
  get rid of assorted nameidata-related debris
  lustre: kill unused helper
  lustre: kill unused macro (LOOKUP_CONTINUE)
  ...
2015-06-22 12:51:21 -07:00
Trond Myklebust
1372a3130a Merge branch 'bugfixes'
* bugfixes:
  NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes
  pNFS: Fix a memory leak when attempted pnfs fails
  NFS: Ensure that we update the sequence id under the slot table lock
  nfs: Initialize cb_sequenceres information before validate_seqid()
  nfs: Only update callback sequnce id when CB_SEQUENCE success
  NFSv4: nfs4_handle_delegation_recall_error should ignore EAGAIN
2015-06-22 09:55:08 -04:00
Yijing Wang
dfad7000f3 nfs: Fix comment for nfs_pageio_init() and nfs_pageio_complete_mirror()
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-18 08:59:13 -04:00
Trond Myklebust
c70701131f NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes
If a write attempt fails, and the write is queued up for resending to
the server, as opposed to being dropped, then we need to set the
appropriate flag so that nfs_file_fsync() does the right thing.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-17 20:00:42 -04:00
Trond Myklebust
1ca018d28d pNFS: Fix a memory leak when attempted pnfs fails
pnfs_do_write() expects the call to pnfs_write_through_mds() to free the
pgio header and to release the layout segment before exiting. The problem
is that nfs_pgio_data_destroy() doesn't actually do this; it only frees
the memory allocated by nfs_generic_pgio().

Ditto for pnfs_do_read()...

The fix in both cases is to add a call to hdr->release(hdr).

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-17 20:00:26 -04:00
Trond Myklebust
5ba12443a1 NFSv4: Fix stateid recovery on revoked delegations
Ensure that we fix the non-NULL stateid case as well.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:29:51 -04:00
Olga Kornievskaia
ae2ffef383 Recover from stateid-type error on SETATTR
The client can receive a stateid-type error (e.g., BAD_STATEID) on SETATTR when
a delegation stateid was used. When no open state exists, as when an application
calls truncate() on the file, the client has no state to recover and fails with
EIO.

Instead, upon such error, return the bad delegation and then resend the
SETATTR with a zero stateid.

Signed-off: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:29:46 -04:00
Kinglong Mee
df05a49f72 nfs: Fix showing truncated fsid/dev in, /proc/net/nfsfs/volumes
A truncated fsid is shown in /proc/fs/nfsfs/volumes as:
NV SERVER   PORT DEV     FSID              FSC
v4 c0a80881  801 0:43    34931f044c2a439b  no

It should be shown as:
NV SERVER   PORT DEV          FSID                              FSC
v4 c0a80881  801 0:43         34931f044c2a439b:954c5d830fa4be8c no

The max buffer length for storing "%llx:%llx" format should be
 16 + 1 + 16 + 1 = 34 (16 for %llx, 1 for ':', 1 for '\0').

Also, the buffer for storing "%u:%u" of MAJOR() and MINOR() should be
 8 + 1 + 3 + 1 = 13 (8 for 2^24, 1 for ':', 3 for 2^8, 1 for '\0').

v2: add comments for the dev/fsid buffers and use sizeof in snprintf.
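
The buffer arithmetic above can be checked with a small user-space sketch. The macro and function names (`FSID_BUF_LEN`, `DEV_BUF_LEN`, `format_fsid`, `format_dev`) are hypothetical and are not taken from the patch; only the two format strings and the sizes 34 and 13 come from the commit message.

```c
#include <stdio.h>

/* 16 hex digits + ':' + 16 hex digits + '\0' */
#define FSID_BUF_LEN 34
/* up to 8 decimal digits + ':' + up to 3 decimal digits + '\0' */
#define DEV_BUF_LEN 13

/* Format an fsid as "major:minor" in hex; returns snprintf's result. */
int format_fsid(char *buf, size_t len,
                unsigned long long major, unsigned long long minor)
{
    return snprintf(buf, len, "%llx:%llx", major, minor);
}

/* Format a device number as "major:minor" in decimal. */
int format_dev(char *buf, size_t len, unsigned int major, unsigned int minor)
{
    return snprintf(buf, len, "%u:%u", major, minor);
}
```

A caller would declare `char fsid[FSID_BUF_LEN];` and pass `sizeof(fsid)` as the length, so the buffer size and the snprintf limit cannot drift apart. snprintf returns the length it needed, so a return value at or above the buffer size means the output was truncated.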

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:17:37 -04:00
Jeff Layton
873e385116 nfs: make nfs4_init_uniform_client_string use a dynamically allocated buffer
Change the uniform client string generator to dynamically allocate the
NFSv4 client name string buffer. With this patch, we can eliminate the
buffers that are embedded within the "args" structs and simply use the
name string that is hanging off the client.

This uniform string case is a little simpler than the nonuniform since
we don't need to deal with RCU, but we do have two different cases,
depending on whether there is a uniquifier or not.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:51 -04:00
Jeff Layton
a319268891 nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer
The way the *_client_string functions work is a little goofy. They build
the string in an on-stack buffer and then use kstrdup to copy it. This
is not only stack-heavy but artificially limits the size of the client
name string. Change it so that we determine the length of the string,
allocate it and then scnprintf into it.

Since the contents of the nonuniform string depend on rcu-managed data
structures, it's possible that they'll change between when we allocate
the string and when we go to fill it. If that happens, free the string,
recalculate the length and try again. If the mismatch isn't resolved
on the second try then just give up and return -EINVAL.
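
The measure, allocate, format, and retry-once pattern described above might look like this in user space. This is only a sketch: `build_client_string`, `name_source_fn`, `fixed_name`, and the `"client/%s"` format are hypothetical, and it returns NULL where the commit returns -EINVAL or -ENOMEM. A callback stands in for the RCU-managed data, which may change between the two passes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical source of data that may change between calls. */
typedef const char *(*name_source_fn)(void *ctx);

/* Trivial source used for demonstration: ctx is the string itself. */
const char *fixed_name(void *ctx)
{
    return ctx;
}

/*
 * Measure the string, allocate exactly that much, then format into it.
 * If the data grew in between and no longer fits, free and retry once.
 * Returns a malloc()ed string, or NULL on failure.
 */
char *build_client_string(name_source_fn get_name, void *ctx)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        /* Pass 1: compute the required length without writing. */
        int need = snprintf(NULL, 0, "client/%s", get_name(ctx));
        if (need < 0)
            return NULL;

        size_t len = (size_t)need + 1;  /* room for the trailing '\0' */
        char *str = malloc(len);
        if (!str)
            return NULL;

        /* Pass 2: format for real; the data may have changed. */
        int wrote = snprintf(str, len, "client/%s", get_name(ctx));
        if (wrote >= 0 && (size_t)wrote < len)
            return str;                 /* it fit, done */

        free(str);                      /* truncated: try again */
    }
    return NULL;                        /* gave up after two tries */
}
```

The caller owns the returned string and must free() it.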

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:45 -04:00
Jeff Layton
b8fb2f595e nfs: update maxsz values for SETCLIENTID and EXCHANGE_ID
The spec allows for up to NFS4_OPAQUE_LIMIT (1k). While we'll almost
certainly never use that much, these ops are generally the only ones
in the compound so we might as well allow for them to be that large.

Also, the existing code didn't add in a word for the opaque length
field for either name string. Fix that while we're in there.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:40 -04:00
Jeff Layton
3a6bb73879 nfs: convert setclientid and exchange_id encoders to use clp->cl_owner_id
...instead of buffers that are part of their arg structs. We already
hold a reference to the client, so we might as well use the allocated
buffer. In the event that we can't allocate the clp->cl_owner_id, then
just return -ENOMEM.

Note too that we switch from a GFP_KERNEL allocation here to GFP_NOFS.
It's possible we could end up trying to do a SETCLIENTID or EXCHANGE_ID
in order to reclaim some memory, and the GFP_KERNEL allocations in the
existing code could cause recursion back into NFS reclaim.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:31 -04:00
Fabian Frederick
455b6ee645 pnfs/flexfiles: use swap() in ff_layout_sort_mirrors()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:14:03 -04:00
Trond Myklebust
4e54ab8d8c NFS: Ensure that we update the sequence id under the slot table lock
Fix a callback slot table regression.

Fixes: e937ee714b2d ("nfs: Only update callback sequnce id when CB_SEQUENCE success")
Cc: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 21:15:52 -04:00
Kinglong Mee
0579c8d208 nfs: Initialize cb_sequenceres information before validate_seqid()
For a cb_layoutrecall replay, nfsd got a CB_SEQUENCE status of zero,
but all fields of cb_sequenceres were zero too.

validate_seqid() returns NFS4ERR_RETRY_UNCACHED_REP for a replay,
and the initialization of cb_sequenceres is skipped.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 21:09:06 -04:00
Jeff Layton
6f02dc88be nfs: deny backchannel RPCs with an incorrect authflavor instead of dropping them
A drop should really only be done when the frame is malformed or we have
reason to think that there is some sort of DoS going on. When we get an
RPC with bad auth, we should send back an error instead.

Cc: Andy Adamson <William.Adamson@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 14:06:34 -04:00
Kinglong Mee
e937ee714b nfs: Only update callback sequnce id when CB_SEQUENCE success
When testing pnfs layout, nfsd got the error NFS4ERR_SEQ_MISORDERED.
This happens because the NFS client returns NFS4ERR_DELAY before
validate_seqid() and does not update the sequence id, while nfsd does
update it.

According to RFC5661 20.9.3,
" If CB_SEQUENCE returns an error, then the state of the slot
  (sequence ID, cached reply) MUST NOT change. "
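
The rule quoted above can be illustrated with a minimal sketch. The `cb_slot` struct, the `cb_sequence_finish()` helper, and the status constants are hypothetical stand-ins, not the kernel's callback code; the only point shown is that the slot's sequence id advances on success and stays untouched on any error.

```c
#include <stdint.h>

/* Hypothetical status values for illustration only. */
enum {
    CB_STATUS_OK    = 0,
    CB_STATUS_DELAY = 1   /* stands in for an error such as NFS4ERR_DELAY */
};

/* Hypothetical, stripped-down callback slot. */
struct cb_slot {
    uint32_t seq_nr;      /* last sequence id accepted on this slot */
};

/*
 * Commit the new sequence id only when the operation succeeded.
 * On error the slot state is left exactly as it was.
 */
int cb_sequence_finish(struct cb_slot *slot, uint32_t request_seq, int status)
{
    if (status == CB_STATUS_OK)
        slot->seq_nr = request_seq;
    return status;
}
```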

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 14:00:40 -04:00
Vaishali Thakkar
4ed0d83d05 NFS: Convert use of __constant_htonl to htonl
In little endian cases, the macro htonl unfolds to __swab32 which
provides special case for constants. In big endian cases,
__constant_htonl and htonl expand directly to the same expression.
So, replace __constant_htonl with htonl with the goal of getting
rid of the definition of __constant_htonl completely.

The semantic patch that performs this transformation is as follows:

@@expression x;@@

- __constant_htonl(x)
+ htonl(x)
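
As a user-space illustration, the plain htonl() from <arpa/inet.h> can be applied to a constant just as well as to a variable. The `put_be32()` helper below is a hypothetical name for this sketch; it is not part of the patch.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Store a 32-bit host-order value into buf in network (big-endian)
 * byte order. htonl() swaps bytes on little-endian hosts and is a
 * no-op on big-endian hosts, so the result is the same on both.
 */
void put_be32(unsigned char buf[4], uint32_t host_value)
{
    uint32_t wire = htonl(host_value);
    memcpy(buf, &wire, sizeof(wire));
}
```

Calling `put_be32(buf, 0x01020304u)` leaves the most significant byte first in `buf` regardless of the host's endianness.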

Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:57:59 -04:00
Anna Schumaker
11598b8ff2 NFS: Remove unused nfs_rw_ops->rw_release() function
This was only ever set to nfs_writeback_release_common(), a function
which is completely empty.  Let's just drop this function pointer and
simplify the code a bit.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:32:40 -04:00
Dominique Martinet
c86c90c656 NFSv4: handle nfs4_get_referral failure
nfs4_proc_lookup_common is supposed to return a POSIX error, so we have to
handle any returned error that isn't an errno.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:28:02 -04:00
Jeff Layton
3c87ef6efb sunrpc: keep a count of swapfiles associated with the rpc_clnt
Jerome reported seeing a warning pop when working with a swapfile on
NFS. The nfs_swap_activate can end up calling sk_set_memalloc while
holding the rcu_read_lock and that function can sleep.

To fix that, we need to take a reference to the xprt while holding the
rcu_read_lock, set the socket up for swapping and then drop that
reference. But, xprt_put is not exported and having NFS deal with the
underlying xprt is a bit of a layering violation anyway.

Fix this by adding a set of activate/deactivate functions that take a
rpc_clnt pointer instead of an rpc_xprt, and have nfs_swap_activate and
nfs_swap_deactivate call those.

Also, add a per-rpc_clnt atomic counter to keep track of the number of
active swapfiles associated with it. When the counter does a 0->1
transition, we enable swapping on the xprt, when we do a 1->0 transition
we disable swapping on it.

This also allows us to be a bit more selective with the RPC_TASK_SWAPPER
flag. If non-swapper and swapper clnts are sharing a xprt, then we only
need to flag the tasks from the swapper clnt with that flag.
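
The counter scheme described above can be sketched in user space with C11 atomics. All names here (`struct clnt`, `clnt_init`, `swap_activate`, `swap_deactivate`, `xprt_swap_enabled`) are hypothetical stand-ins for the rpc_clnt machinery. The boolean stands in for enabling swapping on the transport and is not itself synchronized in this simplified sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an rpc_clnt with a swapfile counter. */
struct clnt {
    atomic_int swappers;        /* number of active swapfiles */
    bool       xprt_swap_enabled; /* stands in for the transport state */
};

void clnt_init(struct clnt *c)
{
    atomic_init(&c->swappers, 0);
    c->xprt_swap_enabled = false;
}

/* Enable swapping on the transport only on the 0->1 transition. */
void swap_activate(struct clnt *c)
{
    if (atomic_fetch_add(&c->swappers, 1) == 0)
        c->xprt_swap_enabled = true;
}

/* Disable swapping on the transport only on the 1->0 transition. */
void swap_deactivate(struct clnt *c)
{
    if (atomic_fetch_sub(&c->swappers, 1) == 1)
        c->xprt_swap_enabled = false;
}
```

atomic_fetch_add() and atomic_fetch_sub() return the value held before the update, which is why the comparisons are against 0 and 1 respectively.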

Acked-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:26:14 -04:00
Trond Myklebust
8eee52af27 NFSv4: nfs4_handle_delegation_recall_error should ignore EAGAIN
EAGAIN is a valid return code from nfs4_open_recover(), and should
be handled by nfs4_handle_delegation_recall_error by simply passing
it through.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-04 13:51:13 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Tejun Heo
93f78d8828 writeback: move backing_dev_info->bdi_stat[] into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
  enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
  frees stat.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Julia Lawall
13985b1f77 NFS: drop unneeded goto
Delete jump to a label on the next line, when that label is not
used elsewhere.

A simplified version of the semantic patch that makes this change is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
identifier l;
@@

-if (...) goto l;
-l:
// </smpl>

Also drop the unnecessary ret variable.
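
A before-and-after illustration of the pattern, using hypothetical functions (`finish_before` and `finish_after` are not from the patched file): the jump targets the very next line, so both the goto and the temporary variable can go without changing behavior.

```c
/* Before: the goto jumps to a label on the next line. */
int finish_before(int status)
{
    int ret = status;

    if (ret < 0)
        goto out;
out:
    return ret;
}

/* After: same behavior, no label, no temporary. */
int finish_after(int status)
{
    return status;
}
```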

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-02 08:55:28 -04:00
Chuck Lever
d683cc49da NFS: Fix size of NFSACL SETACL operations
When encoding the NFSACL SETACL operation, reserve just the estimated
size of the ACL rather than a fixed maximum. This eliminates needless
zero padding on the wire that the server ignores.

Fixes: ee5dc7732bd5 ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-02 08:55:28 -04:00
NeilBrown
7ef5ca4fe4 NFS: report more appropriate block size for directories.
In glibc 2.21 (and several previous), a call to opendir() will
result in a 32K (BUFSIZ*4) buffer being allocated and passed to
getdents.

However a call to fdopendir() results in an 'fstat' request to
determine block size and a matching buffer allocated for subsequent
use with getdents.  This will typically be 1M.

The first getdents call on an NFS directory will always use
READDIR_PLUS (or NFSv4 equivalent) if available.  Subsequent getdents
calls only use this more expensive version if some 'stat' requests are
made between the getdents calls.

For this reason it is good to keep at least that first getdents call
relatively short.  When fdopendir() and readdir() are used on a large
directory, it takes approximately 32 times as long to complete as
using "opendir".  Current versions of 'find' use fdopendir() and
demonstrate this slowness.

'stat' on a directory currently returns the 'wsize'.  This number has
no meaning on directories.
Actual READDIR requests are limited to ->dtsize, which itself is
capped at 4 pages, coincidentally the same as BUFSIZ*4.
So this is a meaningful number to use as the blocksize on directories,
and has the effect of making 'find' on large directories go a lot
faster.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-02 08:55:27 -04:00
Trond Myklebust
5cae02f427 NFSv4: Always drain the slot table before re-establishing the lease
While the NFSv4.1 code has always drained the slot tables in order to stop
non-recovery related RPC calls when doing lease recovery, the NFSv4 code
did not.
The reason for the difference in behaviour is that NFSv4 does not have
session state, and so RPC calls can in theory proceed while recovery is
happening. In practice, however, anything I/O or state related needs to
wait until recovery is over.

This patch changes the behaviour of NFSv4 to match that of NFSv4.1 so that
we can simplify the state recovery code by assuming that we do not have to
deal with races between recovery and ordinary I/O.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-02 08:55:27 -04:00
Olga Kornievskaia
e8d975e73e fixing infinite OPEN loop in 4.0 stateid recovery
Problem: When an operation like WRITE receives a BAD_STATEID, the
recovery code clears the RECLAIM_NOGRACE recovery flag before recovering
the open state. However, because the delegation state for the associated
inode is cleared, nfs_inode_find_state_and_recover() gets called and marks
the same state with the RECLAIM_NOGRACE flag again. As a result, when we
restart looking over the open states, we end up in an infinite loop instead
of breaking out in the next test of state flags.

Solution: after returning from the recover_open() function, clear the
RECLAIM_NOGRACE flag that was set by the call to
nfs_inode_find_state_and_recover().

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-01 09:58:02 -04:00
Luis R. Rodriguez
9c27847dda kernel/params: constify struct kernel_param_ops uses
Most code already uses consts for the struct kernel_param_ops,
sweep the kernel for the last offending stragglers. Other than
include/linux/moduleparam.h and kernel/params.c all other changes
were generated with the following Coccinelle SmPL patch. Merge
conflicts between trees can be handled with Coccinelle.

In the future git could get Coccinelle merge support to deal with
patch --> fail --> grammar --> Coccinelle --> new patch conflicts
automatically for us on patches where the grammar is available and
the patch is of high confidence. Consider this a feature request.

Test compiled on x86_64 against:

	* allnoconfig
	* allmodconfig
	* allyesconfig

@ const_found @
identifier ops;
@@

const struct kernel_param_ops ops = {
};

@ const_not_found depends on !const_found @
identifier ops;
@@

-struct kernel_param_ops ops = {
+const struct kernel_param_ops ops = {
};

Generated-by: Coccinelle SmPL
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: cocci@systeme.lip6.fr
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-05-28 11:32:10 +09:30
Jeff Layton
feaff8e5b2 nfs: take extra reference to fl->fl_file when running a setlk
We had a report of a crash while stress testing the NFS client:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
    IP: [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_security ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_filter ip6_tables iptable_security iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw coretemp crct10dif_pclmul ppdev crc32_pclmul crc32c_intel ghash_clmulni_intel vmw_balloon serio_raw vmw_vmci i2c_piix4 shpchp parport_pc acpi_cpufreq parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih mptbase e1000 ata_generic pata_acpi
    CPU: 1 PID: 399 Comm: kworker/1:1H Not tainted 4.1.0-0.rc1.git0.1.fc23.x86_64 #1
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
    Workqueue: rpciod rpc_async_schedule [sunrpc]
    task: ffff880036aea7c0 ti: ffff8800791f4000 task.ti: ffff8800791f4000
    RIP: 0010:[<ffffffff8127b698>]  [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
    RSP: 0018:ffff8800791f7c00  EFLAGS: 00010293
    RAX: ffff8800791f7c40 RBX: ffff88001f2ad8c0 RCX: ffffe8ffffc80305
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff8800791f7c88 R08: ffff88007fc971d8 R09: 279656d600000000
    R10: 0000034a01000000 R11: 279656d600000000 R12: ffff88001f2ad918
    R13: ffff88001f2ad8c0 R14: 0000000000000000 R15: 0000000100e73040
    FS:  0000000000000000(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000150 CR3: 0000000001c0b000 CR4: 00000000000407e0
    Stack:
     ffffffff8127c5b0 ffff8800791f7c18 ffffffffa0171e29 ffff8800791f7c58
     ffffffffa0171ef8 ffff8800791f7c78 0000000000000246 ffff88001ea0ba00
     ffff8800791f7c40 ffff8800791f7c40 00000000ff5d86a3 ffff8800791f7ca8
    Call Trace:
     [<ffffffff8127c5b0>] ? __posix_lock_file+0x40/0x760
     [<ffffffffa0171e29>] ? rpc_make_runnable+0x99/0xa0 [sunrpc]
     [<ffffffffa0171ef8>] ? rpc_wake_up_task_queue_locked.part.35+0xc8/0x250 [sunrpc]
     [<ffffffff8127cd3a>] posix_lock_file_wait+0x4a/0x120
     [<ffffffffa03e4f12>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
     [<ffffffffa03bf108>] ? nfs41_sequence_done+0xd8/0x2d0 [nfsv4]
     [<ffffffffa03c116d>] do_vfs_lock+0x2d/0x30 [nfsv4]
     [<ffffffffa03c251d>] nfs4_lock_done+0x1ad/0x210 [nfsv4]
     [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
     [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
     [<ffffffffa0171a5c>] rpc_exit_task+0x2c/0xa0 [sunrpc]
     [<ffffffffa0167450>] ? call_refreshresult+0x150/0x150 [sunrpc]
     [<ffffffffa0172640>] __rpc_execute+0x90/0x460 [sunrpc]
     [<ffffffffa0172a25>] rpc_async_schedule+0x15/0x20 [sunrpc]
     [<ffffffff810baa1b>] process_one_work+0x1bb/0x410
     [<ffffffff810bacc3>] worker_thread+0x53/0x480
     [<ffffffff810bac70>] ? process_one_work+0x410/0x410
     [<ffffffff810bac70>] ? process_one_work+0x410/0x410
     [<ffffffff810c0b38>] kthread+0xd8/0xf0
     [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180
     [<ffffffff817a1aa2>] ret_from_fork+0x42/0x70
     [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180

Jean says:

"Running locktests with a large number of iterations resulted in a
 client crash.  The test run took a while and hasn't finished after close
 to 2 hours. The crash happened right after I gave up and killed the test
 (after 107m) with Ctrl+C."

The crash happened because a NULL inode pointer got passed into
locks_get_lock_context. The call chain indicates that file_inode(filp)
returned NULL, which means that f_inode was NULL. Since that's zeroed
out in __fput, that suggests that this filp pointer outlived the last
reference.

Looking at the code, that seems possible. We copy the struct file_lock
that's passed in, but if the task is signalled at an inopportune time we
can end up trying to use that file_lock in rpciod context after the process
that requested it has already returned (and possibly put its filp
reference).

Fix this by taking an extra reference to the filp when we allocate the
lock info, and put it in nfs4_lock_release.
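
The lifetime rule behind this fix can be sketched with a plain reference count. Everything below is a hypothetical user-space model (`file_ref`, `lock_info`, and the helper names are not the kernel's): the asynchronous lock data takes its own reference when it is set up and drops it in its release step, so the file outlives the requesting process's reference.

```c
#include <stddef.h>

/* Hypothetical refcounted file object. */
struct file_ref {
    int count;
};

/* Hypothetical lock data that may be used after the caller returns. */
struct lock_info {
    struct file_ref *file;
};

struct file_ref *file_get(struct file_ref *f)
{
    f->count++;
    return f;
}

void file_put(struct file_ref *f)
{
    f->count--;   /* a real implementation would free at zero */
}

/* Take an extra reference for as long as the lock data exists. */
void lock_info_init(struct lock_info *li, struct file_ref *f)
{
    li->file = file_get(f);
}

/* Drop that reference when the lock data is released. */
void lock_info_release(struct lock_info *li)
{
    file_put(li->file);
    li->file = NULL;
}
```

Even if the requester drops its own reference early, the count stays above zero until `lock_info_release()` runs.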

Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-05-13 14:56:06 -04:00
Chuck Lever
6b19687563 nfs: stat(2) fails during cthon04 basic test5 on NFSv4.0
When running the Connectathon basic tests against a Solaris NFS
server over NFSv4.0, test5 reports that stat(2) returns a file size
of zero instead of 1MB.

On success, nfs_commit_inode() can return a positive result; see
other call sites such as nfs_file_fsync_commit() and
nfs_commit_unstable_pages().

The call site recently added in nfs_wb_all() does not prevent that
positive return value from leaking to its callers. If it leaks
through nfs_sync_inode() back to nfs_getattr(), that causes stat(2)
to return a positive return value to user space while also not
filling in the passed-in struct stat.

Additional clean up: the new logic in nfs_wb_all() is rewritten in
bfields-normal form.
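
The shape of the fix can be sketched as follows. `normalize_commit_result()` is a hypothetical helper, not the code in nfs_wb_all(); it only shows how a positive "success" value is kept from leaking to callers that expect zero or a negative errno, with the error path handled first and success returned last.

```c
/*
 * Map the result of a commit-style call to the convention callers
 * expect: negative errno on failure, zero on success. A positive
 * value from the commit also counts as success, so it must not be
 * passed through.
 */
int normalize_commit_result(int commit_ret)
{
    if (commit_ret < 0)
        return commit_ret;  /* error: propagate unchanged */
    return 0;               /* success, whatever positive value came back */
}
```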

Fixes: 5bb89b4702e2 ("NFSv4.1/pnfs: Separate out metadata . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-05-13 14:56:03 -04:00
Al Viro
6e77137b36 don't pass nameidata to ->follow_link()
its only use is getting passed to nd_jump_link(), which can obtain
it from current->nameidata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:20:15 -04:00
Al Viro
680baacbca new ->follow_link() and ->put_link() calling conventions
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
is ignored in all cases except the last one.

Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().

b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is.  In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
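
A rough user-space model of the new convention is shown below. The `demo_follow_link()` and `demo_put_link()` functions are hypothetical and simplified: the kernel encodes errors in the returned pointer with ERR_PTR() and reserves NULL for jumps, whereas this sketch reports failure through a separate `err` output so that it stays runnable outside the kernel.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns the symlink body and stores an opaque cookie through
 * 'cookie' for later cleanup. Here the body was allocated, so the
 * cookie is the body itself (case b in the message above).
 */
const char *demo_follow_link(const char *target, void **cookie, int *err)
{
    char *body = malloc(strlen(target) + 1);

    if (!body) {
        *err = -1;          /* the kernel would return ERR_PTR(-ENOMEM) */
        return NULL;
    }
    strcpy(body, target);
    *err = 0;
    *cookie = body;         /* stored so demo_put_link() can free it */
    return body;
}

/* Receives only the opaque cookie, never the body implicitly. */
void demo_put_link(void *cookie)
{
    free(cookie);
}
```

A caller that initializes the cookie to NULL and finds it still NULL after the call would skip `demo_put_link()`, matching the "no call of ->put_link()" rule.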

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:19:45 -04:00
Linus Torvalds
59953fba87 NFS client updates for Linux 4.1

Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Another set of mainly bugfixes and a couple of cleanups.  No new
  functionality in this round.

  Highlights include:

  Stable patches:
   - Fix a regression in /proc/self/mountstats
   - Fix the pNFS flexfiles O_DIRECT support
   - Fix high load average due to callback thread sleeping

  Bugfixes:
   - Various patches to fix the pNFS layoutcommit support
   - Do not cache pNFS deviceids unless server notifications are enabled
   - Fix a SUNRPC transport reconnection regression
   - make debugfs file creation failure non-fatal in SUNRPC
   - Another fix for circular directory warnings on NFSv4 "junctioned"
     mountpoints
   - Fix locking around NFSv4.2 fallocate() support
   - Truncating NFSv4 file opens should also sync O_DIRECT writes
   - Prevent infinite loop in rpcrdma_ep_create()

  Features:
   - Various improvements to the RDMA transport code's handling of
     memory registration
   - Various code cleanups"

* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
  fs/nfs: fix new compiler warning about boolean in switch
  nfs: Remove unneeded casts in nfs
  NFS: Don't attempt to decode missing directory entries
  Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
  NFS: Rename idmap.c to nfs4idmap.c
  NFS: Move nfs_idmap.h into fs/nfs/
  NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
  NFS: Add a stub for GETDEVICELIST
  nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
  nfs: fix DIO good bytes calculation
  nfs: Fetch MOUNTED_ON_FILEID when updating an inode
  sunrpc: make debugfs file creation failure non-fatal
  nfs: fix high load average due to callback thread sleeping
  NFS: Reduce time spent holding the i_mutex during fallocate()
  NFS: Don't zap caches on fallocate()
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Add "open" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  ...
2015-04-26 17:33:59 -07:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Jens Axboe
fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
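
The effect of such a flag can be sketched in user space as below. This is a hedged model, not the kernel code: struct mock_inode, mock_direct_io() and the flag value MOCK_DIO_SKIP_DIO_COUNT are assumptions made for illustration, and the counter_updates field exists only so the example can show how many shared atomic updates each path performs.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative flag value; the real one is whatever the kernel defines. */
#define MOCK_DIO_SKIP_DIO_COUNT 0x08

/* Hypothetical inode model with one shared atomic counter. */
struct mock_inode {
	atomic_int dio_count;   /* in-flight direct I/O operations */
	int counter_updates;    /* bookkeeping for this example only */
};

static void mock_direct_io(struct mock_inode *inode, int flags)
{
	int track = !(flags & MOCK_DIO_SKIP_DIO_COUNT);

	if (track) {
		atomic_fetch_add(&inode->dio_count, 1);
		inode->counter_updates++;
	}

	/* The I/O itself would be submitted here. */

	if (track) {
		atomic_fetch_sub(&inode->dio_count, 1);
		inode->counter_updates++;
	}
}

static void demo(void)
{
	struct mock_inode file, bdev;

	atomic_init(&file.dio_count, 0);
	file.counter_updates = 0;
	atomic_init(&bdev.dio_count, 0);
	bdev.counter_updates = 0;

	mock_direct_io(&file, 0);                       /* regular file */
	mock_direct_io(&bdev, MOCK_DIO_SKIP_DIO_COUNT); /* block device */

	assert(file.counter_updates == 2);  /* one inc, one dec */
	assert(bdev.counter_updates == 0);  /* shared counter untouched */
	assert(atomic_load(&file.dio_count) == 0);
}
```

Making the skip opt-in per caller leaves the truncate protection in place for every path that does not pass the flag.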

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Trond Myklebust
21330b6670 Merge branch 'bugfixes'
* bugfixes:
  NFSv4: Return delegations synchronously in evict_inode
  SUNRPC: Fix a regression when reconnecting
  NFS: remount with security change should return EINVAL
  nfs: do not export discarded symbols
  NFSv4.1: don't export static symbol
2015-04-23 15:16:27 -04:00
Andre Przywara
c775707483 fs/nfs: fix new compiler warning about boolean in switch
The brand new GCC 5.1.0 warns by default on using a boolean in the
switch condition. This results in the following warning:

fs/nfs/nfs4proc.c: In function 'nfs4_proc_get_rootfh':
fs/nfs/nfs4proc.c:3100:10: warning: switch condition has boolean value [-Wswitch-bool]
  switch (auth_probe) {
          ^

This code was obviously using switch to make use of the fall-through
semantics (without the usual comment, though).
Rewrite that code using if statements to avoid the warning and make
the code a bit more readable on the way.
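
As a generic illustration of this kind of rewrite (not the actual nfs4proc.c change), the sketch below shows a boolean switch with fall-through in a comment and an equivalent if-based version as code. pick_status() and MOCK_WRONGSEC are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_WRONGSEC (-1)  /* hypothetical "wrong security flavor" status */

/*
 * Shape that triggers -Wswitch-bool:
 *
 *	switch (probe) {
 *	case false:
 *		status = first_status;
 *		if (status != MOCK_WRONGSEC)
 *			break;
 *		// falls through
 *	default:
 *		status = fallback_status;
 *	}
 */

/* Equivalent logic written with if statements. */
static int pick_status(bool probe, int first_status, int fallback_status)
{
	int status = MOCK_WRONGSEC;

	if (!probe)
		status = first_status;
	if (probe || status == MOCK_WRONGSEC)
		status = fallback_status;
	return status;
}

static void demo(void)
{
	/* No probe, first attempt succeeds: keep its result. */
	assert(pick_status(false, 0, 7) == 0);
	/* No probe, first attempt hits the special status: fall back. */
	assert(pick_status(false, MOCK_WRONGSEC, 7) == 7);
	/* Probe requested: go straight to the fallback. */
	assert(pick_status(true, 0, 7) == 7);
}
```

The if version states both entry conditions for the fallback explicitly, which the switch version expressed only through the missing break.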

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 15:16:16 -04:00