33562 Commits

Author SHA1 Message Date
Joseph Qi
17caf9555e ocfs2: add the missing return value check of ocfs2_xattr_get_clusters
In ocfs2_xattr_value_attach_refcount(), if error occurs when calling
ocfs2_xattr_get_clusters(), it will go with unexpected behavior since
local variables p_cluster, num_clusters and ext_flags are declared without
initialization.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:45 -07:00
Jie Liu
4704aa30fc ocfs2: fix a memory leak in __ocfs2_move_extents()
The ocfs2 path is not properly freed which leads to a memory leak at
__ocfs2_move_extents().

This patch stops the leaks of the ocfs2_path structure.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Younger Liu <younger.liu@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:44 -07:00
Joseph Qi
2b0f6eae2d ocfs2: add missing return value check of ocfs2_get_clusters()
In ocfs2_attach_refcount_tree() and ocfs2_duplicate_extent_list(), if
error occurs when calling ocfs2_get_clusters(), it will go with
unexpected behavior as local variables p_cluster, num_clusters and
ext_flags are declared without initialization.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:44 -07:00
Joseph Qi
3d94ea51c1 ocfs2: clean up dead code in ocfs2_acl_from_xattr()
In ocfs2_acl_from_xattr(), if size is less than sizeof(struct
posix_acl_entry), it returns ERR_PTR(-EINVAL) directly.  Then assign (size
/ sizeof(struct posix_acl_entry)) to count which will be at least 1, that
means the following branch (count < 0) and (count == 0) will never be
true.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:39 -07:00
Dong Fang
df53cd3b70 ocfs2: use list_for_each_entry() instead of list_for_each()
[dan.carpenter@oracle.com: fix up some NULL dereference bugs]
Signed-off-by: Dong Fang <yp.fangdong@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:36 -07:00
Sunil Mushran
8dd7903e48 fs/ocfs2/cluster/tcp.c: fix possible null pointer dereferences
Fix some possible null pointer dereferences that were detected by the
static code analyser, smatch.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:34 -07:00
Younger Liu
7e9b793707 ocfs2: ac_bits_wanted should be local_alloc_bits when returns -ENOSPC
There is an issue in reserving and claiming space for localalloc, When
localalloc space is not enough, it would claim space from global_bitmap.
And if there is not enough free space in global_bitmap, the size of
claiming space would set to half of orignal size and retry.

The issue is as follows: osb->local_alloc_bits is set to half of orignal
size in ocfs2_recalc_la_window(), but ac->ac_bits_wanted is set to
osb->local_alloc_default_bits which is not changed.  localalloc always
reserves and claims local_alloc_default_bits space and returns ENOSPC.

So, ac->ac_bits_wanted should be osb->local_alloc_bits which would be
changed.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:31 -07:00
Xue jiufei
98ac9125c5 ocfs2: dlm_request_all_locks() should deal with the status sent from target node
dlm_request_all_locks() should deal with the status sent from target node
if DLM_LOCK_REQUEST_MSG is sent successfully, or recovery master will fall
into endless loop, waiting for other nodes to send locks and
DLM_RECO_DATA_DONE_MSG to me.

        NodeA                                  NodeB
                                     selected as recovery master
                                     dlm_remaster_locks()
                                     ->dlm_request_all_locks()
                                     send DLM_LOCK_REQUEST_MSG to nodeA

It happened that NodeA cannot alloc memory when it processes this
message.  dlm_request_all_locks_handler() do not queue
dlm_request_all_locks_worker and returns -ENOMEM.  It will never send
locks and DLM_RECO_DATA_DONE_MSG to NodeB.

                                    NodeB do not deal with the status
                                    sent from nodeA, and will fall in
                                    endless loop waiting for the
                                    recovery state of NodeA to be
                                    changed.

Signed-off-by: joyce <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:31 -07:00
Junxiao Bi
f17c20dd2e ocfs2: use i_size_read() to access i_size
Though ocfs2 uses inode->i_mutex to protect i_size, there are both
i_size_read/write() and direct accesses.  Clean up all direct access to
eliminate confusion.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:30 -07:00
Younger Liu
2b1e55c389 ocfs2: lighten up allocate transaction
The issue scenario is as following:

When fallocating a very large disk space for a small file,
__ocfs2_extend_allocation attempts to get a very large transaction.  For
some journal sizes, there may be not enough room for this transaction,
and the fallocate will fail.

The patch below extends & restarts the transaction as necessary while
allocating space, and should work with even the smallest journal.  This
patch refers ext4 resize.

Test:
# mkfs.ocfs2 -b 4K -C 32K -T datafiles /dev/sdc
...(jounral size is 32M)
# mount.ocfs2 /dev/sdc /mnt/ocfs2/
# touch /mnt/ocfs2/1.log
# fallocate -o 0 -l 400G /mnt/ocfs2/1.log
fallocate: /mnt/ocfs2/1.log: fallocate failed: Cannot allocate memory
# tail -f /var/log/messages
[ 7372.278591] JBD: fallocate wants too many credits (2051 > 2048)
[ 7372.278597] (fallocate,6438,0):__ocfs2_extend_allocation:709 ERROR: status = -12
[ 7372.278603] (fallocate,6438,0):ocfs2_allocate_unwritten_extents:1504 ERROR: status = -12
[ 7372.278607] (fallocate,6438,0):__ocfs2_change_file_space:1955 ERROR: status = -12
^C
With this patch, the test works well.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:28 -07:00
Linus Torvalds
2b76db6a0f for-linus-3.12-merge minor 9p fixes and tweaks for 3.12 merge window
The first fixes namespace issues which causes a kernel
 NULL pointer dereference, the second fixes uevent
 handling to work better with udev, and the third
 switches some code to use srlcpy instead of strncpy
 in order to be safer.
 
 All changes have been baking in for-next for at least
 2 weeks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJSMJjZAAoJEDZk62b0Tg6x81sQAKa60QStBKhnL65bvG+ooIsS
 mhwfmFyaWOKw1ezwY2Vk0+JnmKDBpKmqjjwyL3nLP18TcRZStPiFdcJBKWl+czge
 FTv14t54CcjysYPbYN7+gUap4F5mfg0mcHaR0UGow505dNyjwd7mqkZhy1IqhdvP
 Ue/h0RE46GeNtdirxrKBdEfW/7TAL0tcoRgjKu0ev1V2sXCJZywuXgkzWjByRXwT
 JOg04gGnYThuek0/KUPRhf0KxB0CyKrZiics7LGb40HkYYxs7ahADACttLyiDr8l
 GntfHXLgvVlU5QcSbKRfLp0zNbi7AxWmJrwYsEwpas4tUw1Q+pVJ2EE2Ameuq5G+
 LrMGmRVQCVYw8UN+OYUO7glhXEJcCPJj6vxgm+NVXx24yaQyGI1aTsIEjHwZ/hkm
 wlQHC47z6/fIypkXpsU6pYWF/r3GwXHokYReejATQWEPIzIxvHeThe0jjqMLth7F
 zmsHZTpmECqtti1fizy5wBZD25wAIxdf+rf8nKy1VvcSN4s08ESSlC/kV/siNeko
 efFnL8xbjP5SPEVoBtXM6eTDHrQ0S+ACSGWtp0FGXKOW4PKzS60ve2Stp+FYZgQc
 WgXI7+NBU6Z9z+cZ9bsY0hrGwK1YZiR4F3KJ5ofTuxAO6n7zd+N3fGBuQJ2tiW9P
 pKtIXNozWqnAU9Wx4rGa
 =YbFT
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p updates from Eric Van Hensbergen:
 "Minor 9p fixes and tweaks for 3.12 merge window

  The first fixes namespace issues which causes a kernel NULL pointer
  dereference, the second fixes uevent handling to work better with
  udev, and the third switches some code to use srlcpy instead of
  strncpy in order to be safer.

  All changes have been baking in for-next for at least 2 weeks"

* tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: avoid accessing utsname after namespace has been torn down
  9p: send uevent after adding/removing mount_tag attribute
  fs: 9p: use strlcpy instead of strncpy
2013-09-11 12:34:13 -07:00
Linus Torvalds
53bf710832 A couple of minor additional sanity check patches for corrupted information,
and some fixes.  Apart from that there's a minor loop optimisation.
 
 These sanity checks mainly exist to trap maliciously corrupted
 filesystems either through using a deliberately modified mksquashfs,
 or where the user has deliberately chosen to generate uncompressed
 metadata and then corrupted it.
 
 Normally metadata in Squashfs filesystems is compressed, which means
 corruption (either accidental or malicious) is detected when
 trying to decompress the metadata.  So corrupted data does not normally
 get as far as the code paths in question here.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMJLmAAoJEJAch/D1fbHUXSUP/1kCIu6EgpFHtrgjiaOV2RZh
 WiQ+Hxnn6F6f6Zvt3GKwCrj0nmIiGxhPeOrg8EXt26UYlFQi2NBzqHU/kcrHwz/H
 re2BObRb50FdUb/MfqtIiyXPmjwb0AlY+XuaG8+xnc6snoZ9076T6qSEeoqvehU5
 cVqgTsFoGNpU+4ymtXUe2bgGsSZN+o0zPdqiM49b9zfHCgbe66fLrI7NTTiygdg5
 vE9mAq0WGxSv6x7SvQyl2+PIZyBpZkYth7kJ9tBk6iEXFIhCy4EGHff/U08WaT9E
 sLMKpy65E2rSxr9L5TXXquKinyWFhlx6yhBYzkJ1TKM8FmGSclmixWzQVhI+5S1B
 1IJR4W/Nq8cfiBaGTOGLtGywuvYAFRuAgjZyodiIgCU2QTxhu5/7h5HZx8ZpjO0x
 wiEuL+6iup8tzPi0x+2XoFKZpV5wi8ozKZ9hiE2dQe2ckTSKYAxUDh0dez+ZiI/W
 J1dyvwn8aB4PQMBSRM7g7PKdGrnb1ZaFzdUt9iloqX1BXmE2ZSG1sHKnBL8Lcrmq
 dRylQ1hzr3o+xVJLZKB+fWlOWZe7YiQM0uMeAXtPyT3gumlO3tqVHOxvHo4DyHx/
 1YmNBL0nSjZJMwG5TCLiV//s/qFXV4d8YN9KngIkliEgGHRvRuIWXHI37WaPt9k5
 OV6uNKK4sy2roTjc8pL/
 =r/lz
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs updates from Phillip Lougher:
 "A couple of minor additional sanity check patches for corrupted
  information, and some fixes.  Apart from that there's a minor loop
  optimisation.

  These sanity checks mainly exist to trap maliciously corrupted
  filesystems either through using a deliberately modified mksquashfs,
  or where the user has deliberately chosen to generate uncompressed
  metadata and then corrupted it.

  Normally metadata in Squashfs filesystems is compressed, which means
  corruption (either accidental or malicious) is detected when trying to
  decompress the metadata.  So corrupted data does not normally get as
  far as the code paths in question here"

* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: add corruption check for type in squashfs_readdir()
  Squashfs: add corruption check in get_dir_index_using_offset()
  Squashfs: fix corruption checks in squashfs_readdir()
  Squashfs: fix corruption checks in squashfs_lookup()
  Squashfs: fix corruption check in get_dir_index_using_name()
  Squashfs: Optimized uncompressed buffer loop
  Squashfs: sanity check information from disk
2013-09-11 12:33:12 -07:00
Weston Andros Adamson
312cd958a7 NFSv4.1: sp4_mach_cred: WARN_ON -> WARN_ON_ONCE
No need to spam the logs

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-11 09:08:08 -04:00
Weston Andros Adamson
ade33ff58e NFSv4.1: sp4_mach_cred: no need to ref count creds
The cl_machine_cred doesn't need to be reference counted here -
a reference is held is for the lifetime of the struct nfs_client.
Also, no need to put_rpccred the rpc_message.rpc_cred.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-11 09:07:53 -04:00
Weston Andros Adamson
7cb852dfc8 NFSv4.1: fix SECINFO* use of put_rpccred
Recent SP4_MACH_CRED changes allows rpc_message.rpc_cred to change,
so keep a separate pointer to the machine cred for put_rpccred.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-11 09:07:27 -04:00
Weston Andros Adamson
a02796250f NFSv4.1: sp4_mach_cred: ask for WRITE and COMMIT
Request SP4_MACH_CRED WRITE and COMMIT support in spo_must_allow list --
they're already supported by the client.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-11 09:06:43 -04:00
Linus Torvalds
cf596766fc Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "This was a very quiet cycle! Just a few bugfixes and some cleanup"

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux:
  rpc: let xdr layer allocate gssproxy receieve pages
  rpc: fix huge kmalloc's in gss-proxy
  rpc: comment on linux_cred encoding, treat all as unsigned
  rpc: clean up decoding of gssproxy linux creds
  svcrpc: remove unused rq_resused
  nfsd4: nfsd4_create_clid_dir prints uninitialized data
  nfsd4: fix leak of inode reference on delegation failure
  Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR"
  sunrpc: prepare NFS for 2038
  nfsd4: fix setlease error return
  nfsd: nfs4_file_get_access: need to be more careful with O_RDWR
2013-09-10 20:04:59 -07:00
Glauber Costa
f5e1dd3456 super: fix for destroy lrus
This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
and moves the deletion to after the shrinker is unregistered, as correctly
spotted by Dave

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:32 -04:00
Glauber Costa
5ca302c8e5 list_lru: dynamically adjust node arrays
We currently use a compile-time constant to size the node array for the
list_lru structure.  Due to this, we don't need to allocate any memory at
initialization time.  But as a consequence, the structures that contain
embedded list_lru lists can become way too big (the superblock for
instance contains two of them).

This patch aims at ameliorating this situation by dynamically allocating
the node arrays with the firmware provided nr_node_ids.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:32 -04:00
Dave Chinner
1ab6c4997e fs: convert fs shrinkers to new scan/count API
Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time.  For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e.  all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
[assorted fixes folded in]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
35163417fb xfs: fix dquot isolation hang
The new LRU list isolation code in xfs_qm_dquot_isolate() isn't
completely up to date.  Firstly, it needs conversion to return enum
lru_status values, not raw numbers. Secondly - most importantly - it
fails to unlock the dquot and relock the LRU in the LRU_RETRY path.
This leads to deadlocks in xfstests generic/232. Fix them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Andrew Morton
2f5b56f856 xfs-convert-dquot-cache-lru-to-list_lru-fix
fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
cd56a39a59 xfs: convert dquot cache lru to list_lru
Convert the XFS dquot lru to use the list_lru construct and convert the
shrinker to being node aware.

[glommer@openvz.org: edited for conflicts + warning fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
a408235726 xfs: rework buffer dispose list tracking
In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and acocunting, resulting in reference
counts being mucked up and filesystem beig unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty.  Thi avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Andrew Morton
addbda40be xfs-convert-buftarg-lru-to-generic-code-fix
fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
e80dfa1997 xfs: convert buftarg LRU to generic code
Convert the buftarg LRU to use the new generic LRU list and take advantage
of the functionality it supplies to make the buffer cache shrinker node
aware.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
9b17c62382 fs: convert inode and dentry shrinking to be node aware
Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
0ce3d74450 shrinker: add node awareness
Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for node
aware shrinkers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Glauber Costa
4e717f5c10 list_lru: remove special case function list_lru_dispose_all.
The list_lru implementation has one function, list_lru_dispose_all, with
only one user (the dentry code).  At first, such function appears to make
sense because we are really not interested in the result of isolating each
dentry separately - all of them are going away anyway.  However, it's
implementation is buggy in the following way:

When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
marking them with DCACHE_SHRINK_LIST.  However, this is done without the
nlru->lock taken.  The imediate result of that is that someone else may
add or remove the dentry from the LRU at the same time.  When list_lru_del
happens in that scenario we will see an element that is not yet marked
with DCACHE_SHRINK_LIST (even though it will be in the future) and
obviously remove it from an lru where the element no longer is.  Since
list_lru_dispose_all will in effect count down nlru's nr_items and
list_lru_del will do the same, this will lead to an imbalance.

The solution for this would not be so simple: we can obviously just keep
the lru_lock taken, but then we have no guarantees that we will be able to
acquire the dentry lock (dentry->d_lock).  To properly solve this, we need
a communication mechanism between the lru and dentry code, so they can
coordinate this with each other.

Such mechanism already exists in the form of the list_lru_walk_cb
callback.  So it is possible to construct a dcache-side prune function
that does the right thing only by calling list_lru_walk in a loop until no
more dentries are available.

With only one user, plus the fact that a sane solution for the problem
would involve boucing between dcache and list_lru anyway, I see little
justification to keep the special case list_lru_dispose_all in tree.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Dave Chinner
f604156751 dcache: convert to use new lru list infrastructure
[glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Glauber Costa
d38fa6986e inode: move inode to a different list inside lock
When removing an element from the lru, this will be done today after the lock
is released. This is a clear mistake, although we are not sure if the bugs we
are seeing are related to this. All list manipulations are done inside the
lock, and so should this one.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Tested-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
bc3b14cb2d inode: convert inode lru list to generic lru list code.
[glommer@openvz.org: adapted for new LRU return codes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
0a234c6dcb shrinker: convert superblock shrinkers to new API
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts.  The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted to
the count/scan API.  This is mainly a mechanical change.

[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
dd1f6b2e43 dcache: remove dentries from LRU before putting on dispose list
One of the big problems with modifying the way the dcache shrinker and LRU
implementation works is that the LRU is abused in several ways.  One of
these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list without
doing any accounting changes, and then use dentry_lru_prune() to remove it
from what-ever list it is now on to do the LRU accounting at that point.

This makes it -really hard- to change the LRU implementation.  The use of
the per-sb LRU lock serialises movement of the dentries between the
different lists and the removal of them, and this is the only reason that
it works.  If we want to break up the dentry LRU lock and lists into, say,
per-node lists, we remove the only serialisation that allows this lru
list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated from
the LRU list - dentries have to be removed from the LRU *before* being
placed on the dispose list.  This means that the LRU accounting and
isolation is completed before disposal is started, and that means we can
change the LRU implementation freely in future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
they are placed on the dispose list so that we don't think that parent
dentries found in try_prune_one_dentry() are on the LRU when the are
actually on the dispose list.  This would result in accounting the dentry
to the LRU a second time.  Hence dentry_lru_del() has to handle the
DCACHE_SHRINK_LIST case

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
19156840e3 dentry: move to per-sb LRU locks
With the dentry LRUs being per-sb structures, there is no real need for
a global dentry_lru_lock. The locking can be made more fine-grained by
moving to a per-sb LRU lock, isolating the LRU operations of different
filesytsems completely from each other. The need for this is independent
of any performance consideration that may arise: in the interest of
abstracting the lru operations away, it is mandatory that each lru works
around its own lock instead of a global lock for all of them.

[glommer@openvz.org: updated changelog ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Dave Chinner
62d36c7703 dcache: convert dentry_stat.nr_unused to per-cpu counters
Before we split up the dcache_lru_lock, the unused dentry counter needs to
be made independent of the global dcache_lru_lock.  Convert it to per-cpu
counters to do this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:30 -04:00
Glauber Costa
55f841ce93 super: fix calculation of shrinkable objects for small numbers
The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least more
than 100), because the aproximation errors will be negligible.  But if
this is not the case, specially when total_objects < 100, we may end up
concluding that we have no objects at all (total / 100 = 0, if total <
100).

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Glauber Costa
3942c07ccf fs: bump inode and dentry counters to long
This series reworks our current object cache shrinking infrastructure in
two main ways:

 * Noticing that a lot of users copy and paste their own version of LRU
   lists for objects, we put some effort in providing a generic version.
   It is modeled after the filesystem users: dentries, inodes, and xfs
   (for various tasks), but we expect that other users could benefit in
   the near future with little or no modification.  Let us know if you
   have any issues.

 * The underlying list_lru being proposed automatically and
   transparently keeps the elements in per-node lists, and is able to
   manipulate the node lists individually.  Given this infrastructure, we
   are able to modify the up-to-now hammer called shrink_slab to proceed
   with node-reclaim instead of always searching memory from all over like
   it has been doing.

Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock.  The locks usually disappear from the profilers with this
change.

Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.

With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim.  Historically,
those two pieces of work have been posted together.  This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested.  You can see more about the
history of such work at http://lwn.net/Articles/552769/

Dave Chinner (18):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: rework buffer dispose list tracking
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.

Glauber Costa (7):
  fs: bump inode and dentry counters to long
  super: fix calculation of shrinkable objects for small numbers
  list_lru: per-node API
  vmscan: per-node deferred work
  i915: bail out earlier when shrinker cannot acquire mutex
  hugepage: convert huge zero page shrinker to new shrinker API
  list_lru: dynamically adjust node arrays

This patch:

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc.  This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset.  So we believe it is time for a change.  This
patch just moves int to longs.  Machines where it matters should have a
big long anyway.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Dave Jones
da5338c749 Add missing unlocks to error paths of mountpoint_last.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Al Viro
bcce56d585 ... and fold the renamed __vfs_follow_link() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Christoph Hellwig
aac34df117 fs: remove vfs_follow_link
For a long time no filesystem has been using vfs_follow_link, and as seen
by recent filesystem submissions any new use is accidental as well.

Remove vfs_follow_link, document the replacement in
Documentation/filesystems/porting and also rename __vfs_follow_link
to match its only caller better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Linus Torvalds
b05430fc93 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 3 (of many) from Al Viro:
 "Waiman's conversion of d_path() and bits related to it,
  kern_path_mountpoint(), several cleanups and fixes (exportfs
  one is -stable fodder, IMO).

  There definitely will be more...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
  dcache: Translating dentry into pathname without taking rename_lock
  autofs4 - fix device ioctl mount lookup
  introduce kern_path_mountpoint()
  rename user_path_umountat() to user_path_mountpoint_at()
  take unlazy_walk() into umount_lookup_last()
  Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
  prune_super(): sb->s_op is never NULL
  exportfs: don't assume that ->iterate() won't feed us too long entries
  afs: get rid of redundant ->d_name.len checks
2013-09-10 12:44:24 -07:00
Linus Torvalds
d0d2727710 vfs: make sure we don't have a stale root path if unlazy_walk() fails
When I moved the RCU walk termination into unlazy_walk(), I didn't copy
quite all of it: for the successful RCU termination we properly add the
necessary reference counts to our temporary copy of the root path, but
for the failure case we need to make sure that any temporary root path
information is cleared out (since it does _not_ have the proper
reference counts from the RCU lookup).

We could clean up this mess by just always dropping the temporary root
information, but Al points out that that would mean that a single lookup
through symlinks could see multiple different root entries if it races
with another thread doing chroot.  Not that I think we should really
care (we had that before too, back before we had a copy of the root path
in the nameidata).

Al says he has a cunning plan.  In the meantime, this is the minimal fix
for the problem, even if it's not all that pretty.

Reported-by: Mace Moneta <moneta.mace@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-10 12:17:49 -07:00
Dave Chinner
74ffa796e1 xfs: don't assert fail on bad inode numbers
Let the inode verifier do it's work by returning an error when we
fail to find correct magic numbers in an inode buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 14:07:54 -05:00
Dave Chinner
46f9d2eb37 xfs: aborted buf items can be in the AIL.
Saw this on generic/270 after a DQALLOC transaction overrun
shutdown:

XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 952
.....
 xfs_buf_item_relse+0x4f/0xd0
 xfs_buf_item_unlock+0x1b4/0x1e0
 xfs_trans_free_items+0x7d/0xb0
 xfs_trans_cancel+0x13c/0x1b0
 xfs_symlink+0x37e/0xa60
....

When a transaction abort occured.

If we are aborting a transaction and trigger this code path, then
the item may be dirty. If the item is dirty, then it may be in the
AIL. Hence if we are aborting, we need to check if the item is in
the AIL and remove it before freeing it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 13:58:07 -05:00
Dave Chinner
fdd3cceef4 xfs: factor all the kmalloc-or-vmalloc fallback allocations
We have quite a few places now where we do:

	x = kmem_zalloc(large size)
	if (!x)
		x = kmem_zalloc_large(large size)

and do a similar dance when freeing the memory. kmem_free() already
does the correct freeing dance, and kmem_zalloc_large() is only ever
called in these constructs, so just factor it all into
kmem_zalloc_large() and kmem_free().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 13:57:03 -05:00
Dave Chinner
2dc164f296 xfs: fix memory allocation failures with ACLs
Ever since increasing the number of supported ACLs from 25 to as
many as can fit in an xattr, there have been reports of order 4
memory allocations failing in the ACL code. Fix it in the same way
we've fixed all the xattr read/write code that has the same problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 13:56:25 -05:00
Dave Chinner
0a4edc8f0b xfs: ensure we copy buffer type in da btree root splits
When splitting the root of the da btree, we shuffled data between
buffers and the structures that track them. At one point, we copy
data and state from one buffer to another, including the ops
associated with the buffer. When we do this, we also need to copy
the buffer type associated with the buf log item so that the buffer
is logged correctly. If we don't do that, log recovery won't
recognise it and hence it won't recalculate the CRC on the buffer
after recovery. This leads to a directory block that can't be read
after recovery has run.

Found by inspection after finding the same problem with remote
symlink buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 13:34:05 -05:00
Dave Chinner
daf7b799a9 xfs: set remote symlink buffer type for recovery
The logging of a remote symlink block does not set the buffer type
being logged, and hence on recovery the type of buffer is not
recognised and hence CRCs are not calculated after replay. This
results in log recoery throwing:

XFS (vdc): Unknown buffer type 0

errors, and subsequent reads of the symlink failing CRC
verification. Found via fsstress + godown.

Reported by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 12:57:09 -05:00
Dave Chinner
638f44163d xfs: recovery of swap extents operations for CRC filesystems
This is the recovery side of the btree block owner change operation
performed by swapext on CRC enabled filesystems. We detect that an
owner change is needed by the flag that has been placed on the inode
log format flag field. Because the inode recovery is being replayed
after the buffers that make up the BMBT in the given checkpoint, we
can walk all the buffers and directly modify them when we see the
flag set on an inode.

Because the inode can be relogged and hence present in multiple
chekpoints with the "change owner" flag set, we could do multiple
passes across the inode to do this change. While this isn't optimal,
we can't directly ignore the flag as there may be multiple
independent swap extent operations being replayed on the same inode
in different checkpoints so we can't ignore them.

Further, because the owner change operation uses ordered buffers, we
might have buffers that are newer on disk than the current
checkpoint and so already have the owner changed in them. Hence we
cannot just peek at a buffer in the tree and check that it has the
correct owner and assume that the change was completed.

So, for the moment just brute force the owner change every time we
see an inode with the flag set. Note that we have to be careful here
because the owner of the buffers may point to either the old owner
or the new owner. Currently the verifier can't verify the owner
directly, so there is no failure case here right now. If we verify
the owner exactly in future, then we'll have to take this into
account.

This was tested in terms of normal operation via xfstests - all of
the fsr tests now pass without failure. however, we really need to
modify xfs/227 to stress v3 inodes correctly to ensure we fully
cover this case for v5 filesystems.

In terms of recovery testing, I used a hacked version of xfs_fsr
that held the temp inode open for a few seconds before exiting so
that the filesystem could be shut down with an open owner change
recovery flags set on at least the temp inode. fsr leaves the temp
inode unlinked and in btree format, so this was necessary for the
owner change to be reliably replayed.

logprint confirmed the tmp inode in the log had the correct flag set:

INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
        INODE: #regs:3   ino:0x44  flags:0x209   dsize:88
	                                 ^^^^^

0x200 is set, indicating a data fork owner change needed to be
replayed on inode 0x44.  A printk in the revoery code confirmed that
the inode change was recovered:

XFS (vdc): Mounting Filesystem
XFS (vdc): Starting recovery (logdev: internal)
recovering owner change ino 0x44
XFS (vdc): Version 5 superblock detected. This kernel L support enabled!
Use of these features in this kernel is at your own risk!
XFS (vdc): Ending recovery (logdev: internal)

The script used to test this was:

$ cat ./recovery-fsr.sh
#!/bin/bash

dev=/dev/vdc
mntpt=/mnt/scratch
testfile=$mntpt/testfile

umount $mntpt
mkfs.xfs -f -m crc=1 $dev
mount $dev $mntpt
chmod 777 $mntpt

for i in `seq 10000 -1 0`; do
        xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
done
xfs_bmap -vp $testfile |head -20

xfs_fsr -d -v $testfile &
sleep 10
/home/dave/src/xfstests-dev/src/godown -f $mntpt
wait
umount $mntpt

xfs_logprint -t $dev |tail -20
time mount $dev $mntpt
xfs_bmap -vp $testfile
umount $mntpt
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-10 12:49:57 -05:00