Commit Graph

16475 Commits

Author SHA1 Message Date
Erez Zadok
0d132f7364 ecryptfs: don't ignore return value from lock_rename
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:35:59 -06:00
Erez Zadok
e27759d7a3 ecryptfs: initialize private persistent file before dereferencing pointer
Ecryptfs_open dereferences a pointer to the private lower file (the one
stored in the ecryptfs inode), without checking if the pointer is NULL.
Right afterward, it initializes that pointer if it is NULL.  Swap order of
statements to first initialize.  Bug discovered by Duckjin Kang.

Signed-off-by: Duckjin Kang <fromdj2k@gmail.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:54 -06:00
Tyler Hicks
38e3eaeedc eCryptfs: Remove mmap from directory operations
Adrian reported that mkfontscale didn't work inside of eCryptfs mounts.
Strace revealed the following:

open("./", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fcntl64(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
open("./fonts.scale", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
getdents(3, /* 80 entries */, 32768) = 2304
open("./.", O_RDONLY) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
fstat64(5, {st_mode=S_IFDIR|0755, st_size=16384, ...}) = 0
mmap2(NULL, 16384, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7fcf000
close(5) = 0
--- SIGBUS (Bus error) @ 0 (0) ---
+++ killed by SIGBUS +++

The mmap2() on a directory was successful, resulting in a SIGBUS
signal later.  This patch removes mmap() from the list of possible
ecryptfs_dir_fops so that mmap() isn't possible on eCryptfs directory
files.

https://bugs.launchpad.net/ecryptfs/+bug/400443

Reported-by: Adrian C. <anrxc@sysphere.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:11 -06:00
Tyler Hicks
f8f484d1b6 eCryptfs: Add getattr function
The i_blocks field of an eCryptfs inode cannot be trusted, but
generic_fillattr() uses it to instantiate the blocks field of a stat()
syscall when a filesystem doesn't implement its own getattr().  Users
have noticed that the output of du is incorrect on newly created files.

This patch creates ecryptfs_getattr() which calls into the lower
filesystem's getattr() so that eCryptfs can use its kstat.blocks value
after calling generic_fillattr().  It is important to note that the
block count includes the eCryptfs metadata stored in the beginning of
the lower file plus any padding used to fill an extent before
encryption.

https://bugs.launchpad.net/ecryptfs/+bug/390833

Reported-by: Dominic Sacré <dominic.sacre@gmx.de>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:09 -06:00
Tyler Hicks
5f3ef64f4d eCryptfs: Use notify_change for truncating lower inodes
When truncating inodes in the lower filesystem, eCryptfs directly
invoked vmtruncate(). As Christoph Hellwig pointed out, vmtruncate() is
a filesystem helper function, but filesystems may need to do more than
just a call to vmtruncate().

This patch moves the lower inode truncation out of ecryptfs_truncate()
and renames the function to truncate_upper().  truncate_upper() updates
an iattr for the lower inode to indicate if the lower inode needs to be
truncated upon return.  ecryptfs_setattr() then calls notify_change(),
using the updated iattr for the lower inode, to complete the truncation.

For eCryptfs functions needing to truncate, ecryptfs_truncate() is
reintroduced as a simple way to truncate the upper inode to a specified
size and then truncate the lower inode accordingly.

https://bugs.launchpad.net/bugs/451368

Reported-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:07 -06:00
Linus Torvalds
1e868d8e6d Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: xfs_swap_extents needs to handle dynamic fork offsets
  xfs: fix missing error check in xfs_rtfree_range
  xfs: fix stale inode flush avoidance
  xfs: Remove inode iolock held check during allocation
  xfs: reclaim all inodes by background tree walks
  xfs: Avoid inodes in reclaim when flushing from inode cache
  xfs: reclaim inodes under a write lock
2010-01-18 14:08:07 -08:00
Linus Torvalds
7dc9c484a7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  do_add_mount() should sanitize mnt_flags
  CIFS shouldn't make mountpoints shrinkable
  mnt_flags fixes in do_remount()
  attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()
  may_umount() needs namespace_sem
  Fix configfs leak
  Fix the -ESTALE handling in do_filp_open()
  ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error path
  Fix ACC_MODE() for real
  Unrot uml mconsole a bit
  hppfs: handle ->put_link()
  Kill 9p readlink()
  fix autofs/afs/etc. magic mountpoint breakage
2010-01-17 11:01:16 -08:00
David Howells
7e6608724c nommu: fix shared mmap after truncate shrinkage problems
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
over the end of a truncation.  The problem is that
ramfs_nommu_check_mappings() checks that the reduced file size against the
VMA tree, but not the vm_region tree.

The following sequence of events can cause the problem:

	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
	ftruncate(fd, 32 * 1024);
	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	munmap(a, 32 * 1024);
	ftruncate(fd, 16 * 1024);
	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
sees that the vm_region from 'a' is covering the region it wants and so
shares it, pinning it in memory.

Mapping 'a' then goes away and the file is truncated to the end of VMA
'b'.  However, the region allocated by 'a' is still in effect, and has
_not_ been reduced.

Mapping 'c' is then created, and because there's a vm_region covering the
desired region, get_unmapped_area() is _not_ called to repeat the check,
and the mapping is granted, even though the pages from the latter half of
the mapping have been discarded.

However:

	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'd' should work, and should end up sharing the region allocated by
'a'.

To deal with this, we shrink the vm_region struct during the truncation,
lest do_mmap_pgoff() take it as licence to share the full region
automatically without calling the get_unmapped_area() file op again.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:40 -08:00
David Howells
81759b5b22 nommu: fix race between ramfs truncation and shared mmap
Fix the race between the truncation of a ramfs file and an attempt to make
a shared mmap of region of that file.

The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to
verify that the file region is made of contiguous pages and to find its
base address - but there isn't any locking to guarantee this region until
vma_prio_tree_insert() is called by add_vma_to_mm().

Note that moving the functionality into f_op->mmap() doesn't help as that
is also called before vma_prio_tree_insert().

Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it
does its checks.  This means that this function will wait whilst mmaps
take place.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:40 -08:00
Al Viro
27d55f1f4c do_add_mount() should sanitize mnt_flags
MNT_WRITE_HOLD shouldn't leak into new vfsmount and neither
should MNT_SHARED (the latter will be set properly, along with
the rest of shared-subtree data structures)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:07:36 -05:00
Al Viro
7e1295d9f8 CIFS shouldn't make mountpoints shrinkable
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:06:32 -05:00
Al Viro
7b43a79f32 mnt_flags fixes in do_remount()
* need vfsmount_lock over modifying it
* need to preserve MNT_SHARED/MNT_UNBINDABLE

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:01:26 -05:00
Al Viro
df1a1ad297 attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()
race in mnt_flags update

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 12:57:40 -05:00
Al Viro
8ad08d8a0c may_umount() needs namespace_sem
otherwise it races with clone_mnt() changing mnt_share/mnt_slaves

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 12:56:08 -05:00
Eric Paris
976ae32be4 inotify: only warn once for inotify problems
inotify will WARN() if it finds that the idr and the fsnotify internals
somehow got out of sync.  It was only supposed to do this once but due
to this stupid bug it would warn every single time a problem was
detected.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15 14:49:23 -08:00
Eric Paris
9e572cc987 inotify: do not reuse watch descriptors
Since commit 7e790dd5fc ("inotify: fix
error paths in inotify_update_watch") inotify changed the manor in which
it gave watch descriptors back to userspace.  Previous to this commit
inotify acted like the following:

  inotify_add_watch(X, Y, Z) = 1
  inotify_rm_watch(X, 1);
  inotify_add_watch(X, Y, Z) = 2

but after this patch inotify would return watch descriptors like so:

  inotify_add_watch(X, Y, Z) = 1
  inotify_rm_watch(X, 1);
  inotify_add_watch(X, Y, Z) = 1

which I saw as equivalent to opening an fd where

  open(file) = 1;
  close(1);
  open(file) = 1;

seemed perfectly reasonable.  The issue is that quite a bit of userspace
apparently relies on the behavior in which watch descriptors will not be
quickly reused.  KDE relies on it, I know some selinux packages rely on
it, and I have heard complaints from other random sources such as debian
bug 558981.

Although the man page implies what we do is ok, we broke userspace so
this patch almost reverts us to the old behavior.  It is still slightly
racey and I have patches that would fix that, but they are rather large
and this will fix it for all real world cases.  The race is as follows:

 - task1 creates a watch and blocks in idr_new_watch() before it updates
   the hint.
 - task2 creates a watch and updates the hint.
 - task1 updates the hint with it's older wd
 - task removes the watch created by task2
 - task adds a new watch and will reuse the wd originally given to task2

it requires moving some locking around the hint (last_wd) but this should
solve it for the real world and be -stable safe.

As a side effect this patch papers over a bug in the lib/idr code which
is causing a large number WARN's to pop on people's system and many
reports in kerneloops.org.  I'm working on the root cause of that idr
bug seperately but this should make inotify immune to that issue.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15 14:49:23 -08:00
Dave Chinner
e09f98606d xfs: xfs_swap_extents needs to handle dynamic fork offsets
When swapping extents, we can corrupt inodes by swapping data forks
that are in incompatible formats.  This is caused by the two indoes
having different fork offsets due to the presence of an attribute
fork on an attr2 filesystem.  xfs_fsr tries to be smart about
setting the fork offset, but the trick it plays only works on attr1
(old fixed format attribute fork) filesystems.

Changing the way xfs_fsr sets up the attribute fork will prevent
this situation from ever occurring, so in the kernel code we can get
by with a preventative fix - check that the data fork in the
defragmented inode is in a format valid for the inode it is being
swapped into.  This will lead to files that will silently and
potentially repeatedly fail defragmentation, so issue a warning to
the log when this particular failure occurs to let us know that
xfs_fsr needs updating/fixing.

To help identify how to improve xfs_fsr to avoid this issue, add
trace points for the inodes being swapped so that we can determine
why the swap was rejected and to confirm that the code is making the
right decisions and modifications when swapping forks.

A further complication is even when the swap is allowed to proceed
when the fork offset is different between the two inodes then value
for the maximum number of extents the data fork can hold can be
wrong. Make sure these are also set correctly after the swap occurs.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:49:07 -06:00
Dave Chinner
3daeb42c13 xfs: fix missing error check in xfs_rtfree_range
When xfs_rtfind_forw() returns an error, the block is returned
uninitialised.  xfs_rtfree_range() is not checking the error return,
so could be using an uninitialised block number for modifying bitmap
summary info.

The problem was found by gcc when compiling the *userspace* libxfs
code - it is an copy of the kernel code with the exact same bug.
gcc gives an uninitialised variable warning on the userspace code
but not on the kernel code. You gotta love the consistency (Mmmm,
slightly chewy today!).

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:46:19 -06:00
Dave Chinner
4b6a46882c xfs: fix stale inode flush avoidance
When reclaiming stale inodes, we need to guarantee that inodes are
unpinned before returning with a "clean" status. If we don't we can
reclaim inodes that are pinned, leading to use after free in the
transaction subsystem as transactions complete.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:46:02 -06:00
Dave Chinner
126976c7c1 xfs: Remove inode iolock held check during allocation
lockdep complains about a the lock not being initialised as we do an
ASSERT based check that the lock is not held before we initialise it
to catch inodes freed with the lock held.

lockdep does this check for us in the lock initialisation code, so
remove the ASSERT to stop the lockdep warning.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:45:33 -06:00
Dave Chinner
57817c6822 xfs: reclaim all inodes by background tree walks
We cannot do direct inode reclaim without taking the flush lock to
ensure that we do not reclaim an inode under IO. We check the inode
is clean before doing direct reclaim, but this is not good enough
because the inode flush code marks the inode clean once it has
copied the in-core dirty state to the backing buffer.

It is the flush lock that determines whether the inode is still
under IO, even though it is marked clean, and the inode is still
required at IO completion so we can't reclaim it even though it is
clean in core. Hence the requirement that we need to take the flush
lock even on clean inodes because this guarantees that the inode
writeback IO has completed and it is safe to reclaim the inode.

With delayed write inode flushing, we coul dend up waiting a long
time on the flush lock even for a clean inode. The background
reclaim already handles this efficiently, so avoid all the problems
by killing the direct reclaim path altogether.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:44:44 -06:00
Dave Chinner
018027be90 xfs: Avoid inodes in reclaim when flushing from inode cache
The reclaim code will handle flushing of dirty inodes before reclaim
occurs, so avoid them when determining whether an inode is a
candidate for flushing to disk when walking the radix trees.  This
is based on a test patch from Christoph Hellwig.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:44:21 -06:00
Dave Chinner
c8e20be020 xfs: reclaim inodes under a write lock
Make the inode tree reclaim walk exclusive to avoid races with
concurrent sync walkers and lookups. This is a version of a patch
posted by Christoph Hellwig that avoids all the code duplication.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15 13:43:55 -06:00
Al Viro
9b6e310211 Fix configfs leak
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:42 -05:00
Al Viro
9850c05655 Fix the -ESTALE handling in do_filp_open()
Instead of playing sick games with path saving, cleanups, just retry
the entire thing once with LOOKUP_REVAL added.  Post-.34 we'll convert
all -ESTALE handling in there to that style, rather than playing with
many retry loops deep in the call chain.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:26 -05:00
OGAWA Hirofumi
806892e9e1 ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error path
If ->follow_link handler return the error, it should decrement
nd->path refcnt. But, ecryptfs_follow_link() doesn't decrement.

This patch fix it by using usual nd_set_link() style error handling,
instead of playing with nd->path.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:26 -05:00
Al Viro
6d125529c6 Fix ACC_MODE() for real
commit 5300990c03 had stepped on a rather
nasty mess: definitions of ACC_MODE used to be different.  Fixed the
resulting breakage, converting them to variant that takes O_... value;
all callers have that and it actually simplifies life (see tomoyo part
of changes).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:26 -05:00
Al Viro
7b264fc2be hppfs: handle ->put_link()
current code works only because nothing in procfs has non-trivial
->put_link().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:25 -05:00
Al Viro
204f2f0e82 Kill 9p readlink()
For symlinks generic_readlink() will work just fine and for directories
we don't want ->readlink() at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:25 -05:00
Al Viro
86acdca1b6 fix autofs/afs/etc. magic mountpoint breakage
We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT)
if /mnt/tmp is an autofs direct mount.  The reason is that nd.last_type
is bogus here; we want LAST_BIND for everything of that kind and we
get LAST_NORM left over from finding parent directory.

So make sure that it *is* set properly; set to LAST_BIND before
doing ->follow_link() - for normal symlinks it will be changed
by __vfs_follow_link() and everything else needs it set that way.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14 09:05:25 -05:00
Linus Torvalds
e80c14e1ae Merge branch 'fasync-helper'
* fasync-helper:
  fasync: split 'fasync_helper()' into separate add/remove functions
2010-01-13 13:42:49 -08:00
Dave Chinner
2c761270d5 lib: Introduce generic list_sort function
There are two copies of list_sort() in the tree already, one in the DRM
code, another in ubifs.  Now XFS needs this as well.  Create a generic
list_sort() function from the ubifs version and convert existing users
to it so we don't end up with yet another copy in the tree.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Acked-by: Dave Airlie <airlied@redhat.com>
Acked-by: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-12 21:02:00 -08:00
Linus Torvalds
1b4d40a517 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: Ensure we force all busy extents in range to disk
  xfs: Don't flush stale inodes
  xfs: fix timestamp handling in xfs_setattr
  xfs: use DECLARE_EVENT_CLASS
2010-01-11 09:48:48 -08:00
Linus Torvalds
79ecb043ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use MAX_LFS_FILESIZE for meta inode size
  GFS2: Fix gfs2_xattr_acl_chmod()
  GFS2: Fix locking bug in rename
  GFS2: Ensure uptodate inode size when using O_APPEND
2010-01-11 09:48:29 -08:00
Linus Torvalds
db1fc95744 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix dquot_transfer for filesystems different from ext4
2010-01-11 09:48:14 -08:00
Minchan Kim
7f53a09ed4 smaps: fix wrong rss count
A long time ago we regarded zero page as file_rss and vm_normal_page
doesn't return NULL.

But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can
return NULL in case of zero page.  Also we don't count it with file_rss
any more.

Then, RSS and PSS can't be matched.  For consistency, Let's ignore zero
page in smaps_pte_range.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:07 -08:00
KOSAKI Motohiro
1306d603fc proc: partially revert "procfs: provide stack information for threads"
Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status.  But it cause large
performance regression.  Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component.  Because both to
take mmap_sem and page table walk are heavily operation.

If many process run, the ps performance is,

[before d899bf7b]

% perf stat ps >/dev/null

 Performance counter stats for 'ps':

     4090.435806  task-clock-msecs         #      0.032 CPUs
             229  context-switches         #      0.000 M/sec
               0  CPU-migrations           #      0.000 M/sec
             234  page-faults              #      0.000 M/sec
      8587565207  cycles                   #   2099.425 M/sec
      9866662403  instructions             #      1.149 IPC
      3789415411  cache-references         #    926.409 M/sec
        30419509  cache-misses             #      7.437 M/sec

   128.859521955  seconds time elapsed

[after d899bf7b]

% perf stat  ps  > /dev/null

 Performance counter stats for 'ps':

     4305.081146  task-clock-msecs         #      0.028 CPUs
             480  context-switches         #      0.000 M/sec
               2  CPU-migrations           #      0.000 M/sec
             237  page-faults              #      0.000 M/sec
      9021211334  cycles                   #   2095.480 M/sec
     10605887536  instructions             #      1.176 IPC
      3612650999  cache-references         #    839.160 M/sec
        23917502  cache-misses             #      5.556 M/sec

   152.277819582  seconds time elapsed

Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.

Commit d899bf7b introduced two features:

 1) Add the annotattion of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
 2) Add StackUsage field to /proc/{pid}/status.

I only revert (2), because I haven't seen (1) cause regression.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:06 -08:00
Jan Kara
05b5d89823 quota: Fix dquot_transfer for filesystems different from ext4
Commit fd8fbfc1 modified the way we find amount of reserved space
belonging to an inode. The amount of reserved space is checked
from dquot_transfer and thus inode_reserved_space gets called
even for filesystems that don't provide get_reserved_space callback
which results in a BUG.

Fix the problem by checking get_reserved_space callback and return 0 if
the filesystem does not provide it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-01-11 13:06:41 +01:00
Steven Whitehouse
ba198098a2 GFS2: Use MAX_LFS_FILESIZE for meta inode size
Using ~0ULL was cauing sign issues in filemap_fdatawrite_range, so
use MAX_LFS_FILESIZE instead.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-11 08:57:55 +00:00
Dave Chinner
fd45e47841 xfs: Ensure we force all busy extents in range to disk
When we search for and find a busy extent during allocation we
force the log out to ensure the extent free transaction is on
disk before the allocation transaction. The current implementation
has a subtle bug in it--it does not handle multiple overlapping
ranges.

That is, if we free lots of little extents into a single
contiguous extent, then allocate the contiguous extent, the busy
search code stops searching at the first extent it finds that
overlaps the allocated range. It then uses the commit LSN of the
transaction to force the log out to.

Unfortunately, the other busy ranges might have more recent
commit LSNs than the first busy extent that is found, and this
results in xfs_alloc_search_busy() returning before all the
extent free transactions are on disk for the range being
allocated. This can lead to potential metadata corruption or
stale data exposure after a crash because log replay won't replay
all the extent free transactions that cover the allocation range.

Modified-by: Alex Elder <aelder@sgi.com>

(Dropped the "found" argument from the xfs_alloc_busysearch trace
event.)

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:22:02 -06:00
Dave Chinner
44e08c45cc xfs: Don't flush stale inodes
Because inodes remain in cache much longer than inode buffers do
under memory pressure, we can get the situation where we have
stale, dirty inodes being reclaimed but the backing storage has
been freed.  Hence we should never, ever flush XFS_ISTALE inodes
to disk as there is no guarantee that the backing buffer is in
cache and still marked stale when the flush occurs.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:22:00 -06:00
Christoph Hellwig
d6d59bada3 xfs: fix timestamp handling in xfs_setattr
We currently have some rather odd code in xfs_setattr for
updating the a/c/mtime timestamps:

 - first we do a non-transaction update if all three are updated
   together
 - second we implicitly update the ctime for various changes
   instead of relying on the ATTR_CTIME flag
 - third we set the timestamps to the current time instead of the
   arguments in the iattr structure in many cases.

This patch makes sure we update it in a consistent way:

 - always transactional
 - ctime is only updated if ATTR_CTIME is set or we do a size
   update, which is a special case
 - always to the times passed in from the caller instead of the
   current time

The only non-size caller of xfs_setattr that doesn't come from
the VFS is updated to set ATTR_CTIME and pass in a valid ctime
value.

Reported-by: Eric Blake <ebb9@byu.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:21:58 -06:00
Christoph Hellwig
ea9a48881e xfs: use DECLARE_EVENT_CLASS
Using DECLARE_EVENT_CLASS allows us to to use trace event code
instead of duplicating it in the binary.  This was not available
before 2.6.33 so it had to be done as a separate step once the
prerequisite was merged.

This only requires changes to xfs_trace.h and the results are
rather impressive:

hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
text	   data	    bss	    dec	    hex	filename
 607732	  41884	   3616	 653232	  9f7b0	fs/xfs/xfs.o
1026732	  41884	   3808	1072424	 105d28	fs/xfs/xfs.o.old

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:21:56 -06:00
Linus Torvalds
82062e7b50 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks
  reiserfs: Fix unreachable statement
  reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock
  reiserfs: Relax lock on xattr removing
  reiserfs: Relax the lock before truncating pages
  reiserfs: Fix recursive lock on lchown
  reiserfs: Fix mistake in down_write() conversion
2010-01-08 14:03:55 -08:00
Linus Torvalds
dbd6a7cfea Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: kill some warnings on i386 builds
2010-01-08 13:57:32 -08:00
Linus Torvalds
e2b6d02cca Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs: fix oops in nfs_rename()
  sunrpc: fix build-time warning
  sunrpc: on successful gss error pipe write, don't return error
  SUNRPC: Fix the return value in gss_import_sec_context()
  SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos()
2010-01-08 13:55:14 -08:00
Dave Chinner
a539bd8c86 xfs: kill some warnings on i386 builds
Randy Dunlap Reported printk() format-related warnings reported
on i386 builds in his environment.  Dave Chinner provided this
patch to eliminate them.

Signed-off by: Dave Chinner <david@fromorbit.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>

Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08 13:32:29 -06:00
Steven Whitehouse
e412bdb126 GFS2: Fix gfs2_xattr_acl_chmod()
The ref counting for the bh returned by gfs2_ea_find() was
wrong. This patch ensures that we always drop the ref count
to that bh correctly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:59 +00:00
Steven Whitehouse
24b977b5fd GFS2: Fix locking bug in rename
The rename code was taking a resource group lock in cases where
it wasn't actually needed, this caused problems if the rename
was resulting in an inode being unlinked. The patch ensures that
we only take the rgrp lock early if it is really needed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:42 +00:00
Steven Whitehouse
56aa616a03 GFS2: Ensure uptodate inode size when using O_APPEND
The VFS reads the inode size during generic_file_aio_write() but
with no locking around it. In order to get the expected result
from O_APPEND opens, this patch updated the inode size before
calling generic_file_aio_write()

There is of course still a race here, in that there is nothing to
prevent another node coming in and extending the file in the
mean time. On the other hand, when used with file locking this
will ensure that the expected results are obtained.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:27 +00:00