Commit Graph

26192 Commits

Author SHA1 Message Date
Trond Myklebust
0cb3284b53 NFSv4.1: Get rid of NFS4CLNT_LAYOUTRECALL
The NFS4CLNT_LAYOUTRECALL bit is a long-term impediment to scalability. It
basically stops all other recalls by a given server once any layout recall
is requested.

If the recall is for a different file, then we don't care.
If the recall applies to the same file, then we're in one of two situations:
Either we are in the case of a replay of an existing request, in which case
the session is supposed to deal with matters, or we are dealing with a
completely different request, in which case we should just try to process
it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-03-01 11:17:50 -05:00
Trond Myklebust
a59c30acfb NFSv4.1: Get rid of redundant NFS4CLNT_LAYOUTRECALL tests
The NFS4CLNT_LAYOUTRECALL tests in pnfs_layout_process and
pnfs_update_layout are redundant.

In the case of a bulk layout recall, we're always testing for
the NFS_LAYOUT_BULK_RECALL flay anyway.
In the case of a file or segment recall, the call to
pnfs_set_layout_stateid() updates the layout_header 'barrier'
sequence id, which triggers the test in pnfs_layoutgets_blocked()
and is less race-prone than NFS4CLNT_LAYOUTRECALL anyway.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-03-01 11:17:47 -05:00
Bob Peterson
a08fd280b5 GFS2: Unlock rindex mutex on glock error
This patch fixes an error path in function gfs2_rindex_update
that leaves the rindex mutex held.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-01 09:25:21 +00:00
Christoph Hellwig
4b217ed9e3 quota: make Q_XQUOTASYNC a noop
Now that XFS takes quota reservations into account there is no need to flush
anything before reporting quotas - in addition to beeing fully transactional
all quota information is also 100% coherent with the rest of the filesystem
now.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-29 14:10:42 -06:00
Christoph Hellwig
8960501191 xfs: include reservations in quota reporting
Report all quota usage including the currently pending reservations.  This
avoids the need to flush delalloc space before gathering quota information,
and matches quota enforcement, which already takes the reservations into
account.

This fixes xfstests 270.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-29 14:09:06 -06:00
Christoph Hellwig
18535a7e01 xfs: merge xfs_qm_export_dquot into xfs_qm_scall_getquota
The is no good reason to have these two separate, and for the next change
we would need the full struct xfs_dquot in xfs_qm_export_dquot, so better
just fold the code now instead of changing it spuriously.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-29 11:57:36 -06:00
Artem Bityutskiy
7ca58bad69 UBIFS: kill CUR_MAX_KEY_LEN macro
It is useless and confusing and may make people believe they may just
change it, which is not true, because this will also change the on-flash
format.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-02-29 18:43:01 +02:00
Artem Bityutskiy
c43be1085f UBIFS: do not use inc_link when i_nlink is zero
This patch changes the 'i_nlink' counter handling in 'ubifs_unlink()',
'ubifs_rmdir()' and 'ubifs_rename()'. In these function  'i_nlink' may become 0,
and if 'ubifs_jnl_update()' failed, we would use 'inc_nlink()' to restore
the previous 'i_nlink' value, which is incorrect from the VFS point of view and
would cause a 'WARN_ON()' (see 'inc_nlink() implementation).

This patches saves the previous 'i_nlink' value in a local variable and uses it
at the error path instead of calling 'inc_nlink()'. We do this only for the
inodes where 'i_nlink' may potentially become zero.

This change has been requested by Al Viro <viro@ZenIV.linux.org.uk>.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-02-29 16:10:20 +02:00
Artem Bityutskiy
b06283c7df UBIFS: make the dbg_lock spinlock static
Remove the usage of the 'dbg_lock' spinlock from 'dbg_err()' and make
it static.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-02-29 16:10:20 +02:00
Artem Bityutskiy
16c395ca72 UBIFS: increase dumps loglevel
Most of the time we use the dumping function to dump something in case
of error. We use 'KERN_DEBUG' printk level, and the drawback is that users
do not see them in the console, while they see the other error messages
in the console. The result is that they send bug reports which does not
contain a lot of useful information. This patch changes the printk level
of the dump functions to 'KERN_ERR' to correct the situation.

I documented it in the MTD web site that people have to send the 'dmesg' output
when submitting bug reposts - it did not help.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-02-29 16:10:20 +02:00
Artem Bityutskiy
78437368c8 UBIFS: amend recovery debugging message
Print LEB and offset as well.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-02-29 16:10:20 +02:00
Randy Dunlap
164974a8f2 ecryptfs: fix printk format warning for size_t
Fix printk format warning (from Linus's suggestion):

on i386:
  fs/ecryptfs/miscdev.c:433:38: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int'

and on x86_64:
  fs/ecryptfs/miscdev.c:433:38: warning: format '%u' expects type 'unsigned int', but argument 4 has type 'long unsigned int'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	Geert Uytterhoeven <geert@linux-m68k.org>
Cc:	Tyler Hicks <tyhicks@canonical.com>
Cc:	Dustin Kirkland <dustin.kirkland@gazzang.com>
Cc:	ecryptfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-28 16:55:30 -08:00
Paul Gortmaker
630d9c4727 fs: reduce the use of module.h wherever possible
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include.  Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-28 19:31:58 -05:00
Steven Whitehouse
08728f2d8b GFS2: Make bd_cmp() static
Add missing static to bd_cmp()

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 17:11:27 +00:00
Bob Peterson
4a36d08d0d GFS2: Sort the ordered write list
This patch sorts the ordered write list for GFS2 writes.
This increases the throughput for simultaneous writes.
For example, if you have ten processes, all doing:
dd if=/dev/zero of=/mnt/gfs2/fileX
on different files, the throughput will be much better.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 17:10:53 +00:00
Steven Whitehouse
66fc061bda GFS2: FITRIM ioctl support
The FITRIM ioctl provides an alternative way to send discard requests to
the underlying device. Using the discard mount option results in every
freed block generating a discard request to the block device. This can
be slow, since many block devices can only process discard requests of
larger sizes, and also such operations can be time consuming.

Rather than using the discard mount option, FITRIM allows a sweep of the
filesystem on an occasional basis, and also to optionally avoid sending
down discard requests for smaller regions.

In GFS2 FITRIM will work at resource group granularity. There is a flag
for each resource group which keeps track of which resource groups have
been trimmed. This flag is reset whenever a deallocation occurs in the
resource group, and set whenever a successful FITRIM of that resource
group has taken place. This helps to reduce repeated discard requests
for the same block ranges, again improving performance.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 17:10:21 +00:00
Steven Whitehouse
47ac5537a7 GFS2: Move two functions from log.c to lops.c
gfs2_log_get_buf() and gfs2_log_fake_buf() are both used
only in lops.c, so move them next to their callers and they
can then become static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 17:09:59 +00:00
Steven Whitehouse
a245769f25 GFS2: glock statistics gathering
The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type.

In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block statistics are used to provide default values for
most of the glock statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to calculation
of round trip times in network code.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive (b) the requested state is either null
or unlocked or (c) the "try lock" flag is set. A blocking
request covers all the other lock requests.

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
is counting queueing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
allocation (to base it on lock wait time, rather than blindly
using a "try lock")
Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

The other point to remember is that all times are in
nanoseconds. Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 17:09:42 +00:00
Steven Whitehouse
a365fbf354 GFS2: Read resource groups on mount
This makes mount take slightly longer, but at the same time, the first
write to the filesystem will be faster too. It also means that if there
is a problem in the resource index, then we can refuse to mount rather
than having to try and report that when the first write occurs.

In addition, to avoid recursive locking, we hvae to take account of
instances when the rindex glock may already be held when we are
trying to update the rbtree of resource groups.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 09:52:39 +00:00
Bob Peterson
9e73f571ea GFS2: Ensure rindex is uptodate for fallocate
This patch fixes a problem whereby gfs2_grow was failing and causing GFS2
to assert. The problem was that when GFS2's fallocate operation tried to
acquire an "allocation" it made sure the rindex was up to date, and if not,
it called gfs2_rindex_update. However, if the file being fallocated was
the rindex itself, it was already locked at that point. By calling
gfs2_rindex_update at an earlier point in time, we bring rindex up to date
and thereby avoid trying to lock it when the "allocation" is acquired.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 09:48:30 +00:00
Bob Peterson
718b97bd6b GFS2: Read in rindex if necessary during unlink
This patch fixes a problem whereby you were unable to delete
files until other file system operations were done (such as
statfs, touch, writes, etc.) that caused the rindex to be
read in.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 09:48:02 +00:00
Steven Whitehouse
4043b886b0 GFS2: Fix race between lru_list and glock ref count
This patch fixes a narrow race window between the glock ref count
hitting zero and glocks being removed from the lru_list.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28 09:43:07 +00:00
Stanislav Kinsbursky
e9dbca8d73 NFS: release per-net clients lock before calling PipeFS dentries creation
v3:
1) Lookup for client is performed from the beginning of the list on each PipeFS
event handling operation.

Lockdep is sad otherwise, because inode mutex is taken on PipeFS dentry
creation, which can be called on mount notification, where this per-net client
lock is taken on clients list walk.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-27 13:36:35 -05:00
Anton Altaparmakov
f621c53343 Merge branch 'master' of /Volumes/CaseSensitiveDisk/linux 2012-02-27 09:01:22 +00:00
Jeff Layton
5bccda0ebc cifs: fix dentry refcount leak when opening a FIFO on lookup
The cifs code will attempt to open files on lookup under certain
circumstances. What happens though if we find that the file we opened
was actually a FIFO or other special file?

Currently, the open filehandle just ends up being leaked leading to
a dentry refcount mismatch and oops on umount. Fix this by having the
code close the filehandle on the server if it turns out not to be a
regular file. While we're at it, change this spaghetti if statement
into a switch too.

Cc: stable@vger.kernel.org
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-26 23:16:26 -06:00
Pavel Shilovsky
6de2ce4231 CIFS: Fix mkdir/rmdir bug for the non-POSIX case
Currently we do inc/drop_nlink for a parent directory for every
mkdir/rmdir calls. That's wrong when Unix extensions are disabled
because in this case a server doesn't follow the same semantic and
returns the old value on the next QueryInfo request. As the result,
we update our value with the server one and then decrement it on
every rmdir call - go to negative nlink values.

Fix this by removing inc/drop_nlink for the parent directory from
mkdir/rmdir, setting it for a revalidation and ignoring NumberOfLinks
for directories when Unix extensions are disabled.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-26 22:59:43 -06:00
Benjamin Herrenschmidt
f851013cb2 Merge remote-tracking branch 'origin/master' into next 2012-02-27 10:50:11 +11:00
Trond Myklebust
7df529af5f NFSv4.1: Don't call nfs4_deviceid_purge_client() unless we're NFSv4.1
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-26 17:34:22 -05:00
Ian Kent
a32744d4ab autofs: work around unhappy compat problem on x86-64
When the autofs protocol version 5 packet type was added in commit
5c0a32fc2c ("autofs4: add new packet type for v5 communications"), it
obvously tried quite hard to be word-size agnostic, and uses explicitly
sized fields that are all correctly aligned.

However, with the final "char name[NAME_MAX+1]" array at the end, the
actual size of the structure ends up being not very well defined:
because the struct isn't marked 'packed', doing a "sizeof()" on it will
align the size of the struct up to the biggest alignment of the members
it has.

And despite all the members being the same, the alignment of them is
different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
alignment on x86-64.  And while 'NAME_MAX+1' ends up being a nice round
number (256), the name[] array starts out a 4-byte aligned.

End result: the "packed" size of the structure is 300 bytes: 4-byte, but
not 8-byte aligned.

As a result, despite all the fields being in the same place on all
architectures, sizeof() will round up that size to 304 bytes on
architectures that have 8-byte alignment for u64.

Note that this is *not* a problem for 32-bit compat mode on POWER, since
there __u64 is 8-byte aligned even in 32-bit mode.  But on x86, 32-bit
and 64-bit alignment is different for 64-bit entities, and as a result
the structure that has exactly the same layout has different sizes.

So on x86-64, but no other architecture, we will just subtract 4 from
the size of the structure when running in a compat task.  That way we
will write the properly sized packet that user mode expects.

Not pretty.  Sadly, this very subtle, and unnecessary, size difference
has been encoded in user space that wants to read packets of *exactly*
the right size, and will refuse to touch anything else.

Reported-and-tested-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-25 12:10:27 -08:00
Alex Elder
ad637a10f4 xfs: only take the ILOCK in xfs_reclaim_inode()
At the end of xfs_reclaim_inode(), the inode is locked in order to
we wait for a possible concurrent lookup to complete before the
inode is freed.  This synchronization step was taking both the ILOCK
and the IOLOCK, but the latter was causing lockdep to produce
reports of the possibility of deadlock.

It turns out that there's no need to acquire the IOLOCK at this
point anyway.  It may have been required in some earlier version of
the code, but there should be no need to take the IOLOCK in
xfs_iget(), so there's no (longer) any need to get it here for
synchronization.  Add an assertion in xfs_iget() as a reminder
of this assumption.

Dave Chinner diagnosed this on IRC, and Christoph Hellwig suggested
no longer including the IOLOCK.  I just put together the patch.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-25 13:55:49 -06:00
Masami Ichikawa
93518dd2eb sysfs: Fix memory leak in sysfs_sd_setsecdata().
This patch fixies follwing two memory leak patterns that reported by kmemleak.
sysfs_sd_setsecdata() is called during sys_lsetxattr() operation.
It checks sd->s_iattr is NULL or not. Then if it is NULL, it calls
sysfs_init_inode_attrs() to allocate memory.
That code is this.

iattrs = sd->s_iattr;
if (!iattrs)
                iattrs = sysfs_init_inode_attrs(sd);

The iattrs recieves sysfs_init_inode_attrs()'s result,  but sd->s_iattr
doesn't know the address. so it needs to set correct address to
sd->s_iattr to free memory in other function.

unreferenced object 0xffff880250b73e60 (size 32):
  comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
  hex dump (first 32 bytes):
    73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f  system_u:object_
    72 3a 73 79 73 66 73 5f 74 3a 73 30 00 00 00 00  r:sysfs_t:s0....
  backtrace:
    [<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98
    [<ffffffff811270ab>] __kmalloc+0x100/0x12c
    [<ffffffff8120775a>] context_struct_to_string+0x106/0x210
    [<ffffffff81207cc1>] security_sid_to_context_core+0x10b/0x129
    [<ffffffff812090ef>] security_sid_to_context+0x10/0x12
    [<ffffffff811fb0da>] selinux_inode_getsecurity+0x7d/0xa8
    [<ffffffff811fb127>] selinux_inode_getsecctx+0x22/0x2e
    [<ffffffff811f4d62>] security_inode_getsecctx+0x16/0x18
    [<ffffffff81191dad>] sysfs_setxattr+0x96/0x117
    [<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9
    [<ffffffff811543d9>] vfs_setxattr+0x83/0xa1
    [<ffffffff811544c6>] setxattr+0xcf/0x101
    [<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f
    [<ffffffff814efda9>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88024163c5a0 (size 96):
  comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
  hex dump (first 32 bytes):
    00 00 00 00 ed 41 00 00 00 00 00 00 00 00 00 00  .....A..........
    00 00 00 00 00 00 00 00 0c 64 42 4f 00 00 00 00  .........dBO....
  backtrace:
    [<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98
    [<ffffffff81127402>] kmem_cache_alloc_trace+0xc4/0xee
    [<ffffffff81191cbe>] sysfs_init_inode_attrs+0x2a/0x83
    [<ffffffff81191dd6>] sysfs_setxattr+0xbf/0x117
    [<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9
    [<ffffffff811543d9>] vfs_setxattr+0x83/0xa1
    [<ffffffff811544c6>] setxattr+0xcf/0x101
    [<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f
    [<ffffffff814efda9>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
`

Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-24 14:25:49 -08:00
Oleg Nesterov
971316f050 epoll: ep_unregister_pollwait() can use the freed pwq->whead
signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL under
rcu_read_lock() before remove_wait_queue(). We add the new helper,
ep_remove_wait_queue(), for this.

This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
ep_unregister_pollwait()->remove_wait_queue() can play with already
freed and potentially reused ->sighand, but this is fine. This memory
must have the valid ->signalfd_wqh until rcu_read_unlock().

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24 11:42:50 -08:00
Oleg Nesterov
d80e731eca epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

	- we do not have wake_up_all_poll() but wake_up_poll()
	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
	  we need a couple of simple changes in eventpoll.c to
	  make sure it can't be "lost".

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24 11:42:50 -08:00
Linus Torvalds
855a85f704 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Quoth Chris:
 "This is later than I wanted because I got backed up running through
  btrfs bugs from the Oracle QA teams.  But they are all bug fixes that
  we've queued and tested since rc1.

  Nothing in particular stands out, this just reflects bug fixing and QA
  done in parallel by all the btrfs developers.  The most user visible
  of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

  Because that helps deal with out of date drives (say an iscsi disk
  that has gone away and come back).  The old code wasn't always
  properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix compiler warnings on 32 bit systems
  Btrfs: increase the global block reserve estimates
  Btrfs: clear the extent uptodate bits during parent transid failures
  Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
  Btrfs: make sure we update latest_bdev
  Btrfs: improve error handling for btrfs_insert_dir_item callers
  Btrfs: be less strict on finding next node in clear_extent_bit
  Btrfs: fix a bug on overcommit stuff
  Btrfs: kick out redundant stuff in convert_extent_bit
  Btrfs: skip states when they does not contain bits to clear
  Btrfs: check return value of lookup_extent_mapping() correctly
  Btrfs: fix deadlock on page lock when doing auto-defragment
  Btrfs: fix return value check of extent_io_ops
  btrfs: honor umask when creating subvol root
  btrfs: silence warning in raid array setup
  btrfs: fix structs where bitfields and spinlock/atomic share 8B word
  btrfs: delalloc for page dirtied out-of-band in fixup worker
  Btrfs: fix memory leak in load_free_space_cache()
  btrfs: don't check DUP chunks twice
  Btrfs: fix trim 0 bytes after a device delete
  ...
2012-02-24 09:02:53 -08:00
Chris Mason
e77266e4c4 Btrfs: fix compiler warnings on 32 bit systems
The enospc tracing code added some interesting uses of
u64 pointer casts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-24 10:39:05 -05:00
Anton Altaparmakov
0afa1b62e3 Merge branch 'master' of /Volumes/CaseSensitiveDisk/linux 2012-02-24 09:39:16 +00:00
Anton Altaparmakov
9b556248ec NTFS: Correct two spelling errors "dealocate" to "deallocate" in mft.c.
From: Masanari Iida <standby24x7@gmail.com>

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-24 09:17:09 +00:00
Anton Altaparmakov
37fbf4bfb8 Restore direct_io / truncate locking API
With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
calls (namely inode_dio_wait() and inode_dio_done()) which are
EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
further inode_dio_wait() was pushed from notify_change() into the file
system ->setattr() method but no non-GPL file system can make this call.

That means non-GPL file systems cannot exist any more unless they do not
use any VFS functionality related to reading/writing as far as I can
tell or at least as long as they want to implement direct i/o.

Both Linus and Al (and others) have said on LKML that this breakage of
the VFS API should not have happened and that the change was simply
missed as it was not documented in the change logs of the patches that
did those changes.

This patch changes the two function exports in question to be
EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
for all modules.

Christoph, who introduced the two functions and exported them GPL-only
is CC-ed on this patch to give him the opportunity to object to the
symbols being changed in this manner if he did indeed intend them to be
GPL-only and does not want them to become available to all modules.

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-23 15:56:21 -08:00
Linus Torvalds
bb4c7e9a99 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
A fix from Jesper Juhl removes an assignment in an ASSERT when a compare
is intended.  Two fixes from Mitsuo Hayasaka address off-by-ones in XFS
quota enforcement.

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: make inode quota check more general
  xfs: change available ranges of softlimit and hardlimit in quota check
  XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended
2012-02-23 15:38:57 -08:00
Liu Bo
5500cdbe14 Btrfs: increase the global block reserve estimates
When doing IO with large amounts of data fragmentation, the global block
reserve calulations are too low.  This increases them to avoid
ENOSPC crashes.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:49:04 -05:00
Chris Mason
5065319052 Btrfs: clear the extent uptodate bits during parent transid failures
If btrfs reads a block and finds a parent transid mismatch, it clears
the uptodate flags on the extent buffer, and the pages inside it.  But
we only clear the uptodate bits in the state tree if the block straddles
more than one page.

This is from an old optimization from to reduce contention on the extent
state tree.  But it is buggy because the code that retries a read from
a different copy of the block is going to find the uptodate state bits
set and skip the IO.

The end result of the bug is that we'll never actually read the good
copy (if there is one).

The fix here is to always clear the uptodate state bits, which is safe
because this code is only called when the parent transid fails.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Chris Mason
16780cabb8 Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Chris Mason
a6b0d5c8db Btrfs: make sure we update latest_bdev
When we are setting up the mount, we close all the
devices that were not actually part of the metadata we found.

But, we don't make sure that one of those devices wasn't
fs_devices->latest_bdev, which means we can do a use after free
on the one we closed.

This updates latest_bdev as it goes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Chris Mason
fe66a05a06 Btrfs: improve error handling for btrfs_insert_dir_item callers
This allows us to gracefully continue if we aren't able to insert
directory items, both for normal files/dirs and snapshots.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
David Smith
4ff16c25e2 tracepoint, vfs, sched: Add exec() tracepoint
Added a minimal exec tracepoint. Exec is an important major event
in the life of a task, like fork(), clone() or exit(), all of
which we already trace.

[ We also do scheduling re-balancing during exec() - so it's useful
  from a scheduler instrumentation POV as well. ]

If you want to watch a task start up, when it gets exec'ed is a good place
to start.  With the addition of this tracepoint, exec's can be monitored
and better picture of general system activity can be obtained. This
tracepoint will also enable better process life tracking, allowing you to
answer questions like "what process keeps starting up binary X?".

This tracepoint can also be useful in ftrace filtering and trigger
conditions: i.e. starting or stopping filtering when exec is called.

Signed-off-by: David Smith <dsmith@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F314D19.7030504@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-23 09:28:06 +01:00
Christoph Hellwig
9006fb91cf xfs: split and cleanup xfs_log_reserve
Split the log regrant case out of xfs_log_reserve into a separate function,
and merge xlog_grant_log_space and xlog_regrant_write_log_space into their
respective callers.  Also replace the XFS_LOG_PERM_RESERV flag, which easily
got misused before the previous cleanups with a simple boolean parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:37:04 -06:00
Christoph Hellwig
42ceedb3ca xfs: share code for grant head availability checks
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:34:03 -06:00
Christoph Hellwig
e179840d74 xfs: share code for grant head wakeups
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:31:45 -06:00
Christoph Hellwig
23ee3df349 xfs: share code for grant head waiting
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:29:39 -06:00
Christoph Hellwig
a79bf2d75b xfs: add xlog_grant_head_wake_all
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:26:47 -06:00
Christoph Hellwig
c303c5b8c3 xfs: add xlog_grant_head_init
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:21:39 -06:00
Christoph Hellwig
28496968a6 xfs: add the xlog_grant_head structure
Add a new data structure to allow sharing code between the log grant and
regrant code.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:19:53 -06:00
Christoph Hellwig
14a7235fba xfs: remove log space waitqueues
The tic->t_wait waitqueues can never have more than a single waiter
on them, so we can easily replace them with a task_struct pointer
and wake_up_process.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:17:00 -06:00
Christoph Hellwig
cfb7cdca0a xfs: cleanup xfs_log_space_wake
Remove the now unused opportunistic parameter, and use the the
xlog_writeq_wake and xlog_reserveq_wake helpers now that we don't have
to care about the opportunistic wakeups.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:17:00 -06:00
Christoph Hellwig
5b03ff1b24 xfs: remove xfs_trans_unlocked_item
There is no reason to wake up log space waiters when unlocking inodes or
dquots, and the commit log has no explanation for this function either.

Given that we now have exact log space wakeups everywhere we can assume
the reason for this function was to paper over log space races in earlier
XFS versions.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:17:00 -06:00
Christoph Hellwig
3af1de753b xfs: do exact log space wakeups in xlog_ungrant_log_space
The only reason that xfs_log_space_wake had to do opportunistic wakeups
was that the old xfs_log_move_tail calling convention didn't allow for
exact wakeups when not updating the log tail LSN.  Since this issue has
been fixed we can do exact wakeups now.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:17:00 -06:00
Christoph Hellwig
09a423a3d6 xfs: split tail_lsn assignments from log space wakeups
Currently xfs_log_move_tail has a tail_lsn argument that is horribly
overloaded: it may contain either an actual lsn to assign to the log tail,
0 as a special case to use the last sync LSN, or 1 to indicate that no tail
LSN assignment should be performed, and we should opportunisticly wake up
at one task waiting for log space even if we did not move the LSN.

Remove the tail lsn assigned from xfs_log_move_tail and make the two callers
use xlog_assign_tail_lsn instead of the current variant of partially using
the code in xfs_log_move_tail and partially opencoding it.  Note that means
we grow an addition lock roundtrip on the AIL lock for each bulk update
or delete, which is still far less than what we had before introducing the
bulk operations.  If this proves to be a problem we can still add a variant
of xlog_assign_tail_lsn that expects the lock to be held already.

Also rename the remainder of xfs_log_move_tail to xfs_log_space_wake as
that name describes its functionality much better.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 22:17:00 -06:00
Mitsuo Hayasaka
70b5437653 xfs: cleanup quota check on disk blocks and inodes reservations
This patch is a cleanup of quota check on disk blocks and inodes
reservations, and changes it as follows.

(1) add a total_count variable to store the total number of
    current usages and new reservations for disk blocks and inodes,
    respectively.

(2) make it more readable to check if the local variables softlimit
    and hardlimit are positive. It has been changed as follows.
	    if (softlimit > 0ULL) -> if (softlimit)
	    if (hardlimit > 0ULL) -> if (hardlimit)
    This is because they are defined as xfs_qcnt_t which is unsigned.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-22 21:47:52 -06:00
Mahesh Salgaonkar
1625739376 fadump: Introduce cleanup routine to invalidate /proc/vmcore.
With the firmware-assisted dump support we don't require a reboot when we
are in second kernel after crash. The second kernel after crash is a normal
kernel boot and has knowledge about entire system RAM with the page tables
initialized for entire system RAM. Hence once the dump is saved to disk, we
can just release the reserved memory area for general use and continue
with second kernel as production kernel.

Hence when we release the reserved memory that contains dump data, the
'/proc/vmcore' will not be valid anymore. Hence this patch introduces
a cleanup routine that invalidates and removes the /proc/vmcore file. This
routine will be invoked before we release the reserved dump memory area.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-02-23 10:50:02 +11:00
Linus Torvalds
437cf4c7b7 Bugfixes for the NFS client.
Fix a nasty Oops in the NFSv4 getacl code, another source of infinite loops
 in the NFSv4 state recovery code, and a regression in NFSv4.1 session
 initialisation.
 Also deal with an NFSv4.1 memory leak.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJPRLALAAoJEGcL54qWCgDy5yoP/0eKjYERpqf00ETRnmQw6ngt
 PCaC33mUwfsJlLdSfW6hhG+2IhKkiWR4mraCo1Es9CYS7eSsPU9/djK62neHZG3s
 lJF2xcPlfvbiYtbyG2MsmtRffUl/XRbLLMfZMADgG4iQy1y4uTDxsaxcKrBIu5ig
 mQWlEpsYeE9TLea3Qvw6oHiXMKkrwdeR9Et5aGo3Y5hxbpDhD86yR1cyw2ds2aUP
 FmXk/ERqJDJpEt+uEQKFsMDzjcZ27r/nca/AhwE+cVQku6Fi9QdnFcdNtQmwFYq6
 iwYS/grUoW0tLV7Fv/iYNvZmVtXtS2Ng1VTR77gcLRXps0naEciuM5PO7MmuUe/j
 0aS/fV1s4/ZloRCu6l/gj2JmOAkh0joy6msQTcdEt4LQcvQSxtUmLQtWyLoGm6aa
 5LwTOFg25qRH6c7mJglSsiPdjgAcOIdwzH1gikSzSJeV2NjroZZ4nPW/noz0Idfq
 l36B9T3iHc0jZa6wr/Q/S1qF6RdfLYfuXcrDKcxJFDpuYNRNDN1jjOq+OKs5yGw9
 Fgui6r9/g0uDsEKWtqR30NzUnUx4PxY0hxduT04T2sxMMapU3oXcXWW4vRgly9be
 ASx/hAEnBovNwzHeKo9wVt3WdHqP2M4wMx8dD8hFLL1qPx9uj90IeclOVneNEHKa
 UMtC3BH1DBF44FYhZgT4
 =yq1J
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.3-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Bugfixes for the NFS client.

Fix a nasty Oops in the NFSv4 getacl code, another source of infinite
loops in the NFSv4 state recovery code, and a regression in NFSv4.1
session initialisation.

Also deal with an NFSv4.1 memory leak.

* tag 'nfs-for-3.3-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: fix server_scope memory leak
  NFSv4.1: Fix a NFSv4.1 session initialisation regression
  NFSv4: Ensure we throw out bad delegation stateids on NFS4ERR_BAD_STATEID
  NFSv4: Fix an Oops in the NFSv4 getacl code
2012-02-22 08:43:35 -08:00
Anton Altaparmakov
45d95bcd7a NTFS: Do not dereference pointer before checking for NULL.
Found by Coverity software (http://scan.coverity.com).

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-22 11:15:43 +00:00
Anton Altaparmakov
0e37579506 NTFS: Remove unused variable.
Found by Coverity software (http://scan.coverity.com).

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-22 10:49:58 +00:00
Linus Torvalds
faf309009e sys_poll: fix incorrect type for 'timeout' parameter
The 'poll()' system call timeout parameter is supposed to be 'int', not
'long'.

Now, the reason this matters is that right now 32-bit compat mode is
broken on at least x86-64, because the 32-bit code just calls
'sys_poll()' directly on x86-64, and the 32-bit argument will have been
zero-extended, turning a signed 'int' into a large unsigned 'long'
value.

We could just introduce a 'compat_sys_poll()' function for this, and
that may eventually be what we have to do, but since the actual standard
poll() semantics is *supposed* to be 'int', and since at least on x86-64
glibc sign-extends the argument before invocing the system call (so
nobody can actually use a 64-bit timeout value in user space _anyway_,
even in 64-bit binaries), the simpler solution would seem to be to just
fix the definition of the system call to match what it should have been
from the very start.

If it turns out that somebody somehow circumvents the user-level libc
64-bit sign extension and actually uses a large unsigned 64-bit timeout
despite that not being how poll() is supposed to work, we will need to
do the compat_sys_poll() approach.

Reported-by: Thomas Meyer <thomas@m3y3r.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-21 17:24:20 -08:00
Mitsuo Hayasaka
33e0edafd7 xfs: make inode quota check more general
The xfs checks quota when reserving disk blocks and inodes. In the block
reservation, it checks if the total number of blocks including current
usage and new reservation exceed quota. In the inode reservation,
it checks using the total number of inodes including only current usage
without new reservation. However, this inode quota check works well
since the caller of xfs_trans_dquot() always sets the argument of the
number of new inode reservation to 1 or 0 and inode is reserved one by
one in current xfs.

To make it more general, this patch changes it to the same way as the
block quota check.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit c922bbc819)
2012-02-21 10:13:59 -06:00
Mitsuo Hayasaka
d0a3fe67e3 xfs: change available ranges of softlimit and hardlimit in quota check
In general, quota allows us to use disk blocks and inodes up to each
limit, that is, they are available if they don't exceed their limitations.
Current xfs sets their available ranges to lower than them except disk
inode quota check. So, this patch changes the ranges to not beyond them.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 20f12d8ac0)
2012-02-21 10:13:49 -06:00
Mitsuo Hayasaka
c922bbc819 xfs: make inode quota check more general
The xfs checks quota when reserving disk blocks and inodes. In the block
reservation, it checks if the total number of blocks including current
usage and new reservation exceed quota. In the inode reservation,
it checks using the total number of inodes including only current usage
without new reservation. However, this inode quota check works well
since the caller of xfs_trans_dquot() always sets the argument of the
number of new inode reservation to 1 or 0 and inode is reserved one by
one in current xfs.

To make it more general, this patch changes it to the same way as the
block quota check.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-21 10:12:43 -06:00
Mitsuo Hayasaka
20f12d8ac0 xfs: change available ranges of softlimit and hardlimit in quota check
In general, quota allows us to use disk blocks and inodes up to each
limit, that is, they are available if they don't exceed their limitations.
Current xfs sets their available ranges to lower than them except disk
inode quota check. So, this patch changes the ranges to not beyond them.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-21 10:12:43 -06:00
Liu Bo
692e5759a4 Btrfs: be less strict on finding next node in clear_extent_bit
In clear_extent_bit, it is enough that next node is adjacent in tree level.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-21 16:02:10 +01:00
Justin P. Mattock
a80581d0d1 Typos: change aditional to additional.
The below patch fixes some typos "aditional" to "additional", and also fixes
a comment with another word mispelled.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-21 11:40:36 +01:00
Masanari Iida
0cc785ecbf cramfs: Fix typo in inode.c
Correct spelling "endianess" to "endianness" in
fs/cramfs/inode.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-21 11:40:35 +01:00
Linus Torvalds
8ebbfb4957 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Assorted fixes, sat in -next for a week or so...

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
  vfs: fix compat_sys_stat() handling of overflows in st_nlink
  quota: Fix deadlock with suspend and quotas
  vfs: Provide function to get superblock and wait for it to thaw
  vfs: fix panic in __d_lookup() with high dentry hashtable counts
  autofs4 - fix lockdep splat in autofs
  vfs: fix d_inode_lookup() dentry ref leak
2012-02-20 16:13:58 -08:00
Trond Myklebust
abd9669861 NFS: Ensure struct nfs_client holds a reference to the net namespace
Otherwise we have no guarantee that the net namespace won't just
disappear from underneath us once the task that created it
is destroyed.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
2012-02-19 08:46:49 +01:00
Trond Myklebust
9937347a1e NFS: Ensure that the nfs_client 'net' field is always set
Currently, the nfs_parsed_mount_data->net field is initialised in
the nfs_parse_mount_options() function, which means that it only
gets set if we're using text based mounts. The legacy binary
mount interface is therefore broken.

Fix is to initialise the ->net field in nfs_alloc_parsed_mount_data.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
2012-02-19 08:44:07 +01:00
Weston Andros Adamson
abe9a6d57b NFSv4: fix server_scope memory leak
server_scope would never be freed if nfs4_check_cl_exchange_flags() returned
non-zero

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-17 17:34:03 -05:00
Trond Myklebust
f86f36a6ae NFSv4.1: Fix a NFSv4.1 session initialisation regression
Commit aacd553 (NFSv4.1: cleanup init and reset of session slot tables)
introduces a regression in the session initialisation code. New tables
now find their sequence ids initialised to 0, rather than the mandated
value of 1 (see RFC5661).

Fix the problem by merging nfs4_reset_slot_table() and nfs4_init_slot_table().
Since the tbl->max_slots is initialised to 0, the test in
nfs4_reset_slot_table for max_reqs != tbl->max_slots will automatically
pass for an empty table.

Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-17 17:33:39 -05:00
Weston Andros Adamson
0a70219523 NFS: include filelayout DS rpc stats in mountstats
Include RPC statistics from all data servers in /proc/self/mountstats for pNFS
filelayout mounts.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-17 13:39:47 -05:00
Andy Adamson
b6bf6e7d6f NFSv4.1 set highest_used_slotid to NFS4_NO_SLOT
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-17 13:38:03 -05:00
Cong Wang
465c9343c5 ecryptfs: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-02-16 16:06:27 -06:00
Tyler Hicks
545d680938 eCryptfs: Copy up lower inode attrs after setting lower xattr
After passing through a ->setxattr() call, eCryptfs needs to copy the
inode attributes from the lower inode to the eCryptfs inode, as they
may have changed in the lower filesystem's ->setxattr() path.

One example is if an extended attribute containing a POSIX Access
Control List is being set. The new ACL may cause the lower filesystem to
modify the mode of the lower inode and the eCryptfs inode would need to
be updated to reflect the new mode.

https://launchpad.net/bugs/926292

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Sebastien Bacher <seb128@ubuntu.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: <stable@vger.kernel.org>
2012-02-16 16:06:27 -06:00
Tyler Hicks
4a26620df4 eCryptfs: Improve statfs reporting
statfs() calls on eCryptfs files returned the wrong filesystem type and,
when using filename encryption, the wrong maximum filename length.

If mount-wide filename encryption is enabled, the cipher block size and
the lower filesystem's max filename length will determine the max
eCryptfs filename length. Pre-tested, known good lengths are used when
the lower filesystem's namelen is 255 and a cipher with 8 or 16 byte
block sizes is used. In other, less common cases, we fall back to a safe
rounded-down estimate when determining the eCryptfs namelen.

https://launchpad.net/bugs/885744

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
2012-02-16 16:06:21 -06:00
Chuck Lever
d7c3267502 nfs: Clean up debugging in nfs_follow_mountpoint()
Clean up: Fix a debugging message which had an obsolete function name
in it (nfs_follow_mountpoint).

Introduced by commit 36d43a43 "NFS: Use d_automount() rather than
abusing follow_link()" (January 14, 2011)

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-16 15:05:16 -05:00
Liu Bo
d9b0218f6c Btrfs: fix a bug on overcommit stuff
When overcommitting, we should check the sum of pinned space and
bytes for delayed item.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16 17:23:18 +01:00
Liu Bo
9d47c7671d Btrfs: kick out redundant stuff in convert_extent_bit
clear_state_bit will do merge_state for us, so kick out the redundant one.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16 17:23:17 +01:00
Liu Bo
0449314a9c Btrfs: skip states when they does not contain bits to clear
Clearing a range's bits is different with setting them, since we don't
need to touch them when states do not contain bits we want.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16 17:23:17 +01:00
Tsutomu Itoh
285190d99f Btrfs: check return value of lookup_extent_mapping() correctly
This patch corrects error checking of lookup_extent_mapping().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16 17:23:17 +01:00
Miao Xie
600a45e1d5 Btrfs: fix deadlock on page lock when doing auto-defragment
When I ran xfstests circularly on a auto-defragment btrfs, the deadlock
happened.

Steps to reproduce:
[tty0]
 # export MOUNT_OPTIONS="-o autodefrag"
 # export TEST_DEV=<partition1>
 # export TEST_DIR=<mountpoint1>
 # export SCRATCH_DEV=<partition2>
 # export SCRATCH_MNT=<mountpoint2>
 # while [ 1 ]
 > do
 > ./check 091 127 263
 > sleep 1
 > done
[tty1]
 # while [ 1 ]
 > do
 > echo 3 > /proc/sys/vm/drop_caches
 > done

Several hours later, the test processes will hang on, and the deadlock will
happen on page lock.

The reason is that:
  Auto defrag task		Flush thread			Test task
				btrfs_writepages()
				  add ordered extent
				  (including page 1, 2)
				  set page 1 writeback
				  set page 2 writeback
				endio_fn()
				  end page 2 writeback
								release page 2
lock page 1
alloc and lock page 2
page 2 is not uptodate
  btrfs_readpage()
    start ordered extent()
    btrfs_writepages()
      try  to lock page 1

so deadlock happens.

Fix this bug by unlocking the page which is in writeback, and re-locking it
after the writeback end.

Signed-off-by: Miao Xie <miax@cn.fujitsu.com>
2012-02-16 17:23:16 +01:00
Tsutomu Itoh
013bd4c336 Btrfs: fix return value check of extent_io_ops
This patch adds the check on the return value of extent_io_ops.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16 17:23:16 +01:00
Florian Albrechtskirchinger
12fc9d0923 btrfs: honor umask when creating subvol root
Set the subvol root inode permissions based on the current umask.
2012-02-16 16:35:41 +01:00
Vitaliy Gusev
b4b9a0c1c8 nfs41: Verify channel's attributes accordingly to RFC v2
ca_maxoperations:

      For the backchannel, the server MUST
      NOT change the value the client offers.  For the fore channel,
      the server MAY change the requested value.

  ca_maxrequests:

       For the backchannel, the server MUST NOT change the
       value the client offers.  For the fore channel, the server MAY
       change the requested value.

Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 11:16:11 -05:00
David Sterba
8a33442694 btrfs: silence warning in raid array setup
Raid array setup code creates an extent buffer in an usual way. When the
PAGE_CACHE_SIZE is > super block size, the extent pages are not marked
up-to-date, which triggers a WARN_ON in the following
write_extent_buffer call. Add an explicit up-to-date call to silence the
warning.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-02-15 16:40:25 +01:00
David Sterba
c08782dacd btrfs: fix structs where bitfields and spinlock/atomic share 8B word
On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has
disaterous effect.

https://lkml.org/lkml/2012/2/1/220

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-02-15 16:40:25 +01:00
Jeff Mahoney
87826df0ec btrfs: delalloc for page dirtied out-of-band in fixup worker
We encountered an issue that was easily observable on s/390 systems but
 could really happen anywhere. The timing just seemed to hit reliably
 on s/390 with limited memory.

 The gist is that when an unexpected set_page_dirty() happened, we'd
 run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
 properly set up for delalloc.

 This patch does the following:
 - Performs the missing delalloc in the fixup worker
 - Allow the start hook to return -EBUSY which informs __extent_writepage
   that it should mark the page skipped and not to redirty it. This is
   required since the fixup worker can fail with -ENOSPC and the page
   will have already been redirtied. That causes an Oops in
   drop_outstanding_extents later. Retrying the fixup worker could
   lead to an infinite loop. Deferring the page redirty also saves us
   some cycles since the page would be stuck in a resubmit-redirty loop
   until the fixup worker completes. It's not harmful, just wasteful.
 - If the fixup worker fails, we mark the page and mapping as errored,
   and end the writeback, similar to what we would do had the page
   actually been submitted to writeback.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-02-15 16:40:25 +01:00
Tsutomu Itoh
a7e221e900 Btrfs: fix memory leak in load_free_space_cache()
load_free_space_cache() has forgotten to free path.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-15 16:40:24 +01:00
Arne Jansen
859acaf1a2 btrfs: don't check DUP chunks twice
Because scrub enumerates the dev extent tree to find the chunks to scrub,
it currently finds each DUP chunk twice and also scrubs it twice. This
patch makes sure that scrub_chunk only checks that part of the chunk the
dev extent has been found for. This only changes the behaviour for DUP
chunks.

Reported-and-tested-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-02-15 16:40:24 +01:00
Liu Bo
2cac13e41b Btrfs: fix trim 0 bytes after a device delete
A user reported a bug of btrfs's trim, that is we will trim 0 bytes
after a device delete.

The reproducer:

$ mkfs.btrfs disk1
$ mkfs.btrfs disk2
$ mount disk1 /mnt
$ fstrim -v /mnt
$ btrfs device add disk2 /mnt
$ btrfs device del disk1 /mnt
$ fstrim -v /mnt

This is because after we delete the device, the block group may start from
a non-zero place, which will confuse trim to discard nothing.

Reported-by: Lutz Euler <lutz.euler@freenet.de>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-15 16:40:23 +01:00
Jeff Liu
6af021d8fc Btrfs: return the internal error unchanged if btrfs_get_extent_fiemap() call failed for SEEK_DATA/SEEK_HOLE inquiry
Given that ENXIO only means "offset beyond EOF" for either SEEK_DATA or SEEK_HOLE inquiry
in a desired file range, so we should return the internal error unchanged if btrfs_get_extent_fiemap()
call failed, rather than ENXIO.

Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2012-02-15 16:40:23 +01:00
Jan Schmidt
8f24b49688 Btrfs: avoid positive number with ERR_PTR
inode_ref_info() returns 1 when the element wasn't found and < 0 on error,
just like btrfs_search_slot(). In iref_to_path() it's an error when the
inode ref can't be found, thus we return ERR_PTR(ret) in that case. In order
to avoid ERR_PTR(1), we now set ret to -ENOENT in that case.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-02-15 16:40:23 +01:00
Keith Mannthey
941b2ddf71 btrfs: Sector Size check during Mount
Gracefully fail when trying to mount a BTRFS file system that has a
sectorsize smaller than PAGE_SIZE.

On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
then boot into a 64K PAGE_SIZE kernel.  Presently open_ctree fails in an
endless loop and hangs the machine in this situation.

My debugging has show this Sector size < Page size to be a non trivial
situation and a graceful exit from the situation would be nice for the
time being.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
2012-02-15 16:40:22 +01:00
Weston Andros Adamson
571b755401 NFS: dont allow minorversion= opt when vers != 4
Don't allow invalid 'vers' and 'minorversion' combinations in mount options,
such as "vers=3,minorversion=1".

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:52 -05:00
Trond Myklebust
685f50f918 NFSv4: Further reduce the footprint of the idmapper
Don't allocate the legacy idmapper tables until we actually need
them.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-02-15 00:19:51 -05:00
Trond Myklebust
e3da87066f NFSv4: The idmapper now depends on keyring functionality
Add the appropriate 'select KEYS' to the NFSv4 Kconfig entry.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:51 -05:00
Trond Myklebust
d073e9b541 NFSv4: Reduce the footprint of the idmapper
Instead of pre-allocating the storage for all the strings, we can
significantly reduce the size of that table by doing the allocation
when we do the downcall.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-02-15 00:19:50 -05:00
Weston Andros Adamson
7ced286e0a NFS: add mount options 'v4.0' and 'v4.1'
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:50 -05:00
Stanislav Kinsbursky
b6d1e83b4e NFS: fix nfs4_find_client_sessionid() arguments list
It's not compilable in case of CONFIG_NFS_V4_1 is not set.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:49 -05:00
Trond Myklebust
4c03ae4a89 NFS: Initialise the nfs_net->nfs_client_lock
Ensure that we initialise the nfs_net->nfs_client_lock spinlock.
Also ensure that nfs_server_remove_lists() doesn't try to
dereference server->nfs_client before that is initialised.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
2012-02-15 00:19:49 -05:00
Stanislav Kinsbursky
3b64739fb9 Lockd: shutdown NLM hosts in network namespace context
Lockd now managed in network namespace context. And this patch introduces
network namespace related NLM hosts shutdown in case of releasing per-net Lockd
resources.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:48 -05:00
Stanislav Kinsbursky
0e1cb5c0aa LockD: make NSM network namespace aware
NLM host is network namespace aware now.
So NSM have to take it into account.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:48 -05:00
Stanislav Kinsbursky
66697bfd6a LockD: make nlm hosts network namespace aware
This object depends on RPC client, and thus on network namespace.
So let's make it's allocation and lookup in network namespace context.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:48 -05:00
Stanislav Kinsbursky
bb2224df5f Lockd: per-net up and down routines introduced
This patch introduces per-net Lockd initialization and destruction routines.
The logic is the same as in global Lockd up and down routines. Probably the
solution is not the best one. But at least it looks clear.
So per-net "up" routine are called only in case of lockd is running already. If
per-net resources are not allocated yet, then service is being registered with
local portmapper and lockd sockets created.
Per-net "down" routine is called on every lockd_down() call in case of global
users counter is not zero.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:47 -05:00
Stanislav Kinsbursky
a9c5d73a8d Lockd: pernet usage counter introduced
Lockd is going to be shared between network namespaces - i.e. going to be able
to handle lock requests from different network namespaces. This means, that
network namespace related resources have to be allocated not once (like now),
but for every network namespace context, from which service is requested to
operate.
This patch implements Lockd per-net users accounting. New per-net counter is
used to determine, when per-net resources have to be freed.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:47 -05:00
Stanislav Kinsbursky
c228fa2038 Lockd: create permanent lockd sockets in current network namespace
This patch parametrizes Lockd permanent sockets creation routine by network
namespace context.
It also replaces hard-coded init_net with current network namespace context in
Lockd sockets creation routines.
This approach looks safe, because Lockd is created during NFS mount (or NFS
server start) and thus socket is required exactly in current network namespace
context. But in the same time it means, that Lockd sockets inherits first Lockd
requester network namespace. This issue will be fixed in further patches of the
series.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:46 -05:00
Trond Myklebust
ef159e9177 NFSv4.1: Add a module parameter to set the number of session slots
Add the module parameter 'max_session_slots' to set the initial number
of slots that the NFSv4.1 client will attempt to negotiate with the
server.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:44 -05:00
Trond Myklebust
45d43c291e NFSv4.1: Convert slotid from u8 to u32
It is perfectly legal to negotiate up to 2^32-1 slots in the protocol,
and with 10GigE, we are already seeing that 255 slots is far too limiting.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-15 00:19:43 -05:00
Linus Torvalds
ce5afed937 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  cifs: don't return error from standard_receive3 after marking response malformed
  cifs: request oplock when doing open on lookup
  cifs: fix error handling when cifscreds key payload is an error
2012-02-13 20:34:44 -08:00
Al Viro
847c9db5cb ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
unfortunately, nlink_t may be smaller than 32 bits and ->i_nlink
on ocfs2 can grow up to 0xffffffff; storing it in nlink_t variable
will lose upper bits on such architectures.  Needs to be made u32,
until we get kernel-side nlink_t uniformly 32bit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:39 -05:00
Al Viro
fcf83067bf vfs: fix compat_sys_stat() handling of overflows in st_nlink
Massaged cp_compat_stat() into form closer to cp_new_stat(); the only
real issue had been in handling of st_nlink overflows - native 32bit
stat(2) returns -EOVERFLOW in such situations, compat one silently
loses upper bits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:39 -05:00
Jan Kara
dcdbed853d quota: Fix deadlock with suspend and quotas
This script causes a kernel deadlock:
set -e
DEVICE=/dev/vg1/linear
lvchange -ay $DEVICE
mkfs.ext3 $DEVICE
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotacheck -gu /mnt/test
umount /mnt/test
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotaon /mnt/test
dmsetup suspend $DEVICE
setquota -u root 1 2 3 4 /mnt/test &
sleep 1
dmsetup resume $DEVICE

setquota acquired semaphore s_umount for read and then tried to perform a
transaction (and waits because the device is suspended).  dmsetup resume tries
to acquire s_umount for write before resuming the device (and waits for
setquota).

Fix the deadlock by grabbing a thawed superblock for quota commands which need
it.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:39 -05:00
Jan Kara
6b6dc836a1 vfs: Provide function to get superblock and wait for it to thaw
In quota code we need to find a superblock corresponding to a device and wait
for superblock to be unfrozen. However this waiting has to happen without
s_umount semaphore because that is required for superblock to thaw. So provide
a function in VFS for this to keep dances with s_umount where they belong.

[AV: implementation switched to saner variant]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:38 -05:00
Dimitri Sivanich
074b85175a vfs: fix panic in __d_lookup() with high dentry hashtable counts
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup().  Fix this in dcache_init() and similar areas.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:38 -05:00
Steven Rostedt
1d6f209786 autofs4 - fix lockdep splat in autofs
When recursing down the locks when traversing a tree/list in
get_next_positive_dentry() or get_next_positive_subdir() a lock can
change from being nested to being a parent which breaks lockdep. This
patch tells lockdep about what we did.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:37 -05:00
Miklos Szeredi
e188dc02d3 vfs: fix d_inode_lookup() dentry ref leak
d_inode_lookup() leaks a dentry reference on IS_DEADDIR().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:37 -05:00
Al Viro
4040153087 security: trim security.h
Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-14 10:45:42 +11:00
Jesper Juhl
f65020a83a XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended
It looks to me like the two ASSERT()s in xfs_trans_add_item() really
want to do a compare (==) rather than assignment (=).
This patch changes it from the latter to the former.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 05293485a0)
2012-02-13 17:09:21 -06:00
Jesper Juhl
05293485a0 XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended
It looks to me like the two ASSERT()s in xfs_trans_add_item() really
want to do a compare (==) rather than assignment (=).
This patch changes it from the latter to the former.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-13 17:06:39 -06:00
Linus Torvalds
19be13cfe3 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
Two bugfixes in XFS for 3.3: one fix passes KMEM_SLEEP to kmem_realloc
instead of 0, and the other resolves a possible deadlock in xfs quotas.

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: use a normal shrinker for the dquot freelist
  xfs: pass KM_SLEEP flag to kmem_realloc() in xlog_recover_add_to_cnt_trans()
2012-02-13 14:19:45 -08:00
Linus Torvalds
3ec1e88b33 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Says Jens:

 "Time to push off some of the pending items.  I really wanted to wait
  until we had the regression nailed, but alas it's not quite there yet.
  But I'm very confident that it's "just" a missing expire on exit, so
  fix from Tejun should be fairly trivial.  I'm headed out for a week on
  the slopes.

  - Killing the barrier part of mtip32xx.  It doesn't really support
    barriers, and it doesn't need them (writes are fully ordered).

  - A few fixes from Dan Carpenter, preventing overflows of integer
    multiplication.

  - A fixup for loop, fixing a previous commit that didn't quite solve
    the partial read problem from Dave Young.

  - A bio integer overflow fix from Kent Overstreet.

  - Improvement/fix of the door "keep locked" part of the cdrom shared
    code from Paolo Benzini.

  - A few cfq fixes from Shaohua Li.

  - A fix for bsg sysfs warning when removing a file it did not create
    from Stanislaw Gruszka.

  - Two fixes for floppy from Vivek, preventing a crash.

  - A few block core fixes from Tejun.  One killing the over-optimized
    ioc exit path, cleaning that up nicely.  Two others fixing an oops
    on elevator switch, due to calling into the scheduler merge check
    code without holding the queue lock."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix lockdep warning on io_context release put_io_context()
  relay: prevent integer overflow in relay_open()
  loop: zero fill bio instead of return -EIO for partial read
  bio: don't overflow in bio_get_nr_vecs()
  floppy: Fix a crash during rmmod
  floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
  cdrom: move shared static to cdrom_device_info
  bsg: fix sysfs link remove warning
  block: don't call elevator callbacks for plug merges
  block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
  mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
  block: strip out locking optimization in put_io_context()
  cdrom: use copy_to_user() without the underscores
  block: fix ioc locking warning
  block: fix NULL icq_cache reference
  block,cfq: change code order
2012-02-11 10:07:11 -08:00
Christoph Hellwig
92b2e5b31d xfs: use a normal shrinker for the dquot freelist
Stop reusing dquots from the freelist when allocating new ones directly, and
implement a shrinker that actually follows the specifications for the
interface.  The shrinker implementation is still highly suboptimal at this
point, but we can gradually work on it.

This also fixes an bug in the previous lock ordering, where we would take
the hash and dqlist locks inside of the freelist lock against the normal
lock ordering.  This is only solvable by introducing the dispose list,
and thus not when using direct reclaim of unused dquots for new allocations.

As a side-effect the quota upper bound and used to free ratio values in
/proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the
new world order.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 04da0c8196)
2012-02-10 12:38:09 -06:00
Greg Kroah-Hartman
5a22e30def Merge tag 'tty-3.3-rc3' tty-next
This is needed to handle the 8250 file merge mess properly for future
patches.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-10 10:25:27 -08:00
Christoph Hellwig
04da0c8196 xfs: use a normal shrinker for the dquot freelist
Stop reusing dquots from the freelist when allocating new ones directly, and
implement a shrinker that actually follows the specifications for the
interface.  The shrinker implementation is still highly suboptimal at this
point, but we can gradually work on it.

This also fixes an bug in the previous lock ordering, where we would take
the hash and dqlist locks inside of the freelist lock against the normal
lock ordering.  This is only solvable by introducing the dispose list,
and thus not when using direct reclaim of unused dquots for new allocations.

As a side-effect the quota upper bound and used to free ratio values in
/proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the
new world order.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-10 12:02:05 -06:00
Linus Torvalds
af5feae3d7 fix 1 mysterious divide error
fix 3 NULL dereference bugs in writeback tracing, on SD card removal w/o umount
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPNI/ZAAoJECvKgwp+S8JaXNsP/3UwYM4R/bIqjsGSEr8mpxzs
 L/9hq85Vql+HDIZ0QT2Zj8aYcF2iYhjxrrVGVjNmINY3bSvniqtrZ6oCejdj7wqR
 vb2ECC3csUnvUbbewCOM4EaowU2CoANhO5xZeDzOu9SnYfMPuxRzjFlxU5WehJm1
 5dKcCtbaO9Bleo5aZyr2AAaZPgE2lG7Hrvk8HghPhEw7ZBtO1Pc3iVegEhIvRiZR
 tUNTCwxE7QV1GehTUTgGpJWNL4qzrbyiqm/Vg+yI27l13IPn6mb/qfe7eHDFUTCb
 Ey6oeojhmmv0Kgc7b38/0U6q1QNL8x+zJP3J21wMmYqn2DtkLgZkI4TAcmBZwwHi
 rGvrwQESzTpiuhdXxQEOQpmrd8IvTmiFQK+IZzJ3uUA197ROdxyWLmdbbMZvsLym
 8rtC+WNR0IJmPmnWNl1pj2df8YmtWkAGLaw2RMj4RFz3AcXBRurAOrCVG8Lk8ptH
 pFS0n4W3ScuTrZFy1jXYjpVumeIAuWJ/ScPJZhVsDJmssZWv4ZNr/X+OExq0z3dJ
 g9IBJ64q1zJiD5gSs2+iXmBTEHP6lpap9hY9WjApep7RuDsM9+o78oVEJcGdXbRM
 StFJoFdyOrsIR0cuo4yd+Lp/1ZpqP2ES++itW2PA96RXAuP/4R040xXqK/qMEczW
 XfCHqpOIqpCF7lxt9bcc
 =shjO
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

fix 1 mysterious divide error
fix 3 NULL dereference bugs in writeback tracing, on SD card removal w/o umount

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
  lib: proportion: lower PROP_MAX_SHIFT to 32 on 64-bit kernel
  writeback: fix NULL bdi->dev in trace writeback_single_inode
  backing-dev: fix wakeup timer races with bdi_unregister()
2012-02-10 09:05:52 -08:00
Jesper Juhl
3e93b8dfd9 BTRFS: Don't include disk-io.h twice in check-integrity.c
Once should be enough.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-10 09:52:19 +01:00
Masanari Iida
42ea19790e jffs2: Fix typo in compr.c
Correct spelling "modul" to "module" in
fs/hffs2/compr.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-09 23:09:37 +01:00
Masanari Iida
934e7d44b8 btrfs: Fix typo in free-space-cache.c
Correct spelling "cace" to "cache" in
fs/btrfs/free-space-cache.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-09 23:09:36 +01:00
Trond Myklebust
b9f9a03150 NFSv4: Ensure we throw out bad delegation stateids on NFS4ERR_BAD_STATEID
To ensure that we don't just reuse the bad delegation when we attempt to
recover the nfs4_state that received the bad stateid error.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-02-09 15:59:21 -05:00
James Morris
9e3ff38647 Merge branch 'next-queue' into next 2012-02-09 17:02:34 +11:00
Xi Wang
1ecd3c7ea7 nilfs2: avoid overflowing segment numbers in nilfs_ioctl_clean_segments()
nsegs is read from userspace.  Limit its value and avoid overflowing nsegs
* sizeof(__u64) in the subsequent call to memdup_user().

This patch complements 481fe17e97 ("nilfs2: potential integer overflow
in nilfs_ioctl_clean_segments()").

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: Haogang Chen <haogangchen@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-08 19:03:51 -08:00
Kent Overstreet
5abebfdd02 bio: don't overflow in bio_get_nr_vecs()
There were two places bio_get_nr_vecs() could overflow:

First, it did a left shift to convert from sectors to bytes immediately
before dividing by PAGE_SIZE.  If PAGE_SIZE ever was less than 512 a great
many things would break, so dividing by PAGE_SIZE >> 9 is safe and will
generate smaller code too.

The nastier overflow was in the DIV_ROUND_UP() (that's what the code was
effectively doing, anyways).  If n + d overflowed, the whole thing would
return 0 which breaks things rather effectively.

bio_get_nr_vecs() doesn't claim to give an exact value anyways, so the
DIV_ROUND_UP() is silly; we could do a straight divide except if a
device's queue_max_sectors was less than PAGE_SIZE we'd return 0.  So we
just add 1; this should always be safe - things will break badly if
bio_get_nr_vecs() returns > BIO_MAX_PAGES (bio_alloc() will suddenly start
failing) but it's queue_max_segments that must guard against this, if
queue_max_sectors is preventing this from happen things are going to
explode on architectures with different PAGE_SIZE.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 22:07:18 +01:00
Jeff Layton
ff4fa4a25a cifs: don't return error from standard_receive3 after marking response malformed
standard_receive3 will check the validity of the response from the
server (via checkSMB). It'll pass the result of that check to handle_mid
which will dequeue it and mark it with a status of
MID_RESPONSE_MALFORMED if checkSMB returned an error. At that point,
standard_receive3 will also return an error, which will make the
demultiplex thread skip doing the callback for the mid.

This is wrong -- if we were able to identify the request and the
response is marked malformed, then we want the demultiplex thread to do
the callback. Fix this by making standard_receive3 return 0 in this
situation.

Cc: stable@vger.kernel.org
Reported-and-Tested-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-07 22:25:31 -06:00
Jeff Layton
8b0192a5f4 cifs: request oplock when doing open on lookup
Currently, it's always set to 0 (no oplock requested).

Cc: <stable@vger.kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-07 22:25:29 -06:00
Jeff Layton
4edc53c1f8 cifs: fix error handling when cifscreds key payload is an error
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-07 22:25:26 -06:00
Linus Torvalds
84f8bf38b9 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix oops in session setup code for null user mounts
  [CIFS] Update cifs Kconfig title to match removal of experimental dependency
  cifs: fix printk format warnings
  cifs: check offset in decode_ntlmssp_challenge()
  cifs: NULL dereference on allocation failure
2012-02-07 14:07:20 -08:00
Tejun Heo
11a3122f6c block: strip out locking optimization in put_io_context()
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue.  It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization.  Strip it out.  If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-07 07:51:30 +01:00
Stanislav Kinsbursky
17347d03c0 NFS: build fixed in case of NFS_USE_NEW_IDMAPPER is undefined
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:05 -05:00
Stanislav Kinsbursky
33faaa380e NFS: pass transport net to rpc_pton() while parse server name
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:05 -05:00
Stanislav Kinsbursky
b48e127884 NFS: pass current net to rpc_pton() while parsing mount options
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:04 -05:00
Stanislav Kinsbursky
c7add9a972 NFS: search for client session id in proper network namespace
Network namespace is taken from request transport and passed as a part of
cb_process_state structure.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:04 -05:00
Stanislav Kinsbursky
bc224f539d NFS: pass proper net rpc_pton() in nfs_dns_resolve_name()
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:04 -05:00
Stanislav Kinsbursky
dc03085834 NFS: make nfs_client_lock per net ns
This patch makes nfs_clients_lock allocated per network namespace. All items it
protects are already network namespace aware.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:03 -05:00
Stanislav Kinsbursky
28cd1b3f26 NFS: make cb_ident_idr per net ns
This patch makes ID's infrastructure network namespace aware. This was done
mainly because of nfs_client_lock, which is desired to be per network
namespace, but protects NFS clients ID's.

NOTE: NFS client's net pointer have to be set prior to ID initialization,
proper assignment was moved.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:03 -05:00
Stanislav Kinsbursky
c25d32b263 NFS: make nfs_volume_list per net ns
This patch splits global list of NFS servers into per-net-ns array of lists.
This looks more strict and clearer.
BTW, this patch also makes "/proc/fs/nfsfs/volumes" content depends on /proc
mount owner pid namespace. See below for details.

NOTE: few words about how was /proc/fs/nfsfs/ entries content show per network
namespace done. This is a little bit tricky and not the best is could be. But
it's cheap (proper fix for /proc conteinerization is a hard nut to crack).
The idea is simple: take proper network namespace from pid namespace
child reaper nsproxy of /proc/ mount creator.
This actually means, that if there are 2 containers with different net
namespace sharing pid namespace, then read of /proc/fs/nfsfs/ entries will
always return content, taken from net namespace of pid namespace creator task
(and thus second namespace set wil be unvisible).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:02 -05:00
Stanislav Kinsbursky
6b13168b36 NFS: make nfs_client_list per net ns
This patch splits global list of NFS clients into per-net-ns array of lists.
This looks more strict and clearer.
BTW, this patch also makes "/proc/fs/nfsfs/servers" entry content depends on
/proc mount owner pid namespace. See below for details.

NOTE: few words about how was /proc/fs/nfsfs/ entries content show per network
namespace done. This is a little bit tricky and not the best is could be. But
it's cheap (proper fix for /proc conteinerization is a hard nut to crack).
The idea is simple: take proper network namespace from pid namespace
child reaper nsproxy of /proc/ mount creator.
This actually means, that if there are 2 containers with different net
namespace sharing pid namespace, then read of /proc/fs/nfsfs/ entries will
always return content, taken from net namespace of pid namespace creator task
(and thus second namespace set wil be unvisible).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:02 -05:00
Bryan Schumaker
3cd0f37a2c NFS: Keep idmapper include files in one place
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:01 -05:00
Bryan Schumaker
e6499c6f4b NFS: Fall back on old idmapper if request_key() fails
This patch removes the CONFIG_NFS_USE_NEW_IDMAPPER compile option.
First, the idmapper will attempt to map the id using /sbin/request-key
and nfsidmap.  If this fails (if /etc/request-key.conf is not configured
properly) then the idmapper will call the legacy code to perform the
mapping.  I left a comment stating where the legacy code begins to make
it easier for somebody to remove in the future.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:01 -05:00
Weston Andros Adamson
2d3fe01c36 NFS: Fix comparison between DS address lists
data_server_cache entries should only be treated as the same if the address
list hasn't changed.

A new entry will be made when an MDS changes an address list (as seen by
GETDEVINFO). The old entry will be freed once all references are gone.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:00 -05:00
Weston Andros Adamson
a030889a01 NFS: start printks w/ NFS: even if __func__ shown
This patch addresses printks that have some context to show that they are
from fs/nfs/, but for the sake of consistency now start with NFS:

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:48:00 -05:00
Weston Andros Adamson
f9fd2d9c1f NFS: printks in fs/nfs/ should start with NFS:
Messages like "Got error -10052 from the server on DESTROY_SESSION. Session
has been destroyed regardless" can be confusing to users who aren't very
familiar with NFS.

NOTE: This patch ignores any printks() that start by printing __func__ - that
will be in a separate patch.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:47:59 -05:00
Bryan Schumaker
b01dd1d8fa NFS: Call test_stateid() and free_stateid() with correct stateids
I noticed that test_stateid() was always using the same stateid for open
and lock recovery.  After poking around a bit, I discovered that it was
always testing with a delegation stateid (even if there was no
delegation present).  I figured this wasn't correct, so now delegation
and open stateids are tested during open_expired() and lock stateids are
tested during lock_expired().

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:47:58 -05:00
Bryan Schumaker
1cab0652ba NFS: Pass a stateid to test_stateid() and free_stateid()
This takes the guesswork out of what stateid to use.  The caller is
expected to figure this out and pass in the correct one.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-06 18:47:58 -05:00
Heiko Carstens
96e02d1586 exec: fix use-after-free bug in setup_new_exec()
Setting the task name is done within setup_new_exec() by accessing
bprm->filename. However this happens after flush_old_exec().
This may result in a use after free bug, flush_old_exec() may
"complete" vfork_done, which will wake up the parent which in turn
may free the passed in filename.
To fix this add a new tcomm field in struct linux_binprm which
contains the now early generated task name until it is used.

Fixes this bug on s390:

  Unable to handle kernel pointer dereference at virtual kernel address 0000000039768000
  Process kworker/u:3 (pid: 245, task: 000000003a3dc840, ksp: 0000000039453818)
  Krnl PSW : 0704000180000000 0000000000282e94 (setup_new_exec+0xa0/0x374)
  Call Trace:
  ([<0000000000282e2c>] setup_new_exec+0x38/0x374)
   [<00000000002dd12e>] load_elf_binary+0x402/0x1bf4
   [<0000000000280a42>] search_binary_handler+0x38e/0x5bc
   [<0000000000282b6c>] do_execve_common+0x410/0x514
   [<0000000000282cb6>] do_execve+0x46/0x58
   [<00000000005bce58>] kernel_execve+0x28/0x70
   [<000000000014ba2e>] ____call_usermodehelper+0x102/0x140
   [<00000000005bc8da>] kernel_thread_starter+0x6/0xc
   [<00000000005bc8d4>] kernel_thread_starter+0x0/0xc
  Last Breaking-Event-Address:
   [<00000000002830f0>] setup_new_exec+0x2fc/0x374

  Kernel panic - not syncing: Fatal exception: panic_on_oops

Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-06 15:15:20 -08:00
Linus Torvalds
71b1b20b8a - Fix a regression in 16-bit Atmel NAND flash which was introduced in 3.1
- Fix breakage with MTD suspend caused by the API rework
  - Fix a problem with resetting the MX28 BCH module
  - A couple of other trivial fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iEYEABECAAYFAk8s6HsACgkQdwG7hYl686MIiACgxpNoUWFvq8z+2UGXxsLnNrio
 hhcAn31H7TY3KUuIQBo4CqG2dEjNwpCw
 =DRWp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.3' of git://git.infradead.org/~dwmw2/mtd-3.3

 - Fix a regression in 16-bit Atmel NAND flash which was introduced in 3.1
 - Fix breakage with MTD suspend caused by the API rework
 - Fix a problem with resetting the MX28 BCH module
 - A couple of other trivial fixes

* tag 'for-linus-3.3-20120204' of git://git.infradead.org/~dwmw2/mtd-3.3:
  Revert "mtd: atmel_nand: optimize read/write buffer functions"
  mtd: fix MTD suspend
  jffs2: do not initialize variable unnecessarily
  mtd: gpmi-nand bugfix: reset the BCH module when it is not MX23
  mtd: nand: fix typo in comment
2012-02-04 07:17:47 -08:00
Trond Myklebust
331818f1c4 NFSv4: Fix an Oops in the NFSv4 getacl code
Commit bf118a342f (NFSv4: include bitmap
in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
where we may need to decode multi-page data. However it fails to take
into account the fact that the variable may be NULL (for the case where
we're not doing multi-page decode), and it also attaches it to the
encoding xdr_stream rather than the decoding one.

The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
call to page_address() with a NULL page pointer.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: stable@vger.kernel.org
2012-02-03 18:50:34 -05:00
Jiri Kosina
972c5ae961 Merge branch 'master' into for-next
Sync with Linus' tree to be able to apply patch to a newer
code (namely drivers/gpu/drm/gma500/psb_intel_lvds.c)
2012-02-03 23:13:05 +01:00
Masanari Iida
982a598ff6 ntfs: fix printk typos in mft.c
Correct two spelling errors "dealocate" to "deallocate"
in fs/ntfs/mft.c

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 22:49:17 +01:00
Masanari Iida
8112b9830a reiserfs: fix printk typo in lbalance.c
Correct spelling "entry_cout" to "entry_count" in
fs/reiserfs/lbalance.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 22:47:38 +01:00
Chandra Seetharaman
4177af3a8a Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage
Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage
of quota macros.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-03 11:32:20 -06:00
Chandra Seetharaman
6bd92a239f Change xfs_sb_from_disk() interface to take a mount pointer
Change xfs_sb_from_disk() interface to take a mount pointer
instead of a superblock pointer.

This is to print mount point specific error messages in future
fixes.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-03 11:21:33 -06:00
Chandra Seetharaman
3673141083 Define a new function xfs_inode_dquot()
Define a new function xfs_inode_dquot() that takes a inode pointer
and a disk quota type and returns the quota pointer for the specified
quota type.

This simplifies the xfs_qm_dqget() error path significantly.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-03 10:53:39 -06:00
Chandra Seetharaman
6967b964c1 Define a new function xfs_this_quota_on()
Create a new function xfs_this_quota_on() that takes a xfs_mount
data structure and a disk quota type and returns true if the specified
type of quota is ON in the xfs_mount data structure.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-03 09:42:53 -06:00
Linus Torvalds
6c073a7ee2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix safety of rbd_put_client()
  rbd: fix a memory leak in rbd_get_client()
  ceph: create a new session lock to avoid lock inversion
  ceph: fix length validation in parse_reply_info()
  ceph: initialize client debugfs outside of monc->mutex
  ceph: change "ceph.layout" xattr to be "ceph.file.layout"
2012-02-02 15:47:33 -08:00
Amit Sahrawat
b995730845 xfs: kill the unused XFS_BB_FSB_OFFSET macro
Removing the macro, as this is no more needed in the code.
Tried to find the reference when it was last used - but the usage
for this seemed to have been dropped long time ago.

Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-02 17:08:04 -06:00
Shirish Pargaonkar
de47a4176c cifs: Fix oops in session setup code for null user mounts
For null user mounts, do not invoke string length function
during session setup.

Cc: <stable@kernel.org
Reported-and-Tested-by: Chris Clayton <chris2553@googlemail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-02-02 16:59:09 -06:00
Christopher Yeoh
8cdb878dcb Fix race in process_vm_rw_core
This fixes the race in process_vm_core found by Oleg (see

  http://article.gmane.org/gmane.linux.kernel/1235667/

for details).

This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.

In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-02 12:55:17 -08:00
Alex Elder
d8fb02abdc ceph: create a new session lock to avoid lock inversion
Lockdep was reporting a possible circular lock dependency in
dentry_lease_is_valid().  That function needs to sample the
session's s_cap_gen and and s_cap_ttl fields coherently, but needs
to do so while holding a dentry lock.  The s_cap_lock field was
being used to protect the two fields, but that can't be taken while
holding a lock on a dentry within the session.

In most cases, the s_cap_gen and s_cap_ttl fields only get operated
on separately.  But in three cases they need to be updated together.
Implement a new lock to protect the spots updating both fields
atomically is required.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:19 -08:00
Xi Wang
32852a81bc ceph: fix length validation in parse_reply_info()
"len" is read from network and thus needs validation.  Otherwise, given
a bogus "len" value, p+len could be an out-of-bounds pointer, which is
used in further parsing.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:11 -08:00
Alex Elder
114fc47492 ceph: change "ceph.layout" xattr to be "ceph.file.layout"
The virtual extended attribute named "ceph.layout" is meaningful
only for regular files.  Change its name to be "ceph.file.layout" to
more directly reflect that in the ceph xattr namespace.  Preserve
the old "ceph.layout" name for the time being (until we decide it's
safe to get rid of it entirely).

Add a missing initializer for "readonly" in the terminating entry.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-02-02 12:48:52 -08:00
Greg Kroah-Hartman
bd1d462e13 Merge 3.3-rc2 into the driver-core-next branch.
This was done to resolve a merge and build problem with the
drivers/acpi/processor_driver.c file.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-02 11:24:44 -08:00
Eric W. Biederman
4e75732035 sysctl: Don't call sysctl_follow_link unless we are a link.
There are no functional changes.  Just code motion to make it
clear that we don't follow a link between sysctl roots unless the
directory entry actually is a link.

Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:21:38 -08:00
Eric W. Biederman
60f126d93b sysctl: Comments to make the code clearer.
Document get_subdir and that find_subdir alwasy takes a reference.

Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:57 -08:00
Eric W. Biederman
0eb97f38d2 sysctl: Correct error return from get_subdir
When insert_header fails ensure we return the proper error value
from get_subdir.  In practice nothing cares, but there is no
need to be sloppy.

Reported-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:40 -08:00
Eric W. Biederman
51f72f4a0f sysctl: An easier to read version of find_subdir
Suggested-by:  Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01 19:20:30 -08:00
Oleg Nesterov
6d08f2c713 proc: make sure mem_open() doesn't pin the target's memory
Once /proc/pid/mem is opened, the memory can't be released until
mem_release() even if its owner exits.

Change mem_open() to do atomic_inc(mm_count) + mmput(), this only
pins mm_struct. Change mem_rw() to do atomic_inc_not_zero(mm_count)
before access_remote_vm(), this verifies that this mm is still alive.

I am not sure what should mem_rw() return if atomic_inc_not_zero()
fails. With this patch it returns zero to match the "mm == NULL" case,
may be it should return -EINVAL like it did before e268337d.

Perhaps it makes sense to add the additional fatal_signal_pending()
check into the main loop, to ensure we do not hold this memory if
the target task was oom-killed.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Oleg Nesterov
572d34b946 proc: unify mem_read() and mem_write()
No functional changes, cleanup and preparation.

mem_read() and mem_write() are very similar. Move this code into the
new common helper, mem_rw(), which takes the additional "int write"
argument.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Oleg Nesterov
71879d3cb3 proc: mem_release() should check mm != NULL
mem_release() can hit mm == NULL, add the necessary check.

Cc: stable@kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 14:39:01 -08:00
Artem Bityutskiy
7d73101921 mtd: fix merge conflict resolution breakage
This patch fixes merge conflict resolution breakage introduced by merge
d3712b9dfc ("Merge tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream").

The commit changed 'mtd_can_have_bb()' function and made it always
return zero, which is incorrect.  Instead, we need it to return whether
the underlying flash device can have bad eraseblocks or not.  UBI needs
this information because it affects how it handles the underlying flash.
E.g., if the underlying flash is NOR, it cannot have bad blocks and any
write or erase error is fatal, and all we can do is to switch to R/O
mode.  We do not need to reserve a pool of good eraseblocks for bad
eraseblocks handling, and so on.

This patch also removes 'mtd_can_have_bb()' invocations from Logfs to
ensure correct Logfs behavior.

I've tested that with this patch UBI works on top of NOR and NAND
flashes emulated by mtdram and nandsim correspondingly.

This patch is based on patch from Linus Torvalds.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jörn Engel <joern@logfs.org>
Acked-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01 11:10:24 -08:00
Wu Fengguang
15eb77a07c writeback: fix NULL bdi->dev in trace writeback_single_inode
bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when the
tearing down the original bdi. Fix trace_writeback_single_inode to
use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a
teared down bdi.

Cc: <stable@kernel.org>
Reported-by: Rabin Vincent <rabin@rab.in>
Tested-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2012-02-01 16:53:40 +08:00
Chris Mason
d98456fcaf Btrfs: don't reserve data with extents locked in btrfs_fallocate
btrfs_fallocate tries to allocate space only if ranges in the file don't
already exist.  But the enospc checks it does are not allowed with
extents locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-31 20:27:41 -05:00
Trond Myklebust
a4980e7840 NFSv4: ACCESS validation doesn't require a full attribute refresh
We only really need to check the change attribute, so let's just use the
server->cache_consistency_bitmask.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:24 -05:00
Trond Myklebust
8b7e3f49dd NFSv4: Don't decode fs_locations if we didn't ask for them...
Currently, the server can potentially cause us to Oops by returning an
fs_locations request that we didn't actually request.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:23 -05:00
Jeff Layton
c15c928f36 nfs: remove unneeded NULL pointer check in nfs4_remote_mount
"data" is never NULL here.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:23 -05:00
Benny Halevy
d36b7cf7c6 pnfs: clean up initiate_file_draining layout lookup
Fixes the following compiler warning:

fs/nfs/callback_proc.c: In function 'do_callback_layoutrecall':
fs/nfs/callback_proc.c:115:26: warning: 'lo' may be used uninitialized in this function

Reported-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:22 -05:00
Trond Myklebust
7d9dea915f NFS: Use kcalloc() when allocating arrays
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:22 -05:00
Trond Myklebust
4601df20fb NFSv4: Avoid thundering herd issues with nfs_release_seqid
Store a pointer to the rpc_task in struct nfs_seqid so that we can wake up
only that request that is able to grab the lock after we've released it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:21 -05:00
Trond Myklebust
a613fa168a SUNRPC: constify the rpc_program
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:20 -05:00
Stanislav Kinsbursky
4cb54ca206 SUNRPC: search for service transports in network namespace context
Service transports are parametrized by network namespace. And thus lookup of
transport instance have to take network namespace into account.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
2012-01-31 19:28:19 -05:00
Stanislav Kinsbursky
babea479b7 NFS: remove unused nfs4_find_client_no_ident function
Looks like this function survived after some cleanup patch without a reason.
Now it's not called or referenced and I believe, that it can be simply removed.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:19 -05:00
Stanislav Kinsbursky
246590f56c SUNRPC: register service stats /proc entries in passed network namespace context
This patch makes it possible to create NFSd program entry ("/proc/net/rpc/nfsd")
in passed network namespace context instead of hard-coded "init_net".

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:18 -05:00
Stanislav Kinsbursky
ec7652aaf2 SUNRPC: register RPC stats /proc entries in passed network namespace context
This patch makes it possible to create NFS program entry ("/proc/net/rpc/nfs")
in passed network namespace context instead of hard-coded "init_net".

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:17 -05:00
Stanislav Kinsbursky
170942726b NFS: decode destination address in proper network namespace context
This patch replaces "init_net" with NFS client's owner net in rpc_pton() call
in decode_ds_addr().

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:17 -05:00
Stanislav Kinsbursky
599ec129c2 NFS: parse DNS cache in proper network namespace context
This patch replaces "init_net" with cache's owner net in rpc_pton() call.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:17 -05:00
Stanislav Kinsbursky
5ecebb7c7f SUNRPC: unregister service on creation in current network namespace
On service shutdown we can be sure, that no more users of it left except
current. Thus it looks like using current network namespace context is safe in
this case.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:14 -05:00
Stanislav Kinsbursky
f2ac4dc911 SUNRPC: parametrize rpc_uaddr2sockaddr() by network context
Parametrize rpc_uaddr2sockaddr() by network context and thus force it's callers to pass
in network context instead of using hard-coded "init_net".

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:12 -05:00
Stanislav Kinsbursky
90100b1766 SUNRPC: parametrize rpc_pton() by network context
Parametrize rpc_pton() by network context and thus force it's callers to pass
in network context instead of using hard-coded "init_net".

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:12 -05:00
Trond Myklebust
961a828df6 SUNRPC: Fix potential races in xprt_lock_write_next()
We have to ensure that the wake up from the waitqueue and the assignment
of xprt->snd_task are atomic. We can do this by assigning the snd_task
while under the waitqueue spinlock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:08 -05:00
Trond Myklebust
2aeb98f498 NFS: Ensure that mmapped pages remain stable during writeback
Ensure that nfs_vm_page_mkwrite() waits for the page writeback to
complete before the application is allowed to modify page
contents.
The main reason for wanting to do this in NFS is to ensure that the
server doesn't get confused if we have to resend the RPC request
due to a dropped/missed reply.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:08 -05:00
Trond Myklebust
536e43d12b NFS: Optimise away unnecessary setattrs for open(O_TRUNC);
Currently, we will correctly optimise away a truncate that doesn't
change the file size. However, in the case of open(O_TRUNC), we
also want to optimise away the time changes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:07 -05:00
Trond Myklebust
48c22eb210 NFS: Move struct nfs_unique_id into struct nfs_seqid_counter
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:07 -05:00
Trond Myklebust
7ba127ab9f NFSv4: Move contents of struct rpc_sequence into struct nfs_seqid_counter
Clean up.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:06 -05:00
Trond Myklebust
9d12b216aa NFSv41: Add a new helper nfs4_init_sequence()
Clean up

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 19:28:06 -05:00
Trond Myklebust
d2d7ce28a2 NFSv4: Replace lock_owner->ld_id with an ida based allocator
Again, We're unlikely to ever need more than 2^31 simultaneous lock
owners, so let's replace the custom allocator.

Now that there are no more users, we can also get rid of the custom
allocator code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:28 -05:00
Trond Myklebust
9157c31dd6 NFSv4: Replace state_owner->so_owner_id with an ida based allocator
We're unlikely to ever need more than 2^31 simultaneous open owners,
so let's replace the custom allocator with the generic ida allocator.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:28 -05:00
Trond Myklebust
d1e284d50a NFSv4: Clean up nfs4_get_state_owner
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:28 -05:00
Trond Myklebust
1313e6034a NFS: Remove unnecessary includes from linux/nfs_fs_i.h
Also from linux/nfs_xdr.h.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:28 -05:00
Stanislav Kinsbursky
2561d618ff NFS: remove RPC PipeFS mount point reference from blocklayout routines
This is a cleanup patch. We don't need this reference anymore, because
blocklayout pipes dentries now creates and destroys in per-net operations and
on PipeFS mount/umount notification.
Note that nfs4blocklayout_register_net() now returns 0 instead of -ENOENT in
case of PipeFS superblock absence. This is ok, because blocklayout pipe dentry
will be created on PipeFS mount event.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
627f30668f NFS: blocklayout PipeFS notifier introduced
This patch subscribes blocklayout pipes to RPC pipefs notifications. Notifier
is registering on blocklayout module load. This notifier callback is
responsible for creation/destruction of PipeFS blocklayout pipe dentry.
Note that no locking required in notifier callback because PipeFS superblock
pointer is passed as an argument from it's creation or destruction routine and
thus we can be sure about it's validity.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
9e2e74dba6 NFS: blocklayout pipe creation per network namespace context introduced
This patch implements blocklayout pipe creation and registration per each
existent network namespace.
This was achived by registering NFS per-net operations, responsible for
blocklayout pipe allocation/register and unregister/destruction instead of
initialization and destruction of static "bl_device_pipe" pipe (this one was
removed).
Note, than pointer to network blocklayout pipe is stored in per-net "nfs_net"
structure, because allocating of one more per-net structure for blocklayout
module looks redundant.
This patch also changes dev_remove() function prototype (and all it's callers,
where it' requied) by adding network namespace pointer parameter, which is used
to discover proper blocklayout pipe for rpc_queue_upcall() call.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
332dfab6f4 NFS: handle blocklayout pipe PipeFS dentry by network namespace aware routines
This patch makes blocklayout pipe dentry allocated and destroyed in network
namespace context by PipeFS network namespace aware routines.
Network namespace context is obtained from nfs_client structure.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
eee17325f1 NFS: idmap PipeFS notifier introduced
v2:
1) Added "nfs_idmap_init" and "nfs_idmap_quit" definitions for kernels built
without CONFIG_NFS_V4 option set.

This patch subscribes NFS clients to RPC pipefs notifications. Idmap notifier
is registering on NFS module load. This notifier callback is responsible for
creation/destruction of PipeFS idmap pipe dentry for NFS4 clients.

Since ipdmap pipe is created in rpc client pipefs directory, we have make sure,
that this directory has been created already. IOW RPC client notifier callback
has been called already. To achive this, PipeFS notifier priorities has been
introduced (RPC clients notifier priority is greater than NFS idmap one).
But this approach gives another problem: unlink for RPC client directory will
be called before NFS idmap pipe unlink on UMOUNT event and will fail, because
directory is not empty.
The solution, introduced in this patch, is to try to remove client directory
once again after idmap pipe was unlinked. This looks like ugly hack, so
probably it should be replaced in some more elegant way.

Note that no locking required in notifier callback because PipeFS superblock
pointer is passed as an argument from it's creation or destruction routine and
thus we can be sure about it's validity.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
4929d1d33f NFS: handle NFS idmap pipe PipeFS dentries by network namespace aware routines
This patch makes NFS idmap pipes dentries allocated and destroyed in network
namespace context by PipeFS network namespace aware routines.
Network namespace context is obtained from nfs_client structure.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
7bb782c6ac NFS: create callback transports in parent transport network namespace
This patch replaces static "init_net" references with parent transport xprt_net
reference. Thus callback transports will be created in the same network
namespace as respective NFS mount point was created.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
6d59b8d599 NFS: pass NFS client owner network namespace to RPC client creation routine
This patch replaces static "init_net" with nfs_client->net pointer in RPC
client creation calls.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:27 -05:00
Stanislav Kinsbursky
e50a7a1a42 NFS: make NFS client allocated per network namespace context
This patch adds new net variable to nfs_client structure. This variable is set
on NFS client creation and cheched during matching NFS client search.
Initially current->nsproxy->net_ns is used as network namespace owner for new
NFS client to create. This network namespace pointer is set during mount
options parsing and thus can be passed from user-spave utils in future if will
be necessary.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
39cb67b9a0 NFS: remove RPC PipeFS mount point references from NFS cache routines
This is a cleanup patch. We don't need this reference anymore, because DNS
resolver cache now creates it's dentries in per-net operations and on PipeFS
mount/umount notification.
Note that nfs_cache_register_net() now returns 0 instead of -ENOENT in case of
PiepFS superblock absence. This is ok, Dns resolver cache will be regestered on
PipeFS mount event.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
9df69c81b4 NFS: DNS resolver PipeFS notifier introduced
This patch subscribes DNS resolver caches to RPC pipefs notifications. Notifier
is registering on NFS module load. This notifier callback is responsible for
creation/destruction of PipeFS DNS resolver cache directory.
Note that no locking required in notifier callback because PipeFS superblock
pointer is passed as an argument from it's creation or destruction routine and
thus we can be sure about it's validity.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
1b340d0118 NFS: DNS resolver cache per network namespace context introduced
This patch implements DNS resolver cache creation and registration for each
alive network namespace context.
This was done by registering NFS per-net operations, responsible for DNS cache
allocation/register and unregister/destructioning instead of initialization and
destruction of static "nfs_dns_resolve" cache detail (this one was removed).
Pointer to network dns resolver cache is stored in new per-net "nfs_net"
structure.
This patch also changes nfs_dns_resolve_name() function prototype (and it's
calls) by adding network pointer parameter, which is used to get proper DNS
resolver cache pointer for do_cache_lookup_wait() call.

Note: empty nfs_dns_resolver_init() and nfs_dns_resolver_destroy() functions
will be used in next patch in the series.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
5c1cacb175 NFS: handle NFS caches dentries by network namespace aware routines
This patch makes NFS caches PipeFS dentries allocated and destroyed in network
namespace context by PipeFS network namespace aware routines.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
9222b95506 NFS: split cache creation and PipeFS registration
This precursor patch splits NFS cache creation and PipeFS registartion.
It's required for latter split of NFS DNS resolver cache creation per network
namespace context and PipeFS registration/unregistration on MOUNT/UMOUNT
events.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
820f9442e7 SUNRPC: split cache creation and PipeFS registration
This precursor patch splits SUNRPC cache creation and PipeFS registartion.
It's required for latter split of NFS DNS resolver cache creation per network
namespace context and PipeFS registration/unregistration on MOUNT/UMOUNT
events.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
30507f58ce SUNRPC: remove RPC PipeFS mount point reference from RPC client
This is a cleanup patch. We don't need this reference anymore.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:26 -05:00
Stanislav Kinsbursky
0157d021d2 SUNRPC: handle RPC client pipefs dentries by network namespace aware routines
v2:
1) "Over-put" of PipeFS mount point fixed. Fix is ugly, but allows to bisect
the patch set. And it will be removed later in the series.

This patch makes RPC clients PipeFs dentries allocations in it's owner network
namespace context.
RPC client pipefs dentries creation logic has been changed:
1) Pipefs dentries creation by sb was moved to separated function, which will
be used for handling PipeFS mount notification.
2) Initial value of RPC client PipeFS dir dentry is set no NULL now.

RPC client pipefs dentries cleanup logic has been changed:
1) Cleanup is done now in separated rpc_remove_pipedir() function, which takes
care about pipefs superblock locking.

Also this patch removes slashes from cb_program.pipe_dir_name and from
NFS_PIPE_DIRNAME to make rpc_d_lookup_sb() work. This doesn't affect
vfs_path_lookup() results in nfs4blocklayout_init() since this slash is cutted
off anyway in link_path_walk().

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:25 -05:00
Stanislav Kinsbursky
c239d83b99 SUNRPC: split SUNPRC PipeFS dentry and private pipe data creation
This patch is a final step towards to removing PipeFS inode references from
kernel code other than PipeFS itself. It makes all kernel SUNRPC PipeFS users
depends on pipe private data, which state depend on their specific operations,
etc.
This patch completes SUNRPC PipeFS preparations and allows to create pipe
private data and PipeFS dentries independently.
Next step will be making SUNPRC PipeFS dentries allocated by SUNRPC PipeFS
network namespace aware routines.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:25 -05:00
Stanislav Kinsbursky
d706ed1f50 SUNPRC: cleanup RPC PipeFS pipes upcall interface
RPC pipe upcall doesn't requires only private pipe data. Thus RPC inode
references in this code can be removed.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-31 18:20:25 -05:00
Mitsuo Hayasaka
021000e59c xfs: show uuid when mount fails due to duplicate uuid
When a system tries to mount a filesystem (FS) using UUID, the xfs
returns -EINVAL and shows a message if a FS with the same UUID has
been already mounted. It is useful to output the duplicate UUID
with it.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-31 13:37:33 -06:00
Eric W. Biederman
d5c38b137a sysfs: Update the name hash when renaming sysfs entries
This fixes a bug introduced with sysfs name hashes where renaming a
network device appears to succeed but silently makes the sysfs files for
that network device inaccessible.

In at least one configuration this bug has stopped networking from
coming up during boot.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-31 11:30:48 -08:00
Steve French
2a73ca8208 [CIFS] Update cifs Kconfig title to match removal of experimental dependency
Removed the dependency on CONFIG_EXPERIMENTAL but forgot to update
the text description to be consistent.

Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-31 12:51:24 -06:00
Mitsuo Hayasaka
4505360376 xfs: pass KM_SLEEP flag to kmem_realloc() in xlog_recover_add_to_cnt_trans()
The kmem_realloc() in xfs is given KM_* memory allocation flags. And it
allocates memory using kmalloc() after they are converted to gfp_mask
flags. In xlog_recover_add_to_cont_trans(), 0u is passed to kmem_realloc(),
instead of them. I guess it is preferred to use them, and here memory must
be allocated but don't have to be done with GFP_ATOMIC. So, this patch
changes it to KM_SLEEP.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-31 12:11:18 -06:00
Linus Torvalds
d3712b9dfc Pull request from git://github.com/prasad-joshi/logfs_upstream.git
There are few important bug fixes for LogFS
 
 Shortlog:
 Joern Engel (5):
      logfs: Prevent memory corruption
      logfs: remove useless BUG_ON
      logfs: Free areas before calling generic_shutdown_super()
      logfs: Grow inode in delete path
      Logfs: Allow NULL block_isbad() methods
 
 Prasad Joshi (5):
      logfs: update page reference count for pined pages
      logfs: take write mutex lock during fsync and sync
      logfs: set superblock shutdown flag after generic sb shutdown
      logfs: Propagate page parameter to __logfs_write_inode
      MAINTAINERS: Add Prasad Joshi in LogFS maintiners
 
 Diffstat:
  MAINTAINERS          |    1 +
  fs/logfs/dev_mtd.c   |   26 +++++++++++-------------
  fs/logfs/dir.c       |    2 +-
  fs/logfs/file.c      |    2 +
  fs/logfs/gc.c        |    2 +-
  fs/logfs/inode.c     |    4 ++-
  fs/logfs/journal.c   |    1 -
  fs/logfs/logfs.h     |    5 +++-
  fs/logfs/readwrite.c |   51 +++++++++++++++++++++++++++++++++----------------
  fs/logfs/segment.c   |   51 ++++++++++++++++++++++++++++++++++++++-----------
  fs/logfs/super.c     |    3 +-
  11 files changed, 99 insertions(+), 49 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPKByhAAoJEDFA/f+3K+ZNQSUP/3gACcIwcsl+FnXPWBtz9XIG
 g0DjXoRDd/sR0u25nLgjCVdBJgx5FVEyA+PvLgvvUU2KCAsqI5F/EQ+fLJs21YEN
 TzepBO5aHtFZbNEjo6WiXOlDbBePTtk44WrN6jqoCHM/aDeT4Wof3NZBmHWNN1PX
 B2RtEZ0ypJ7/b1OY2LUNcQfTaJXNgVoP8Hkx4KGY5LUVxVrBXxvDTU7YbkS8a+ys
 1Yje/EQ4XD4RyZB42TmFEuTenvGPRgMGVFdnkJKuON8EmJQ8Hc61jEf5d7Q8sWef
 dH5F/ptoAaR9a9LbbO8LoYuBZ8MR8848NPsrNPpr/gWntj46Z79yII8Jr7YoSDyw
 zq5G2dZbwlbVrtVWKGae47THkNB8bljR/g4cijvPAkvuIAku6mg+dgjVHAhZ/t+J
 xu8+Gy2sWHUH2gmoSXuoNyppOvYpPIRd5RB16PizMH3bw+sMad2K8/rfOKnmF1/r
 HTM2jZ5bDcHVDjSuVI6u2m/mQX+PmPXUTffreaFXuSI75YpT0dqN3nponTX4EgFI
 Ad9ZBQvdg8w1LGDsNxIAaqrGx4Q87RxqfUV4W/wo6N8gKsp+I2y4GtYMeD/CEKyi
 wncKg10YwoMXZj7cBAkWgPlgrOBYCPwYZc/1DVRHvqrHo/m13SJrWDKkNKVvoXzH
 2y4Tfi5w1WDRUT7yeoyK
 =TA1A
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream

There are few important bug fixes for LogFS

* tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream:
  Logfs: Allow NULL block_isbad() methods
  logfs: Grow inode in delete path
  logfs: Free areas before calling generic_shutdown_super()
  logfs: remove useless BUG_ON
  MAINTAINERS: Add Prasad Joshi in LogFS maintiners
  logfs: Propagate page parameter to __logfs_write_inode
  logfs: set superblock shutdown flag after generic sb shutdown
  logfs: take write mutex lock during fsync and sync
  logfs: Prevent memory corruption
  logfs: update page reference count for pined pages

Fix up conflict in fs/logfs/dev_mtd.c due to semantic change in what
"mtd->block_isbad" means in commit f2933e86ad: "Logfs: Allow NULL
block_isbad() methods" clashing with the abstraction changes in the
commits 7086c19d07: "mtd: introduce mtd_block_isbad interface" and
d58b27ed58: "logfs: do not use 'mtd->block_isbad' directly".

This resolution takes the semantics from commit f2933e86ad, and just
makes mtd_block_isbad() return zero (false) if the 'block_isbad'
function is NULL.  But that also means that now "mtd_can_have_bb()"
always returns 0.

Now, "mtd_block_markbad()" will obviously return an error if the
low-level driver doesn't support bad blocks, so this is somewhat
non-symmetric, but it actually makes sense if a NULL "block_isbad"
function is considered to mean "I assume that all my blocks are always
good".
2012-01-31 09:23:59 -08:00
Randy Dunlap
000f9bb839 cifs: fix printk format warnings
Fix printk format warnings for ssize_t variables:

fs/cifs/connect.c:2145:3: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'
fs/cifs/connect.c:2152:3: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'
fs/cifs/connect.c:2160:3: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'
fs/cifs/connect.c:2170:3: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc:	linux-cifs@vger.kernel.org
2012-01-31 07:42:08 -06:00
Dan Carpenter
4991a5faab cifs: check offset in decode_ntlmssp_challenge()
We should check that we're not copying memory from beyond the end of the
blob.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-01-31 07:42:06 -06:00
Dan Carpenter
1347440db6 sysctl: fix memset parameters in setup_sysctl_set()
The current code is a nop.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30 19:14:08 -08:00
Dan Carpenter
4798178709 sysctl: remove an unused variable
"links" is never used, so we can remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30 19:13:46 -08:00
Linus Torvalds
0a96265754 Here are some patches for the 3.3-rc1 tree.
It contains the removal of the sysdev code, now that all users of it are
 gone, as well as some sysfs bugfixes that have been reported by users.
 There are also some documentation updates here as well.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk8jKW4ACgkQMUfUDdst+ynAUwCfVWwHJxpb4DSSMVZhGOnHMQrL
 ZjIAn00gPeSs5u8y1nPvFrFikbon4FDs
 =bzVy
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.3-rc1-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Here are some patches for the 3.3-rc1 tree.

It contains the removal of the sysdev code, now that all users of it are
gone, as well as some sysfs bugfixes that have been reported by users.
There are also some documentation updates here as well.

* tag 'driver-core-3.3-rc1-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: Complain bitterly about attempts to remove files from nonexistent directories.
  stable: update documentation to ask for kernel version
  base/core.c:fix typo in comment in function device_add
  Documentation: devres: add allocation functions to list of supported calls
  Documentation update for the driver model core
  kernel-doc: fix new warnings in driver-core
  kernel-doc: fix new warnings in debugfs
  kernel-doc: fix new warnings in device.h
  driver core: remove drivers/base/sys.c and include/linux/sysdev.h
2012-01-28 18:20:48 -08:00
Linus Torvalds
67d2433ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix reservations in btrfs_page_mkwrite
  Btrfs: advance window_start if we're using a bitmap
  btrfs: mask out gfp flags in releasepage
  Btrfs: fix enospc error caused by wrong checks of the chunk
  Btrfs: do not defrag a file partially
  Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
  Btrfs: use cluster->window_start when allocating from a cluster bitmap
  Btrfs: Check for NULL page in extent_range_uptodate
  btrfs: Fix busyloops in transaction waiting code
  Btrfs: make sure a bitmap has enough bytes
  Btrfs: fix uninit warning in backref.c
2012-01-28 17:00:19 -08:00
Joern Engel
f2933e86ad Logfs: Allow NULL block_isbad() methods
Not all mtd drivers define block_isbad().  Let's assume no bad blocks
instead of refusing to mount.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:43:40 +05:30
Joern Engel
bbe0138712 logfs: Grow inode in delete path
Can be necessary if an inode gets deleted (through -ENOSPC) before being
written.  Might be better to move this into logfs_write_rec(), but for
now go with the stupid&safe patch.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:43:07 +05:30
Joern Engel
1bcceaff8c logfs: Free areas before calling generic_shutdown_super()
Or hit an assertion in map_invalidatepage() instead.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:42:39 +05:30
Joern Engel
6c69494f6b logfs: remove useless BUG_ON
It prevents write sizes >4k.

Signed-off-by: Joern Engel <joern@logfs.org>
2012-01-28 11:41:56 +05:30
Prasad Joshi
0bd90387ed logfs: Propagate page parameter to __logfs_write_inode
During GC LogFS has to rewrite each valid block to a separate segment.
Rewrite operation reads data from an old segment and writes it to a
newly allocated segment. Since every write operation changes data
block pointers maintained in inode, inode should also be rewritten.

In GC path to avoid AB-BA deadlock LogFS marks a page with
PG_pre_locked in addition to locking the page (PG_locked). The page
lock is ignored iff the page is pre-locked.

LogFS uses a special file called segment file. The segment file
maintains an 8 bytes entry for every segment. It keeps track of erase
count, level etc. for every segment.

Bad things happen with a segment belonging to the segment file is GCed

 ------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/readwrite.c:297!
invalid opcode: 0000 [#1] SMP
Modules linked in: logfs joydev usbhid hid psmouse e1000 i2c_piix4
		serio_raw [last unloaded: logfs]
Pid: 20161, comm: mount Not tainted 3.1.0-rc3+ #3 innotek GmbH
		VirtualBox
EIP: 0060:[<f809132a>] EFLAGS: 00010292 CPU: 0
EIP is at logfs_lock_write_page+0x6a/0x70 [logfs]
EAX: 00000027 EBX: f73f5b20 ECX: c16007c8 EDX: 00000094
ESI: 00000000 EDI: e59be6e4 EBP: c7337b28 ESP: c7337b18
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mount (pid: 20161, ti=c7336000 task=eb323f70 task.ti=c7336000)
Stack:
f8099a3d c7337b24 f73f5b20 00001002 c7337b50 f8091f6d f8099a4d f80994e4
00000003 00000000 c7337b68 00000000 c67e4400 00001000 c7337b80 f80935e5
00000000 00000000 00000000 00000000 e1fcf000 0000000f e59be618 c70bf900
Call Trace:
[<f8091f6d>] logfs_get_write_page.clone.16+0xdd/0x100 [logfs]
[<f80935e5>] logfs_mod_segment_entry+0x55/0x110 [logfs]
[<f809460d>] logfs_get_segment_entry+0x1d/0x20 [logfs]
[<f8091060>] ? logfs_cleanup_journal+0x50/0x50 [logfs]
[<f809521b>] ostore_get_erase_count+0x1b/0x40 [logfs]
[<f80965b8>] logfs_open_area+0xc8/0x150 [logfs]
[<c141a7ec>] ? kmemleak_alloc+0x2c/0x60
[<f809668e>] __logfs_segment_write.clone.16+0x4e/0x1b0 [logfs]
[<c10dd563>] ? mempool_kmalloc+0x13/0x20
[<c10dd563>] ? mempool_kmalloc+0x13/0x20
[<f809696f>] logfs_segment_write+0x17f/0x1d0 [logfs]
[<f8092e8c>] logfs_write_i0+0x11c/0x180 [logfs]
[<f8092f35>] logfs_write_direct+0x45/0x90 [logfs]
[<f80934cd>] __logfs_write_buf+0xbd/0xf0 [logfs]
[<c102900e>] ? kmap_atomic_prot+0x4e/0xe0
[<f809424b>] logfs_write_buf+0x3b/0x60 [logfs]
[<f80947a9>] __logfs_write_inode+0xa9/0x110 [logfs]
[<f8094cb0>] logfs_rewrite_block+0xc0/0x110 [logfs]
[<f8095300>] ? get_mapping_page+0x10/0x60 [logfs]
[<f8095aa0>] ? logfs_load_object_aliases+0x2e0/0x2f0 [logfs]
[<f808e57d>] logfs_gc_segment+0x2ad/0x310 [logfs]
[<f808e62a>] __logfs_gc_once+0x4a/0x80 [logfs]
[<f808ed43>] logfs_gc_pass+0x683/0x6a0 [logfs]
[<f8097a89>] logfs_mount+0x5a9/0x680 [logfs]
[<c1126b21>] mount_fs+0x21/0xd0
[<c10f6f6f>] ? __alloc_percpu+0xf/0x20
[<c113da41>] ? alloc_vfsmnt+0xb1/0x130
[<c113db4b>] vfs_kern_mount+0x4b/0xa0
[<c113e06e>] do_kern_mount+0x3e/0xe0
[<c113f60d>] do_mount+0x34d/0x670
[<c10f2749>] ? strndup_user+0x49/0x70
[<c113fcab>] sys_mount+0x6b/0xa0
[<c142d87c>] syscall_call+0x7/0xb
Code: f8 e8 8b 93 39 c9 8b 45 f8 3e 0f ba 28 00 19 d2 85 d2 74 ca eb d0 0f 0b 8d 45 fc 89 44 24 04 c7 04 24 3d 9a 09 f8 e8 09 92 39 c9 <0f> 0b 8d 74 26 00 55 89 e5 3e 8d 74 26 00 8b 10 80 e6 01 74 09
EIP: [<f809132a>] logfs_lock_write_page+0x6a/0x70 [logfs] SS:ESP 0068:c7337b18
---[ end trace 96e67d5b3aa3d6ca ]---

The patch passes locked page to __logfs_write_inode. It calls function
logfs_get_wblocks() to pre-lock the page. This ensures any further
attempts to lock the page are ignored (esp from get_erase_count).

Acked-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:38:25 +05:30
Prasad Joshi
ecfd890991 logfs: set superblock shutdown flag after generic sb shutdown
While unmounting the file system LogFS calls generic_shutdown_super.
The function does file system independent superblock shutdown.
However, it might result in call file system specific inode eviction.

LogFS marks FS shutting down by setting bit LOGFS_SB_FLAG_SHUTDOWN in
super->s_flags. Since, inode eviction might call truncate on inode,
following BUG is observed when file system is unmounted:

------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/segment.c:362!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU 3
Modules linked in: logfs binfmt_misc ppdev virtio_blk parport_pc lp
	parport psmouse floppy virtio_pci serio_raw virtio_ring virtio

Pid: 1933, comm: umount Not tainted 3.0.0+ #4 Bochs Bochs
RIP: 0010:[<ffffffffa008c841>]  [<ffffffffa008c841>]
		logfs_segment_write+0x211/0x230 [logfs]
RSP: 0018:ffff880062d7b9e8  EFLAGS: 00010202
RAX: 000000000000000e RBX: ffff88006eca9000 RCX: 0000000000000000
RDX: ffff88006fd87c40 RSI: ffffea00014ff468 RDI: ffff88007b68e000
RBP: ffff880062d7ba48 R08: 8000000020451430 R09: 0000000000000000
R10: dead000000100100 R11: 0000000000000000 R12: ffff88006fd87c40
R13: ffffea00014ff468 R14: ffff88005ad0a460 R15: 0000000000000000
FS:  00007f25d50ea760(0000) GS:ffff88007fd80000(0000)
	knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000d05e48 CR3: 0000000062c72000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 1933, threadinfo ffff880062d7a000,
	task ffff880070b44500)
Stack:
ffff880062d7ba38 ffff88005ad0a508 0000000000001000 0000000000000000
8000000020451430 ffffea00014ff468 ffff880062d7ba48 ffff88005ad0a460
ffff880062d7bad8 ffffea00014ff468 ffff88006fd87c40 0000000000000000
Call Trace:
[<ffffffffa0088fee>] logfs_write_i0+0x12e/0x190 [logfs]
[<ffffffffa0089360>] __logfs_write_rec+0x140/0x220 [logfs]
[<ffffffffa0089312>] __logfs_write_rec+0xf2/0x220 [logfs]
[<ffffffffa00894a4>] logfs_write_rec+0x64/0xd0 [logfs]
[<ffffffffa0089616>] __logfs_write_buf+0x106/0x110 [logfs]
[<ffffffffa008a19e>] logfs_write_buf+0x4e/0x80 [logfs]
[<ffffffffa008a6b8>] __logfs_write_inode+0x98/0x110 [logfs]
[<ffffffffa008a7c4>] logfs_truncate+0x54/0x290 [logfs]
[<ffffffffa008abfc>] logfs_evict_inode+0xdc/0x190 [logfs]
[<ffffffff8115eef5>] evict+0x85/0x170
[<ffffffff8115f126>] iput+0xe6/0x1b0
[<ffffffff8115b4a8>] shrink_dcache_for_umount_subtree+0x218/0x280
[<ffffffff8115ce91>] shrink_dcache_for_umount+0x51/0x90
[<ffffffff8114796c>] generic_shutdown_super+0x2c/0x100
[<ffffffffa008cc47>] logfs_kill_sb+0x57/0xf0 [logfs]
[<ffffffff81147de5>] deactivate_locked_super+0x45/0x70
[<ffffffff811487ea>] deactivate_super+0x4a/0x70
[<ffffffff81163934>] mntput_no_expire+0xa4/0xf0
[<ffffffff8116469f>] sys_umount+0x6f/0x380
[<ffffffff814dd46b>] system_call_fastpath+0x16/0x1b
Code: 55 c8 49 8d b6 a8 00 00 00 45 89 f9 45 89 e8 4c 89 e1 4c 89 55
b8 c7 04 24 00 00 00 00 e8 68 fc ff ff 4c 8b 55 b8 e9 3c ff ff ff <0f>
0b 0f 0b c7 45 c0 00 00 00 00 e9 44 fe ff ff 66 66 66 66 66
RIP  [<ffffffffa008c841>] logfs_segment_write+0x211/0x230 [logfs]
RSP <ffff880062d7b9e8>
---[ end trace fe6b040cea952290 ]---

Therefore, move super->s_flags setting after the fs-indenpendent work
has been finished.

Reviewed-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:37:47 +05:30
Prasad Joshi
13ced29cb2 logfs: take write mutex lock during fsync and sync
LogFS uses super->s_write_mutex while writing data to disk. Taking the
same mutex lock in sync and fsync code path solves the following BUG:

------------[ cut here ]------------
kernel BUG at /home/prasad/logfs/dev_bdev.c:134!

Pid: 2387, comm: flush-253:16 Not tainted 3.0.0+ #4 Bochs Bochs
RIP: 0010:[<ffffffffa007deed>]  [<ffffffffa007deed>]
                bdev_writeseg+0x25d/0x270 [logfs]
Call Trace:
[<ffffffffa007c381>] logfs_open_area+0x91/0x150 [logfs]
[<ffffffff8128dcb2>] ? find_level.clone.9+0x62/0x100
[<ffffffffa007c49c>] __logfs_segment_write.clone.20+0x5c/0x190 [logfs]
[<ffffffff810ef005>] ? mempool_kmalloc+0x15/0x20
[<ffffffff810ef383>] ? mempool_alloc+0x53/0x130
[<ffffffffa007c7a4>] logfs_segment_write+0x1d4/0x230 [logfs]
[<ffffffffa0078f8e>] logfs_write_i0+0x12e/0x190 [logfs]
[<ffffffffa0079300>] __logfs_write_rec+0x140/0x220 [logfs]
[<ffffffffa0079444>] logfs_write_rec+0x64/0xd0 [logfs]
[<ffffffffa00795b6>] __logfs_write_buf+0x106/0x110 [logfs]
[<ffffffffa007a13e>] logfs_write_buf+0x4e/0x80 [logfs]
[<ffffffffa0073e33>] __logfs_writepage+0x23/0x80 [logfs]
[<ffffffffa007410c>] logfs_writepage+0xdc/0x110 [logfs]
[<ffffffff810f5ba7>] __writepage+0x17/0x40
[<ffffffff810f6208>] write_cache_pages+0x208/0x4f0
[<ffffffff810f5b90>] ? set_page_dirty+0x70/0x70
[<ffffffff810f653a>] generic_writepages+0x4a/0x70
[<ffffffff810f75d1>] do_writepages+0x21/0x40
[<ffffffff8116b9d1>] writeback_single_inode+0x101/0x250
[<ffffffff8116bdbd>] writeback_sb_inodes+0xed/0x1c0
[<ffffffff8116c5fb>] writeback_inodes_wb+0x7b/0x1e0
[<ffffffff8116cc23>] wb_writeback+0x4c3/0x530
[<ffffffff814d984d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff8116cd6b>] wb_do_writeback+0xdb/0x290
[<ffffffff814d984d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff814d6208>] ? _raw_spin_unlock_irqrestore+0x18/0x40
[<ffffffff8105aa5a>] ? del_timer+0x8a/0x120
[<ffffffff8116cfac>] bdi_writeback_thread+0x8c/0x2e0
[<ffffffff8116cf20>] ? wb_do_writeback+0x290/0x290
[<ffffffff8106d2e6>] kthread+0x96/0xa0
[<ffffffff814de514>] kernel_thread_helper+0x4/0x10
[<ffffffff8106d250>] ? kthread_worker_fn+0x190/0x190
[<ffffffff814de510>] ? gs_change+0xb/0xb
RIP  [<ffffffffa007deed>] bdev_writeseg+0x25d/0x270 [logfs]
---[ end trace 0211ad60a57657c4 ]---

Reviewed-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:36:06 +05:30
Joern Engel
934eed395d logfs: Prevent memory corruption
This is a bad one.  I wonder whether we were so far protected by
no_free_segments(sb) usually being smaller than LOGFS_NO_AREAS.

Found by Dan Carpenter <dan.carpenter@oracle.com> using smatch.

Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:24:21 +05:30
Prasad Joshi
96150606e2 logfs: update page reference count for pined pages
LogFS sets PG_private flag to indicate a pined page. We assumed that
marking a page as private is enough to ensure its existence. But
instead it is necessary to hold a reference count to the page.

The change resolves the following BUG

BUG: Bad page state in process flush-253:16  pfn:6a6d0
page flags: 0x100000000000808(uptodate|private)

Suggested-and-Acked-by: Joern Engel <joern@logfs.org>
Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
2012-01-28 11:23:10 +05:30
Chris Mason
9998eb7034 Btrfs: fix reservations in btrfs_page_mkwrite
Josef fixed btrfs_page_mkwrite to properly release reserved
extents if there was an error.  But if we fail to get a reservation
and we fail to dirty the inode (for ENOSPC reasons), we'll end up
trying to release a reservation we never had.

This makes sure we only release if we were able to reserve.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 10:44:44 -05:00
Josef Bacik
9b23062840 Btrfs: advance window_start if we're using a bitmap
If we span a long area in a bitmap we could end up taking a lot of time
searching to the next free area if we're searching from the original
window_start, so advance window_start in order to make sure we don't do any
superficial searching.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
David Sterba
0c4e538bcc btrfs: mask out gfp flags in releasepage
btree_releasepage is a callback and can be passed unknown gfp flags and then
they may end up in kmem_cache_alloc called from alloc_extent_state, slab
allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.

This may happen when btrfs is mounted from a loop device, which masks out
__GFP_IO flag. The check in try_release_extent_state

3399                 if ((mask & GFP_NOFS) == GFP_NOFS)
3400                         mask = GFP_NOFS;

will not work and passes unfiltered flags further resulting in crash at
mm/slab.c:2963

 [<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
 [<000000000024c810>] kmem_cache_alloc+0x204/0x294
 [<00000000001fd3c2>] mempool_alloc+0x52/0x170
 [<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
 [<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
 [<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
 [<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
 [<0000000000210d84>] shrink_page_list+0x6a0/0x724
 [<0000000000211394>] shrink_inactive_list+0x230/0x578
 [<0000000000211bb8>] shrink_list+0x6c/0x120
 [<0000000000211e4e>] shrink_zone+0x1e2/0x228
 [<0000000000211f24>] shrink_zones+0x90/0x254
 [<0000000000213410>] do_try_to_free_pages+0xac/0x420
 [<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
 [<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
 [<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Miao Xie
9e622d6bea Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.

Reproduce steps:
 # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
 # mount <test partition> /mnt
 # ulimit -n 102400
 # cd /mnt
 # sysbench --num-threads=1 --test=fileio --file-num=81920 \
 > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
 > --file-test-mode=seqwr prepare
 # sysbench --num-threads=1 --test=fileio --file-num=81920 \
 > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
 > --file-test-mode=seqwr run
 <soon later, BUG_ON() was triggered by enospc error>

The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.

One is
	if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.

The other is
	if (space_info->force_alloc)
		force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.

Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Liu Bo
7ec31b548a Btrfs: do not defrag a file partially
xfstests 218 complains that btrfs defrags a file partially:
 After: 1
 Write backwards sync, but contiguous - should defrag to 1 extent
 Before: 10
-After: 1
+After: 2

To fix this, we need to set max_to_defrag count properly.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Stefan Behrens
0b485143d8 Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
There have been 4 warnings on 32-bit build, they are herewith fixed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Josef Bacik
0b4a9d248f Btrfs: use cluster->window_start when allocating from a cluster bitmap
We specifically set window_start in the cluster struct to indicate where the
cluster starts in a bitmap, but we've been using min_start to indicate where
we're searching from.  This is usually the start of the blockgroup, so
essentially means we're constantly searching from the start of any bitmap we
find, which completely negates all the trouble we go to in order to setup a
cluster.  So start using window_start to make sure we actually use the area we
found.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Mitch Harder
8bedd51b61 Btrfs: Check for NULL page in extent_range_uptodate
A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors.  The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).

After testing this patch, the user reported that the error with
the NULL pointer oops was solved.  However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c

       for (i = start_i; i < num_pages; i++) {
               page = extent_buffer_page(eb, i);
               wait_on_page_locked(page);
               if (!PageUptodate(page))
                       ret = -EIO;
       }

This patch leaves the issue with the locked page yet to be resolved.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Jan Kara
6dd70ce4eb btrfs: Fix busyloops in transaction waiting code
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Josef Bacik
357b9784b7 Btrfs: make sure a bitmap has enough bytes
We have only been checking for min_bytes available in bitmap entries, but we
won't successfully setup a bitmap cluster unless it has at least bytes in the
bitmap, so in the common case min_bytes is 4k and we want something like 2MB, so
if there are a bunch of bitmap entries with less than 2mb's in them, we'll
search all them anyway, which is suboptimal.  Fix this check.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Jan Schmidt
b1375d64c5 Btrfs: fix uninit warning in backref.c
Added initialization with the declaration of ret. It isn't set later on the
switch-default branch (which should never be taken).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Ludwig Nussel
d6e486868c debugfs: add mode, uid and gid options
Cautious admins may want to restrict access to debugfs. Currently a
manual chown/chmod e.g. in an init script is needed to achieve that.
Distributions that want to make the mount options configurable need
to add extra config files. By allowing to set the root inode's uid,
gid and mode via mount options no such hacks are needed anymore.
Instead configuration becomes straight forward via fstab.

Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 11:28:49 -08:00
Linus Torvalds
aaad641ead Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
Quoth Ben Myers:
 "Please pull in the following bugfix for xfs.  We forgot to drop a lock on
  error in xfs_readlink.  It hasn't been through -next yet, but there is no
  -next tree tomorrow.  The fix is clear so I'm sending this request today."

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: Fix missing xfs_iunlock() on error recovery path in xfs_readlink()
2012-01-25 15:36:44 -08:00
Li Wang
1589cb1a94 eCryptfs: move misleading function comments
The data encryption was moved from ecryptfs_write_end into
ecryptfs_writepage, this patch moves the corresponding function
comments to be consistent with the modification.

Signed-off-by: Li Wang <liwang@nudt.edu.cn>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-25 15:10:53 -08:00
Linus Torvalds
3074c0350b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Says Tyler:
 "Tim's logging message update will be really helpful to users when
  they're trying to locate a problematic file in the lower filesystem
  with filename encryption enabled.

  You'll recognize the fix from Li, as you commented on that.

  You should also be familiar with my setattr/truncate improvements,
  since you were the one that pointed them out to us (thanks again!).
  Andrew noted the /dev/ecryptfs write count sanitization needed to be
  improved, so I've got a fix in there for that along with some other
  less important cleanups of the /dev/ecryptfs read/write code."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: Fix oops when printing debug info in extent crypto functions
  eCryptfs: Remove unused ecryptfs_read()
  eCryptfs: Check inode changes in setattr
  eCryptfs: Make truncate path killable
  eCryptfs: Infinite loop due to overflow in ecryptfs_write()
  eCryptfs: Replace miscdev read/write magic numbers
  eCryptfs: Report errors in writes to /dev/ecryptfs
  eCryptfs: Sanitize write counts of /dev/ecryptfs
  ecryptfs: Remove unnecessary variable initialization
  ecryptfs: Improve metadata read failure logging
  MAINTAINERS: Update eCryptfs maintainer address
2012-01-25 15:03:04 -08:00
Tyler Hicks
58ded24f0f eCryptfs: Fix oops when printing debug info in extent crypto functions
If pages passed to the eCryptfs extent-based crypto functions are not
mapped and the module parameter ecryptfs_verbosity=1 was specified at
loading time, a NULL pointer dereference will occur.

Note that this wouldn't happen on a production system, as you wouldn't
pass ecryptfs_verbosity=1 on a production system. It leaks private
information to the system logs and is for debugging only.

The debugging info printed in these messages is no longer very useful
and rather than doing a kmap() in these debugging paths, it will be
better to simply remove the debugging paths completely.

https://launchpad.net/bugs/913651

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Daniel DeFreez
Cc: <stable@vger.kernel.org>
2012-01-25 14:43:42 -06:00
Tyler Hicks
f2cb933501 eCryptfs: Remove unused ecryptfs_read()
ecryptfs_read() has been ifdef'ed out for years now and it was
apparently unused before then. It is time to get rid of it for good.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:41 -06:00
Tyler Hicks
a261a03904 eCryptfs: Check inode changes in setattr
Most filesystems call inode_change_ok() very early in ->setattr(), but
eCryptfs didn't call it at all. It allowed the lower filesystem to make
the call in its ->setattr() function. Then, eCryptfs would copy the
appropriate inode attributes from the lower inode to the eCryptfs inode.

This patch changes that and actually calls inode_change_ok() on the
eCryptfs inode, fairly early in ecryptfs_setattr(). Ideally, the call
would happen earlier in ecryptfs_setattr(), but there are some possible
inode initialization steps that must happen first.

Since the call was already being made on the lower inode, the change in
functionality should be minimal, except for the case of a file extending
truncate call. In that case, inode_newsize_ok() was never being
called on the eCryptfs inode. Rather than inode_newsize_ok() catching
maximum file size errors early on, eCryptfs would encrypt zeroed pages
and write them to the lower filesystem until the lower filesystem's
write path caught the error in generic_write_checks(). This patch
introduces a new function, called ecryptfs_inode_newsize_ok(), which
checks if the new lower file size is within the appropriate limits when
the truncate operation will be growing the lower file.

In summary this change prevents eCryptfs truncate operations (and the
resulting page encryptions), which would exceed the lower filesystem
limits or FSIZE rlimits, from ever starting.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Li Wang <liwang@nudt.edu.cn>
Cc: <stable@vger.kernel.org>
2012-01-25 14:43:41 -06:00
Tyler Hicks
5e6f0d7690 eCryptfs: Make truncate path killable
ecryptfs_write() handles the truncation of eCryptfs inodes. It grabs a
page, zeroes out the appropriate portions, and then encrypts the page
before writing it to the lower filesystem. It was unkillable and due to
the lack of sparse file support could result in tying up a large portion
of system resources, while encrypting pages of zeros, with no way for
the truncate operation to be stopped from userspace.

This patch adds the ability for ecryptfs_write() to detect a pending
fatal signal and return as gracefully as possible. The intent is to
leave the lower file in a useable state, while still allowing a user to
break out of the encryption loop. If a pending fatal signal is detected,
the eCryptfs inode size is updated to reflect the modified inode size
and then -EINTR is returned.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: <stable@vger.kernel.org>
2012-01-25 14:43:40 -06:00
Li Wang
684a3ff7e6 eCryptfs: Infinite loop due to overflow in ecryptfs_write()
ecryptfs_write() can enter an infinite loop when truncating a file to a
size larger than 4G. This only happens on architectures where size_t is
represented by 32 bits.

This was caused by a size_t overflow due to it incorrectly being used to
store the result of a calculation which uses potentially large values of
type loff_t.

[tyhicks@canonical.com: rewrite subject and commit message]
Signed-off-by: Li Wang <liwang@nudt.edu.cn>
Signed-off-by: Yunchuan Wen <wenyunchuan@kylinos.com.cn>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:40 -06:00
Tyler Hicks
48399c0b0e eCryptfs: Replace miscdev read/write magic numbers
ecryptfs_miscdev_read() and ecryptfs_miscdev_write() contained many
magic numbers for specifying packet header field sizes and offsets. This
patch defines those values and replaces the magic values.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:40 -06:00
Tyler Hicks
7f13350424 eCryptfs: Report errors in writes to /dev/ecryptfs
Errors in writes to /dev/ecryptfs were being incorrectly reported by
returning 0 or the value of the original write count.

This patch clears up the return code assignment in error paths.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:39 -06:00
Tyler Hicks
db10e55651 eCryptfs: Sanitize write counts of /dev/ecryptfs
A malicious count value specified when writing to /dev/ecryptfs may
result in a a very large kernel memory allocation.

This patch peeks at the specified packet payload size, adds that to the
size of the packet headers and compares the result with the write count
value. The resulting maximum memory allocation size is approximately 532
bytes.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: <stable@vger.kernel.org>
2012-01-25 14:43:39 -06:00
Tim Gardner
bb4503615d ecryptfs: Remove unnecessary variable initialization
Removes unneeded variable initialization in ecryptfs_read_metadata(). Also adds
a small comment to help explain metadata reading logic.

[tyhicks@canonical.com: Pulled out of for-stable patch and wrote commit msg]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:38 -06:00
Tim Gardner
30373dc0c8 ecryptfs: Improve metadata read failure logging
Print inode on metadata read failure. The only real
way of dealing with metadata read failures is to delete
the underlying file system file. Having the inode
allows one to 'find . -inum INODE`.

[tyhicks@canonical.com: Removed some minor not-for-stable parts]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-01-25 14:43:38 -06:00
Jan Kara
9b025eb3a8 xfs: Fix missing xfs_iunlock() on error recovery path in xfs_readlink()
Commit b52a360b forgot to call xfs_iunlock() when it detected corrupted
symplink and bailed out. Fix it by jumping to 'out' instead of doing return.

CC: stable@kernel.org
CC: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Alex Elder <elder@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-25 11:01:31 -06:00
Eric W. Biederman
fea478d410 sysctl: Add register_sysctl for normal sysctl users
The plan is to convert all callers of register_sysctl_table
and register_sysctl_paths to register_sysctl.  The interface
to register_sysctl is enough nicer this should make the callers
a bit more readable.  Additionally after the conversion the
230 lines of backwards compatibility can be removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
ac13ac6f4c sysctl: Index sysctl directories with rbtrees.
One of the most important jobs of sysctl is to export network stack
tunables.  Several of those tunables are per network device.  In
several instances people are running with 1000+ network devices in
there network stacks, which makes the simple per directory linked list
in sysctl a scaling bottleneck.   Replace O(N^2) sysctl insertion and
lookup times with O(NlogN) by using an rbtree to index the sysctl
directories.

Benchmark before:
    make-dummies 0 999 -> 0.32s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 1m17s
    rmmod dummy         -> 17s

Benchmark after:
    make-dummies 0 999 -> 0.074s
    rmmod dummy        -> 0.070s
    make-dummies 0 9999 -> 3.4s
    rmmod dummy         -> 0.44s

Benchmark after (without dev_snmp6):
    make-dummies 0 9999 -> 0.75s
    rmmod dummy         -> 0.44s
    make-dummies 0 99999 -> 11s
    rmmod dummy          -> 4.3s

At 10,000 dummy devices the bottleneck becomes the time to add and
remove the files under /proc/sys/net/dev_snmp6.  I have commented
out the code that adds and removes files under /proc/sys/net/dev_snmp6
and taken measurments of creating and destroying 100,000 dummies to
verify the sysctl continues to scale.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
9e3d47df35 sysctl: Make the header lists per directory.
Slightly enhance efficiency and clarity of the code by making the
header list per directory instead of per set.

Benchmark before:
    make-dummies 0 999 -> 0.63s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 2m35s
    rmmod dummy         -> 18s

Benchmark after:
    make-dummies 0 999 -> 0.32s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 1m17s
    rmmod dummy         -> 17s

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
e54012cede sysctl: Move sysctl_check_dups into insert_header
Simplify the callers of insert_header by removing explicit calls to check
for duplicates and instead have insert_header do the work.

This makes the code slightly more maintainable by enabling changes to
data structures where the insertion of new entries without duplicate
suppression is not possible.

There is not always a convenient path string where insert_header
is called so modify sysctl_check_dups to use sysctl_print_dir
when printing the full path when a duplicate is discovered.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
60a47a2e82 sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
An nsproxy argument here has always been awkard and now the nsproxy argument
is completely unnecessary so remove it, replacing it with the set we want
the registered tables to show up in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:30 -08:00
Eric W. Biederman
0e47c99d7f sysctl: Replace root_list with links between sysctl_table_sets.
Piecing together directories by looking first in one directory
tree, than in another directory tree and finally in a third
directory tree makes it hard to verify that some directory
entries are not multiply defined and makes it hard to create
efficient implementations the sysctl filesystem.

Replace the sysctl wide list of roots with autogenerated
links from the core sysctl directory tree to the other
sysctl directory trees.

This simplifies sysctl directory reading and lookups as now
only entries in a single sysctl directory tree need to be
considered.

Benchmark before:
    make-dummies 0 999 -> 0.44s
    rmmod dummy        -> 0.065s
    make-dummies 0 9999 -> 1m36s
    rmmod dummy         -> 0.4s

Benchmark after:
    make-dummies 0 999 -> 0.63s
    rmmod dummy        -> 0.12s
    make-dummies 0 9999 -> 2m35s
    rmmod dummy         -> 18s

The slowdown is caused by the lookups used in insert_headers
and put_links to see if we need to add links or remove links.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
6980128fe1 sysctl: Add sysctl_print_dir and use it in get_subdir
When there are errors it is very nice to know the full sysctl path.
Add a simple function that computes the sysctl path and prints it
out.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
7ec66d0636 sysctl: Stop requiring explicit management of sysctl directories
Simplify the code and the sysctl semantics by autogenerating
sysctl directories when a sysctl table is registered that needs
the directories and autodeleting the directories when there are
no more sysctl tables registered that need them.

Autogenerating directories keeps sysctl tables from depending
on each other, removing all of the arcane register/unregister
ordering constraints and makes it impossible to get the order
wrong when reigsering and unregistering sysctl tables.

Autogenerating directories yields one unique entity that dentries
can point to, retaining the current effective use of the dcache.

Add struct ctl_dir as the type of these new autogenerated
directories.

The attached_by and attached_to fields in ctl_table_header are
removed as they are no longer needed.

The child field in ctl_table is no longer needed by the core of
the sysctl code.  ctl_table.child can be removed once all of the
existing users have been updated.

Benchmark before:
    make-dummies 0 999 -> 0.7s
    rmmod dummy        -> 0.07s
    make-dummies 0 9999 -> 1m10s
    rmmod dummy         -> 0.4s

Benchmark after:
    make-dummies 0 999 -> 0.44s
    rmmod dummy        -> 0.065s
    make-dummies 0 9999 -> 1m36s
    rmmod dummy         -> 0.4s

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
9eb47c26f0 sysctl: Add a root pointer to ctl_table_set
Add a ctl_table_root pointer to ctl_table set so it is easy to
go from a ctl_table_set to a ctl_table_root.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
6a75ce167c sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
Replace sysctl_head_next with first_entry and next_entry.  These new
iterators operate at the level of sysctl table entries and filter
out any sysctl tables that should not be shown.

Utilizing two specialized functions instead of a single function removes
conditionals for handling awkward special cases that only come up
at the beginning of iteration, making the iterators easier to read
and understand.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
076c3eed2c sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
Replace the helpers that proc_sys_lookup uses with helpers that work
in terms of an entire sysctl directory.  This is worse for sysctl_lock
hold times but it is much better for code clarity and the code cleanups
to come.

find_in_table is no longer needed so it is removed.

find_entry a general helper to find entries in a directory is added.

lookup_entry is a simple wrapper around find_entry that takes the
sysctl_lock increases the use count if an entry is found and drops
the sysctl_lock.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:29 -08:00
Eric W. Biederman
a194558e86 sysctl: Normalize the root_table data structure.
Every other directory has a .child member and we look at the .child
for our entries.  Do the same for the root_table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
8425d6aaf0 sysctl: Factor out insert_header and erase_header
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
e0d045290a sysctl: Factor out init_header from __register_sysctl_paths
Factor out a routing to initialize the sysctl_table_header.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
938aaa4f92 sysctl: Initial support for auto-unregistering sysctl tables.
Add nreg to ctl_table_header.  When nreg drops to 0 the ctl_table_header
will be unregistered.

Factor out drop_sysctl_table from unregister_sysctl_table, and add
the logic for decrementing nreg.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
3cc3e04636 sysctl: A more obvious version of grab_header.
Instead of relying on sysct_head_next(NULL) to magically
return the right header for the root directory instead
explicitly transform NULL into the root directories header.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
8d6ecfcc01 sysctl: Remove the now unused ctl_table parent field.
While useful at one time for selinux and the sysctl sanity
checks those users no longer use the parent field and we can
safely remove it.

Inspired-by: Lucian Adrian Grijincu <lucian.grijincu@gmil.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:28 -08:00
Eric W. Biederman
7c60c48f58 sysctl: Improve the sysctl sanity checks
- Stop validating subdirectories now that we only register leaf tables

- Cleanup and improve the duplicate filename check.
  * Run the duplicate filename check under the sysctl_lock to guarantee
    we never add duplicate names.
  * Reduce the duplicate filename check to nearly O(M*N) where M is the
    number of entries in tthe table we are registering and N is the
    number of entries in the directory before we got there.

- Move the duplicate filename check into it's own function and call
  it directtly from __register_sysctl_table

- Kill the config option as the sanity checks are now cheap enough
  the config option is unnecessary. The original reason for the config
  option was because we had a huge table used to verify the proc filename
  to binary sysctl mapping.  That table has now evolved into the binary_sysctl
  translation layer and is no longer part of the sysctl_check code.

- Tighten up the permission checks.  Guarnateeing that files only have read
  or write permissions.

- Removed redudant check for parents having a procname as now everything has
  a procname.

- Generalize the backtrace logic so that we print a backtrace from
  any failure of __register_sysctl_table that was not caused by
  a memmory allocation failure.  The backtrace allows us to track
  down who erroneously registered a sysctl table.

Bechmark before (CONFIG_SYSCTL_CHECK=y):
    make-dummies 0 999 -> 12s
    rmmod dummy        -> 0.08s

Bechmark before (CONFIG_SYSCTL_CHECK=n):
    make-dummies 0 999 -> 0.7s
    rmmod dummy        -> 0.06s
    make-dummies 0 99999 -> 1m13s
    rmmod dummy          -> 0.38s

Benchmark after:
    make-dummies 0 999 -> 0.65s
    rmmod dummy        -> 0.055s
    make-dummies 0 9999 -> 1m10s
    rmmod dummy         -> 0.39s

The sysctl sanity checks now impose no measurable cost.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:27 -08:00
Eric W. Biederman
f728019bb7 sysctl: register only tables of sysctl files
Split the registration of a complex ctl_table array which may have
arbitrary numbers of directories (->child != NULL) and tables of files
into a series of simpler registrations that only register tables of files.

Graphically:

   register('dir', { + file-a
                     + file-b
                     + subdir1
                       + file-c
                     + subdir2
                       + file-d
                       + file-e })

is transformed into:
   wrapper->subheaders[0] = register('dir', {file1-a, file1-b})
   wrapper->subheaders[1] = register('dir/subdir1', {file-c})
   wrapper->subheaders[2] = register('dir/subdir2', {file-d, file-e})
   return wrapper

This guarantees that __register_sysctl_table will only see a simple
ctl_table array with all entries having (->child == NULL).

Care was taken to pass the original simple ctl_table arrays to
__register_sysctl_table whenever possible.

This change is derived from a similar patch written
by Lucrian Grijincu.

Inspired-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:40:27 -08:00
Eric W. Biederman
ec6a52668d sysctl: Add ctl_table chains into cstring paths
For any component of table passed to __register_sysctl_paths
that actually serves as a path, add that to the cstring path
that is passed to __register_sysctl_table.

The result is that for most calls to __register_sysctl_paths
we only pass a table to __register_sysctl_table that contains
no child directories.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
6e9d516415 sysctl: Add support for register sysctl tables with a normal cstring path.
Make __register_sysctl_table the core sysctl registration operation and
make it take a char * string as path.

Now that binary paths have been banished into the real of backwards
compatibility in kernel/binary_sysctl.c where they can be safely
ignored there is no longer a need to use struct ctl_path to represent
path names when registering ctl_tables.

Start the transition to using normal char * strings to represent
pathnames when registering sysctl tables.  Normal strings are easier
to deal with both in the internal sysctl implementation and for
programmers registering sysctl tables.

__register_sysctl_paths is turned into a backwards compatibility wrapper
that converts a ctl_path array into a normal char * string.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
f05e53a7fb sysctl: Create local copies of directory names used in paths
Creating local copies of directory names is a good idea for
two reasons.
- The dynamic names used by callers must be copied into new
  strings by the callers today to ensure the strings do not
  change between register and unregister of the sysctl table.

- Sysctl directories have a potentially different lifetime
  than the time between register and unregister of any
  particular sysctl table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
bd295b56cf sysctl: Remove the unnecessary sysctl_set parent concept.
In sysctl_net register the two networking roots in the proper order.

In register_sysctl walk the sysctl sets in the reverse order of the
sysctl roots.

Remove parent from ctl_table_set and setup_sysctl_set as it is no
longer needed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
97324cd804 sysctl: Implement retire_sysctl_set
This adds a small helper retire_sysctl_set to remove the intimate knowledge about
the how a sysctl_set is implemented from net/sysct_net.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
a15e20982e sysctl: Make the directories have nlink == 1
I goofed when I made sysctl directories have nlink == 0.
nlink == 0 means the directory has been deleted.
nlink == 1 meands a directory does not count subdirectories.

Use the default nlink == 1 for sysctl directories.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:55 -08:00
Eric W. Biederman
1f87f0b52b sysctl: Move the implementation into fs/proc/proc_sysctl.c
Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
into fs/proc/proc_sysctl.c.

Currently sysctl maintenance is hampered by the sysctl implementation
being split across 3 files with artificial layering between them.
Consolidate the entire sysctl implementation into 1 file so that
it is easier to see what is going on and hopefully allowing for
simpler maintenance.

For functions that are now only used in fs/proc/proc_sysctl.c remove
their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Eric W. Biederman
de4e83bd6b sysctl: Register the base sysctl table like any other sysctl table.
Simplify the code by treating the base sysctl table like any other
sysctl table and register it with register_sysctl_table.

To ensure this table is registered early enough to avoid problems
call sysctl_init from proc_sys_init.

Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
static.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Lucas De Marchi
36885d7b11 sysctl: remove impossible condition check
Remove checks for conditions that will never happen. If procname is NULL
the loop would already had bailed out, so there's no need to check it
again.

At the same time this also compacts the function find_in_table() by
refactoring it to be easier to read.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Vitaly Kuznetsov
c56d8a7362 sysfs: change permissions for /sys from 0755 to 0555
There is a misleading difference between /proc and /sys permissions, /proc is 0555 and /sys is 0755. But
as it is impossible to create or unlink something in /sys it would be nice to have same permissions.

Signed-off-by: Vitaly Kuznetsov <vitty@altlinux.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 15:57:53 -08:00
Konstantin Khlebnikov
e9aba5158a tty: rework pty count limiting
After adding devpts multiple-insrances sysctl kernel.pty.max limit pty count for
each devpts instance independently, while kernel.pty.nr shows total pty count.

This patch restores sysctl kernel.pty.max as global limit (4096 by default),
adds pty reseve for main devpts (mounted without "newinstance" argument),
and new sysctl to tune it: kernel.pty.reserve (1024 by default)

Also it adds devpts mount option "max=%d" to limit pty count for each devpts
instance independently. (by default NR_UNIX98_PTY_MAX == 2^20)

Thus devpts instances in containers cannot eat up all available pty even if we didn't
set any limits, while with "max" argument we can adjust limits more precisely.

Plus, now open("/dev/ptmx") return -ENOSPC in case lack of pty indexes,
this is more informative than -EIO.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 14:01:01 -08:00
Konstantin Khlebnikov
a4834c102f tty: move pty count limiting into devpts
Let's move this stuff to the better place, where we can account pty right in
tty-indexes managing code.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 14:00:41 -08:00
Eric W. Biederman
524b6c5b39 sysfs: Kill nlink counting.
Tracking the number of subdirectories requires an extra field that increases
the size of sysfs_dirent.  nlinks are not particularly interesting for sysfs
and the nlink counts are wrong when network namespaces are involved so stop
counting them, and always return nlink == 1.  Userspace already knows that
directories with nlink == 1 have an nlink count they can't use to count
subdirectories.

This reduces the size of sysfs_dirent by 8 bytes on 64bit platforms.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 12:41:46 -08:00
Eric W. Biederman
cafa6b5dd7 sysfs: Store the sysfs inode in an unsigned int.
Store the sysfs inode number in an unsided int because
ida inode allocator can return at most a 31 bit number,
reducing the size of struct sysfs_dirent by 8 bytes
on 64bit platforms.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 12:41:46 -08:00
Eric W. Biederman
15a3382451 sysfs: Reduce s_flags to an unsinged short so it packs well with s_mode
On 32bit this reduces sizeof(struct sysfs_dirent) by 2 bytes.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 12:41:00 -08:00
Eric W. Biederman
4e4d6d860b sysfs: Add s_hash to sysfs_dirent and order directory entries by hash
Compute a 31 bit hash of directory entries (that can fit in a signed
32bit off_t) and index the sysfs directory entries by that hash,
replacing the per directory indexes by name and by inode.  Because we
now only use a single rbtree this reduces the size of sysfs_dirent by 2
pointers.  Because we have fewer cases to deal with the code is now
simpler.

For now I use the simple hash that the dcache uses as that is easy to
use and seems simple enough.

In addition to makeing the code simpler using a hash for the file
position in readdir brings sysfs in line with other filesystems that
have non-trivial directory structures.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 12:36:17 -08:00
Linus Torvalds
d2346963bf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Pass information that quota is stored in system file to userspace
  ext2: protect inode changes in the SETVERSION and SETFLAGS ioctls
  jbd: Issue cache flush after checkpointing
2012-01-24 12:12:40 -08:00
Eric W. Biederman
ce59791936 sysfs: Complain bitterly about attempts to remove files from nonexistent directories.
Recently an OOPS was observed from the usb serial io_ti driver when it tried to remove
sysfs directories.  Upon investigation it turns out this driver was always buggy
and that a recent sysfs change had stopped guarding itself against removing attributes
from sysfs directories that had already been removed. :(

Historically we have been silent about attempting to files from nonexistent sysfs
directories and have politely returned error codes.  That has resulted in people writing
broken code that ignores the error codes.

Issue a kernel WARNING and a stack backtrace to make it clear in no uncertain
terms that abusing sysfs is not ok, and the callers need to fix their code.

This change transforms the io_ti OOPS into a more comprehensible error message
and stack backtrace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Wolfgang Frisch <wfpub@roembden.net>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 12:12:32 -08:00
Randy Dunlap
0863b04d15 kernel-doc: fix new warnings in debugfs
Fix new kernel-doc warnings:

Warning(fs/debugfs/file.c:556): No description found for parameter 'nregs'
Warning(fs/debugfs/file.c:556): Excess function parameter 'mregs' description in 'debugfs_print_regs32'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-24 10:47:41 -08:00
Dan Carpenter
803ab97761 cifs: NULL dereference on allocation failure
We should just return directly here, the goto causes a NULL dereference.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-24 10:37:19 -06:00
Dan Magenheimer
3167760f83 mm: cleancache: s/flush/invalidate/
Per akpm suggestions alter the use of the term flush to be
invalidate. The next patch will do this across all MM.

This change is completely cosmetic.

[v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jan Beulich <JBeulich@novell.com>
Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
[v10: Fixed  fs: move code out of buffer.c conflict change]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-01-23 16:06:24 -05:00
Linus Torvalds
a99cbf6b43 Merge branch 'kernel-doc' from Randy Dunlap
The usual kernel-doc fixups from Randy.  Some of them David acked as
merged in his tree, this is the random left-overs.

* kernel-doc:
  docbook: fix sched source file names in device-drivers book
  docbook: change iomap source filename in deviceiobook
  docbook: don't use serial_core.h in device-drivers book
  kernel-doc: fix kernel-doc warnings in sched
  kernel-doc: fix new warnings in cfg80211.h
  kernel-doc: fix new warning in usb.h
  kernel-doc: fix new warnings in device.h
  kernel-doc: fix new warnings in debugfs
  kernel-doc: fix new warning in regulator core
  kernel-doc: fix new warnings in pci
  kernel-doc: fix new warnings in driver-core
  kernel-doc: fix new warnings in auditsc.c
  scripts/kernel-doc: fix fatal error caused by cfg80211.h
2012-01-23 10:08:08 -08:00
Linus Torvalds
4f57d865f1 Merge branch 'akpm'
Quoth Andrew:
  "Random fixes.  And a simple new LED driver which I'm trying to sneak
   in while you're not looking."

Sneaking successful.

* akpm:
  score: fix off-by-one index into syscall table
  mm: fix rss count leakage during migration
  SHM_UNLOCK: fix Unevictable pages stranded after swap
  SHM_UNLOCK: fix long unpreemptible section
  kdump: define KEXEC_NOTE_BYTES arch specific for s390x
  mm/hugetlb.c: undo change to page mapcount in fault handler
  mm: memcg: update the correct soft limit tree during migration
  proc: clear_refs: do not clear reserved pages
  drivers/video/backlight/l4f00242t03.c: return proper error in l4f00242t03_probe if regulator_get() fails
  drivers/video/backlight/adp88x0_bl.c: fix bit testing logic
  kprobes: initialize before using a hlist
  ipc/mqueue: simplify reading msgqueue limit
  leds: add led driver for Bachmann's ot200
  mm: __count_immobile_pages(): make sure the node is online
  mm: fix NULL ptr dereference in __count_immobile_pages
  mm: fix warnings regarding enum migrate_mode
2012-01-23 09:27:54 -08:00
Linus Torvalds
7908b3ef68 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Rename *UCS* functions to *UTF16*
  [CIFS] ACL and FSCACHE support no longer EXPERIMENTAL
  [CIFS] Fix build break with multiuser patch when LANMAN disabled
  cifs: warn about impending deprecation of legacy MultiuserMount code
  cifs: fetch credentials out of keyring for non-krb5 auth multiuser mounts
  cifs: sanitize username handling
  keys: add a "logon" key type
  cifs: lower default wsize when unix extensions are not used
  cifs: better instrumentation for coalesce_t2
  cifs: integer overflow in parse_dacl()
  cifs: Fix sparse warning when calling cifs_strtoUCS
  CIFS: Add descriptions to the brlock cache functions
2012-01-23 08:59:49 -08:00
Randy Dunlap
b5763accd3 kernel-doc: fix new warnings in debugfs
Fix new kernel-doc warnings:

Warning(fs/debugfs/file.c:556): No description found for parameter 'nregs'
Warning(fs/debugfs/file.c:556): Excess function parameter 'mregs' description in 'debugfs_print_regs32'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:44:53 -08:00
Will Deacon
85e72aa538 proc: clear_refs: do not clear reserved pages
/proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
pages and corresponding page table entries of the task with PID pid, which
includes any special mappings inserted into the page tables in order to
provide things like vDSOs and user helper functions.

On ARM this causes a problem because the vectors page is mapped as a
global mapping and since ec706dab ("ARM: add a vma entry for the user
accessible vector page"), a VMA is also inserted into each task for this
page to aid unwinding through signals and syscall restarts.  Since the
vectors page is required for handling faults, clearing the YOUNG bit (and
subsequently writing a faulting pte) means that we lose the vectors page
*globally* and cannot fault it back in.  This results in a system deadlock
on the next exception.

To see this problem in action, just run:

	$ echo 1 > /proc/self/clear_refs

on an ARM platform (as any user) and watch your system hang.  I think this
has been the case since 2.6.37

This patch avoids clearing the aforementioned bits for reserved pages,
therefore leaving the vectors page intact on ARM.  Since reserved pages
are not candidates for swap, this change should not have any impact on the
usefulness of clear_refs.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Reported-by: Moussa Ba <moussaba@micron.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Nicolas Pitre <nico@linaro.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: <stable@vger.kernel.org>		[2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:38:48 -08:00
Linus Torvalds
567e47935a Merge branches 'sched-urgent-for-linus', 'perf-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/accounting, proc: Fix /proc/stat interrupts sum

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
  x86/kprobes: Add arch/x86/tools/insn_sanity to .gitignore
  x86/kprobes: Fix typo transferred from Intel manual

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
  x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
  x86, opcode: ANDN and Group 17 in x86-opcode-map.txt
  x86/kconfig: Move the ZONE_DMA entry under a menu
  x86/UV2: Add accounting for BAU strong nacks
  x86/UV2: Ack BAU interrupt earlier
  x86/UV2: Remove stale no-resources test for UV2 BAU
  x86/UV2: Work around BAU bug
  x86/UV2: Fix BAU destination timeout initialization
  x86/UV2: Fix new UV2 hardware by using native UV2 broadcast mode
  x86: Get rid of dubious one-bit signed bitfield
2012-01-19 14:53:06 -08:00
Linus Torvalds
e19c29e8d8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  qnx4: don't leak ->BitMap on late failure exits
  qnx4: reduce the insane nesting in qnx4_checkroot()
  qnx4: di_fname is an array, for crying out loud...
  vfs: remove printk from set_nlink()
  wake up s_wait_unfrozen when ->freeze_fs fails
2012-01-19 14:49:16 -08:00
Al Viro
8bc5191b26 qnx4: don't leak ->BitMap on late failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-19 13:54:36 -05:00
Al Viro
4134bf81ff qnx4: reduce the insane nesting in qnx4_checkroot()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-19 13:40:57 -05:00
Al Viro
1aab323ea5 qnx4: di_fname is an array, for crying out loud...
(struct qnx4_inode_entry *)(bh->b_data + some_offset)->di_fname
is not going to be NULL, TYVM...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-19 13:19:42 -05:00
Steve French
acbbb76a26 CIFS: Rename *UCS* functions to *UTF16*
to reflect the unicode encoding used by CIFS protocol.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@samba.org>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
2012-01-18 22:32:33 -06:00
David Howells
700920eb5b KEYS: Allow special keyrings to be cleared
The kernel contains some special internal keyrings, for instance the DNS
resolver keyring :

2a93faf1 I-----     1 perm 1f030000     0     0 keyring   .dns_resolver: empty

It would occasionally be useful to allow the contents of such keyrings to be
flushed by root (cache invalidation).

Allow a flag to be set on a keyring to mark that someone possessing the
sysadmin capability can clear the keyring, even without normal write access to
the keyring.

Set this flag on the special keyrings created by the DNS resolver, the NFS
identity mapper and the CIFS identity mapper.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2012-01-19 14:38:51 +11:00
Steve French
c56001879b [CIFS] ACL and FSCACHE support no longer EXPERIMENTAL
CIFS ACL support and FSCACHE support have been in long enough
to be no longer considered experimental.  Remove obsolete Kconfig
dependency.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2012-01-18 17:55:41 -06:00
Steve French
88a4412b79 [CIFS] Fix build break with multiuser patch when LANMAN disabled
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-01-18 17:13:47 -06:00
Jeff Layton
789b4588da cifs: warn about impending deprecation of legacy MultiuserMount code
We'll allow a grace period of 2 releases (3.3 and 3.4) and then remove
the legacy code in 3.5.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-17 22:40:31 -06:00
Jeff Layton
8a8798a5ff cifs: fetch credentials out of keyring for non-krb5 auth multiuser mounts
Fix up multiuser mounts to set the secType and set the username and
password from the key payload in the vol info for non-krb5 auth types.

Look for a key of type "secret" with a description of
"cifs🅰️<server address>" or "cifs:d:<domainname>". If that's found,
then scrape the username and password out of the key payload and use
that to create a new user session.

Finally, don't have the code enforce krb5 auth on multiuser mounts,
but do require a kernel with keys support.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-17 22:40:28 -06:00
Jeff Layton
04febabcf5 cifs: sanitize username handling
Currently, it's not very clear whether you're allowed to have a NULL
vol->username or ses->user_name. Some places check for it and some don't.

Make it clear that a NULL pointer is OK in these fields, and ensure that
all the callers check for that.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-17 22:40:26 -06:00
Jeff Layton
ce91acb3ac cifs: lower default wsize when unix extensions are not used
We've had some reports of servers (namely, the Solaris in-kernel CIFS
server) that don't deal properly with writes that are "too large" even
though they set CAP_LARGE_WRITE_ANDX. Change the default to better
mirror what windows clients do.

Cc: stable@vger.kernel.org
Cc: Pavel Shilovsky <piastry@etersoft.ru>
Reported-by: Nick Davis <phireph0x@yahoo.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-17 22:39:37 -06:00
Jeff Layton
f5fffcee27 cifs: better instrumentation for coalesce_t2
When coalesce_t2 returns an error, have it throw a cFYI message that
explains the reason. Also rename some variables to clarify what they
represent.

Reported-and-Tested-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-17 22:39:34 -06:00
Linus Torvalds
f429ee3b80 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
  audit: no leading space in audit_log_d_path prefix
  audit: treat s_id as an untrusted string
  audit: fix signedness bug in audit_log_execve_info()
  audit: comparison on interprocess fields
  audit: implement all object interfield comparisons
  audit: allow interfield comparison between gid and ogid
  audit: complex interfield comparison helper
  audit: allow interfield comparison in audit rules
  Kernel: Audit Support For The ARM Platform
  audit: do not call audit_getname on error
  audit: only allow tasks to set their loginuid if it is -1
  audit: remove task argument to audit_set_loginuid
  audit: allow audit matching on inode gid
  audit: allow matching on obj_uid
  audit: remove audit_finish_fork as it can't be called
  audit: reject entry,always rules
  audit: inline audit_free to simplify the look of generic code
  audit: drop audit_set_macxattr as it doesn't do anything
  audit: inline checks for not needing to collect aux records
  audit: drop some potentially inadvisable likely notations
  ...

Use evil merge to fix up grammar mistakes in Kconfig file.

Bad speling and horrible grammar (and copious swearing) is to be
expected, but let's keep it to commit messages and comments, rather than
expose it to users in config help texts or printouts.
2012-01-17 16:41:31 -08:00
Linus Torvalds
22b4eb5e31 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: cleanup xfs_file_aio_write
  xfs: always return with the iolock held from xfs_file_aio_write_checks
  xfs: remove the i_new_size field in struct xfs_inode
  xfs: remove the i_size field in struct xfs_inode
  xfs: replace i_pin_wait with a bit waitqueue
  xfs: replace i_flock with a sleeping bitlock
  xfs: make i_flags an unsigned long
  xfs: remove the if_ext_max field in struct xfs_ifork
  xfs: remove the unused dm_attrs structure
  xfs: cleanup xfs_iomap_eof_align_last_fsb
  xfs: remove xfs_itruncate_data
2012-01-17 15:54:56 -08:00
Linus Torvalds
d65773b22b Merge branch 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: take allocation of ->tree_root into open_ctree()
  btrfs: let ->s_fs_info point to fs_info, not root...
  btrfs: consolidate failure exits in btrfs_mount() a bit
  btrfs: make free_fs_info() call ->kill_sb() unconditional
  btrfs: merge free_fs_info() calls on fill_super failures
  btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
  btrfs: make open_ctree() return int
  btrfs: sanitizing ->fs_info, part 5
  btrfs: sanitizing ->fs_info, part 4
  btrfs: sanitizing ->fs_info, part 3
  btrfs: sanitizing ->fs_info, part 2
  btrfs: sanitizing ->fs_info, part 1
  btrfs: fix a deadlock in btrfs_scan_one_device()
  btrfs: fix mount/umount race
  btrfs: get ->kill_sb() of its own
  btrfs: preparation to fixing mount/umount race
2012-01-17 15:52:51 -08:00
Linus Torvalds
f9156c7288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: use larger system chunks
  Btrfs: add a delalloc mutex to inodes for delalloc reservations
  Btrfs: space leak tracepoints
  Btrfs: protect orphan block rsv with spin_lock
  Btrfs: add allocator tracepoints
  Btrfs: don't call btrfs_throttle in file write
  Btrfs: release space on error in page_mkwrite
  Btrfs: fix btrfsck error 400 when truncating a compressed
  Btrfs: do not use btrfs_end_transaction_throttle everywhere
  Btrfs: add balance progress reporting
  Btrfs: allow for resuming restriper after it was paused
  Btrfs: allow for canceling restriper
  Btrfs: allow for pausing restriper
  Btrfs: add skip_balance mount option
  Btrfs: recover balance on mount
  Btrfs: save balance parameters to disk
  Btrfs: soft profile changing mode (aka soft convert)
  Btrfs: implement online profile changing
  Btrfs: do not reduce profile in do_chunk_alloc()
  Btrfs: virtual address space subset filter
  ...

Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
2012-01-17 15:49:54 -08:00
Linus Torvalds
e268337dfe proc: clean up and fix /proc/<pid>/mem handling
Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
robust, and it also doesn't match the permission checking of any of the
other related files.

This changes it to do the permission checks at open time, and instead of
tracking the process, it tracks the VM at the time of the open.  That
simplifies the code a lot, but does mean that if you hold the file
descriptor open over an execve(), you'll continue to read from the _old_
VM.

That is different from our previous behavior, but much simpler.  If
somebody actually finds a load where this matters, we'll need to revert
this commit.

I suspect that nobody will ever notice - because the process mapping
addresses will also have changed as part of the execve.  So you cannot
actually usefully access the fd across a VM change simply because all
the offsets for IO would have changed too.

Reported-by: Jüri Aedla <asd@ut.ee>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-17 15:21:19 -08:00
Miklos Szeredi
424a5334a5 vfs: remove printk from set_nlink()
Don't log a message for set_nlink(0).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-17 16:39:47 -05:00
Kazuya Mio
e1616300a2 wake up s_wait_unfrozen when ->freeze_fs fails
dd slept infinitely when fsfeeze failed because of EIO.
To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
the tasks waiting for the filesystem to become unfrozen.

When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.

However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
freeze_super() returns an error number. In this case, FITHAW ioctl returns
EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-17 16:38:47 -05:00
Eric Paris
4043cde8ec audit: do not call audit_getname on error
Just a code cleanup really.  We don't need to make a function call just for
it to return on error.  This also makes the VFS function even easier to follow
and removes a conditional on a hot path.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:01 -05:00
Eric Paris
633b454545 audit: only allow tasks to set their loginuid if it is -1
At the moment we allow tasks to set their loginuid if they have
CAP_AUDIT_CONTROL.  In reality we want tasks to set the loginuid when they
log in and it be impossible to ever reset.  We had to make it mutable even
after it was once set (with the CAP) because on update and admin might have
to restart sshd.  Now sshd would get his loginuid and the next user which
logged in using ssh would not be able to set his loginuid.

Systemd has changed how userspace works and allowed us to make the kernel
work the way it should.  With systemd users (even admins) are not supposed
to restart services directly.  The system will restart the service for
them.  Thus since systemd is going to loginuid==-1, sshd would get -1, and
sshd would be allowed to set a new loginuid without special permissions.

If an admin in this system were to manually start an sshd he is inserting
himself into the system chain of trust and thus, logically, it's his
loginuid that should be used!  Since we have old systems I make this a
Kconfig option.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Eric Paris
0a300be6d5 audit: remove task argument to audit_set_loginuid
The function always deals with current.  Don't expose an option
pretending one can use it for something.  You can't.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Christoph Hellwig
d060646436 xfs: cleanup xfs_file_aio_write
With all the size field updates out of the way xfs_file_aio_write can
be further simplified by pushing all iolock handling into
xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using
the generic generic_write_sync helper for synchronous writes.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17 15:12:33 -06:00
Christoph Hellwig
5bf1f26227 xfs: always return with the iolock held from xfs_file_aio_write_checks
While xfs_iunlock is fine with 0 lockflags the calling conventions are much
cleaner if xfs_file_aio_write_checks never returns without the iolock held.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17 15:11:07 -06:00
Christoph Hellwig
2813d682e8 xfs: remove the i_new_size field in struct xfs_inode
Now that we use the VFS i_size field throughout XFS there is no need for the
i_new_size field any more given that the VFS i_size field gets updated
in ->write_end before unlocking the page, and thus is always uptodate when
writeback could see a page.  Removing i_new_size also has the advantage that
we will never have to trim back di_size during a failed buffered write,
given that it never gets updated past i_size.

Note that currently the generic direct I/O code only updates i_size after
calling our end_io handler, which requires a small workaround to make
sure di_size actually makes it to disk.  I hope to fix this properly in
the generic code.

A downside is that we lose the support for parallel non-overlapping O_DIRECT
appending writes that recently was added.  I don't think keeping the complex
and fragile i_new_size infrastructure for this is a good tradeoff - if we
really care about parallel appending writers we should investigate turning
the iolock into a range lock, which would also allow for parallel
non-overlapping buffered writers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17 15:10:19 -06:00
Christoph Hellwig
ce7ae151dd xfs: remove the i_size field in struct xfs_inode
There is no fundamental need to keep an in-memory inode size copy in the XFS
inode.  We already have the on-disk value in the dinode, and the separate
in-memory copy that we need for regular files only in the XFS inode.

Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
VFS inode i_size field for regular files.  Switch code that was directly
accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
we are limited to regular files direct access of the VFS inode i_size field.

This also allows dropping some fairly complicated code in the write path
which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
that is getting updated inside ->write_end.

Note that we do not bother resetting the VFS i_size when truncating a file
that gets freed to zero as there is no point in doing so because the VFS inode
is no longer in use at this point.  Just relax the assert in xfs_ifree to
only check the on-disk size instead.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17 15:08:53 -06:00