16174 Commits

Author SHA1 Message Date
Eric W. Biederman
832b6af198 sysfs: Propagate renames to the vfs on demand
By teaching sysfs_revalidate to hide a dentry for
a sysfs_dirent if the sysfs_dirent has been renamed,
and by teaching sysfs_lookup to return the original
dentry if the sysfs dirent has been renamed.  I can
show the results of renames correctly without having to
update the dcache during the directory rename.

This massively simplifies the rename logic allowing a lot
of weird sysfs special cases to be removed along with
a lot of now unnecesary helper code.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
a16bbc3430 sysfs: Gut sysfs_addrm_start and sysfs_addrm_finish
With lazy inode updates and dentry operations bringing everything
into sync on demand there is no longer any need to immediately
update the vfs or grab i_mutex to protect those updates as we
make changes to sysfs.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
06fc0d66f7 sysfs: In sysfs_chmod_file lazily propagate the mode change.
Now that sysfs_getattr and sysfs_permission refresh the vfs
inode there is no need to immediatly push the mode change
into the vfs cache.  Reducing the amount of work needed and
simplifying the locking.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
e61ab4ae48 sysfs: Implement sysfs_getattr & sysfs_permission
With the implementation of sysfs_getattr and sysfs_permission
sysfs becomes able to lazily propogate inode attribute changes
from the sysfs_dirents to the vfs inodes.   This paves the way
for deleting significant chunks of now unnecessary code.

While doing this we did not reference sysfs_setattr from
sysfs_symlink_inode_operations so I added along with
sysfs_getattr and sysfs_permission.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
c099aacd48 sysfs: Nicely indent sysfs_symlink_inode_operations
Lining up the functions in sysfs_symlink_inode_operations
follows the pattern in the rest of sysfs and makes things
slightly more readable.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
6b0bfe9383 sysfs: Update s_iattr on link and unlink.
Currently sysfs updates the timestamps on the vfs directory
inode when we create or remove a directory entry but doesn't
update the cached copy on the sysfs_dirent, fix that oversight.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
35df63c46c sysfs: Fix locking and factor out sysfs_sd_setattr
Cleanly separate the work that is specific to setting the
attributes of a sysfs_dirent from what is needed to update
the attributes of a vfs inode.

Additionally grab the sysfs_mutex to keep any nasties from
surprising us when updating the sysfs_dirent.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
4be3df28be sysfs: Simplify iattr time assignments
The granularity of sysfs time when we keep it is 1 ns.  Which
when passed to timestamp_trunc results in a nop.  So remove
the unnecessary function call making sysfs_setattr slightly
easier to read.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
4c6974f51a sysfs: Simplify sysfs_chmod_file semantics
Currently every caller of sysfs_chmod_file happens at either
file creation time to set a non-default mode or in response
to a specific user requested space change in policy.  Making
timestamps of when the chmod happens and notification of
a file changing mode uninteresting.

Remove the unnecessary time stamp and filesystem change
notification, and removes the last of the explicit inotify
and donitfy support from sysfs.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
e8f077c883 sysfs: Use dentry_ops instead of directly playing with the dcache
Calling d_drop unconditionally when a sysfs_dirent is deleted has
the potential to leak mounts, so instead implement dentry delete
and revalidate operations that cause sysfs dentries to be removed
at the appropriate time.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
28a027cfc0 sysfs: Rename sysfs_d_iput to sysfs_dentry_iput
Using dentry instead of d in the function name is what
several other filesystems are doing and it seems to be
a more readable convention.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
f44d3e7857 sysfs: Update sysfs_setxattr so it updates secdata under the sysfs_mutex
The sysfs_mutex is required to ensure updates are and will remain
atomic with respect to other inode iattr updates, that do not happen
through the filesystem.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Mathieu Desnoyers
d3a3b0adad debugfs: fix create mutex racy fops and private data
Setting fops and private data outside of the mutex at debugfs file
creation introduces a race where the files can be opened with the wrong
file operations and private data.  It is easy to trigger with a process
waiting on file creation notification.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Stefan Richter
f38506c49d sysfs: mark a locally-only used function static
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:51 -08:00
Arnd Bergmann
637e8a60a7 usbdevfs: move compat_ioctl handling to devio.c
Half the compat_ioctl handling is in devio.c, the other
half is in fs/compat_ioctl.c. This moves everything into
one place for consistency.

As a positive side-effect, push down the BKL into the
ioctl methods.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: David Vrabel <david.vrabel@csr.com>
Cc: linux-usb@vger.kernel.org
2009-12-10 22:55:37 +01:00
Arnd Bergmann
3695669cd4 lp: move compat_ioctl handling into lp.c
Handling for LPSETTIMEOUT can easily be done in lp_ioctl, which
is the only user. As a positive side-effect, push the BKL
into the ioctl methods.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-10 22:55:36 +01:00
Arnd Bergmann
43c6e7b97f compat_ioctl: pass compat pointer directly to handlers
Instead of having each handler call compat_ptr, we can now
convert the pointer once and pass that to each handler.
This saves a little bit of both source and object code size.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:12 +01:00
Arnd Bergmann
661f627da9 compat_ioctl: simplify lookup table
The compat_ioctl table now only contains entries for
COMPATIBLE_IOCTL, so we only need to know if a number
is listed in it or now.

As an optimization, we hash the table entries with a
reversible transformation to get a more uniform distribution
over it, sort the table at startup and then guess the
position in the table when an ioctl number gets called
to do a linear search from there.

With the current set of ioctl numbers and the chosen
transformation function, we need an average of four
steps to find if a number is in the set, all of the
accesses within one or two cache lines.

This at least as good as the previous hash table
approach but saves 8.5 kb of kernel memory.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:11 +01:00
Arnd Bergmann
789f0f8911 compat_ioctl: simplify calling of handlers
The compat_ioctl array now contains only entries for ioctl numbers
that do not require a separate handler. By special-casing the
ULONG_IOCTL case in the do_ioctl_trans function, we can kill the
final use of a function pointer in the array.

   text    data     bss     dec     hex filename
   7539   13352    2080   22971    59bb before/fs/compat_ioctl.o
   7910    8552    2080   18542    486e after/fs/compat_ioctl.o

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:10 +01:00
Arnd Bergmann
5a07ea0b97 compat_ioctl: inline all conversion handlers
This makes all ioctl conversion handlers called from
a single switch statement, leaving only COMPATIBLE_IOCTL
and ULONG_IOCTL statements in the table. This is somewhat
more space efficient and also lets us simplify the
handling of the lookup table significantly.

before:
   text    data     bss     dec     hex filename
   7619   14024    2080   23723    5cab obj/fs/compat_ioctl.o
after:
   7567   13352    2080   22999    59d7 obj/fs/compat_ioctl.o

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:09 +01:00
Arnd Bergmann
348c4b9078 compat_ioctl: Remove BKL
We have always called ioctl conversion handlers under the big kernel lock,
although that is generally not necessary.  In particular it is not needed
for conversion of data structures and for calling sys_ioctl or
do_vfs_ioctl, which will get the BKL again if needed.

Handlers doing more than those two have been moved out, so we can kill off
the BKL from compat_sys_ioctl.  This may significantly improve latencies
with 32 bit applications, and it avoids a common scenario where a thread
acquires the BKL twice.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-12-10 22:52:08 +01:00
Arnd Bergmann
fb07a5f857 compat_ioctl: remove all VT ioctl handling
The VT driver now handles all of these ioctls directly, so we can remove
the handlers from common code.

These are the only handlers that require the BKL because they directly
perform the ioctl action rather than just converting the data structures.
Once they are gone, we can remove the BKL from the remaining ioctl
conversion handlers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-12-10 22:52:08 +01:00
Linus Torvalds
02412f49f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: always use GFP_NOFS
2009-12-10 09:33:59 -08:00
Linus Torvalds
4515c3069d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
  ext4: Do not override ext2 or ext3 if built they are built as modules
  jbd2: Export jbd2_log_start_commit to fix ext4 build
  ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
  ext4: Wait for proper transaction commit on fsync
  ext4: fix incorrect block reservation on quota transfer.
  ext4: quota macros cleanup
  ext4: ext4_get_reserved_space() must return bytes instead of blocks
  ext4: remove blocks from inode prealloc list on failure
  ext4: wait for log to commit when umounting
  ext4: Avoid data / filesystem corruption when write fails to copy data
  ext4: Use ext4 file system driver for ext2/ext3 file system mounts
  ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
  jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
  ext4: remove unused parameter wbc from __ext4_journalled_writepage()
  ext4: remove encountered_congestion trace
  ext4: move_extent_per_page() cleanup
  ext4: initialize moved_len before calling ext4_move_extents()
  ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
  ext4: use ext4_data_block_valid() in ext4_free_blocks()
  ...
2009-12-10 09:33:29 -08:00
Linus Torvalds
a5eba3f66f Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Multi-device mirror support
  exofs: Move all operations to an io_engine
  exofs: move osd.c to ios.c
  exofs: statfs blocks is sectors not FS blocks
  exofs: Prints on mount and unmout
  exofs: refactor exofs_i_info initialization into common helper
  exofs: dbg-print less
  exofs: More sane debug print
  trivial: some small fixes in exofs documentation
2009-12-10 09:32:24 -08:00
Linus Torvalds
fc1495bf99 Merge git://git.infradead.org/ubifs-2.6
* git://git.infradead.org/ubifs-2.6:
  UBIFS: fix return code in check_leaf
  UBI: flush wl before clearing update marker
  MAINTAINERS: change e-mail of Artem Bityutskiy
  UBIFS: remove manual O_SYNC handling
  UBIFS: support mounting of UBI volume character devices
  UBI: Add ubi_open_volume_path
2009-12-10 09:31:45 -08:00
Trond Myklebust
190f38e5ce NFS: Fix nfs_migrate_page()
The call to migrate_page() will cause the page->private field to be
cleared.
Also fix up the locking around the page->private transfer, so that we ensure
that calls to nfs_page_find_request() don't end up racing.

Finally, fix up a double free bug: nfs_unlock_request() already calls
nfs_release_request() for us...

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Andi Kleen <andi@firstfloor.org>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-10 09:05:55 -05:00
Roel Kluin
8e0eb4011b ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
Return the PTR_ERR of the correct pointer.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:55 +01:00
Jan Kara
68eb3db083 ext3: Fix data / filesystem corruption when write fails to copy data
When ext3_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks already
instantiated beyond i_size. Although these blocks were never inside i_size, we
have to truncate pagecache of these blocks so that corresponding buffers get
unmapped. Otherwise subsequent __block_prepare_write (called because we are
retrying the write) will find the buffers mapped, not call ->get_block, and
thus the page will be backed by already freed blocks leading to filesystem and
data corruption.

Reported-by: James Y Knight <foom@fuhm.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:55 +01:00
Jan Kara
5a20bdfcdc ext4: Support for 64-bit quota format
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
1aeec43432 ext3: Support for vfsv1 quota format
We just have to add proper mount options handling. The rest is handled by
the generic quota code.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
498c60153e quota: Implement quota format with 64-bit space and inode limits
So far the maximum quota space limit was 4TB. Apparently this isn't enough
for Lustre guys anymore. So implement new quota format which raises block
limits to 2^64 bytes. Also store number of inodes and inode limits in
64-bit variables as 2^32 files isn't that insanely high anymore.

The first version of the patch has been developed by Andrew Perepechko
<Andrew.Perepechko@Sun.COM>.

CC: Andrew.Perepechko@Sun.COM
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
3067393005 quota: Move definition of QFMT_OCFS2 to linux/quota.h
Move definition of this constant to linux/quota.h so that it
cannot clash with other format IDs.

CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Jérémy Cochoy
92e128884b ext2: fix comment in ext2_find_entry about return values
Signed-off-by: Jérémy Cochoy <jeremy.cochoy@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Alexey Fisher
4cf46b67eb ext3: Unify log messages in ext3
Make messages produced by ext3 more unified. It should be
easy to parse.

dmesg before patch:
[ 4893.684892] reservations ON
[ 4893.684896] xip option not supported
[ 4893.684964] EXT3-fs warning: maximal mount count reached, running
e2fsck is recommended

dmesg after patch:
[  873.300792] EXT3-fs (loop0): using internal journaln
[  873.300796] EXT3-fs (loop0): mounted filesystem with writeback data mode
[  924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.
[  723.755642] EXT3-fs (loop0): error: bad blocksize 8192
[  357.874687] EXT3-fs (loop0): error: no journal found. mounting ext3 over ext2?
[  873.300764] EXT3-fs (loop0): warning: maximal mount count reached, running e2fsck is recommended
[  924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.

Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Stephen Hemminger
2074abfeb8 ext2: clear uptodate flag on super block I/O error
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during
unmount when a USB or floppy device is removed. I reported this a kernel
regression, but looks like it might have been there for longer
than that.

The super block update from a previous operation has marked the buffer
as in error, and the flag has to be cleared before doing the update.
(Similar code already exists in ext4).

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Alexey Fisher
2314b07cb4 ext2: Unify log messages in ext2
make messages produced by ext2 more unified. It should be
easy to parse.

dmesg before patch:
[ 4893.684892] reservations ON
[ 4893.684896] xip option not supported
[ 4893.684961] EXT2-fs warning: mounting ext3 filesystem as ext2
[ 4893.684964] EXT2-fs warning: maximal mount count reached, running
e2fsck is recommended
[ 4893.684990] EXT II FS: 0.5b, 95/08/09, bs=1024, fs=1024, gc=2,
bpg=8192, ipg=1280, mo=80010]

dmesg after patch:
[ 4893.684892] EXT2-fs (loop0): reservations ON
[ 4893.684896] EXT2-fs (loop0): xip option not supported
[ 4893.684961] EXT2-fs (loop0): warning: mounting ext3 filesystem as
ext2
[ 4893.684964] EXT2-fs (loop0): warning: maximal mount count reached,
running e2fsck is recommended
[ 4893.684990] EXT2-fs (loop0): 0.5b, 95/08/09, bs=1024, fs=1024, gc=2,
bpg=8192, ipg=1280, mo=80010]

Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Reviewed-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
dee1d3b627 ext3: make "norecovery" an alias for "noload"
Users on the list recently complained about differences across
filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext3.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
b918397542 ext3: Don't update the superblock in ext3_statfs()
commit a71ce8c6c9bf269b192f352ea555217815cf027e updated ext3_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

The modifications were originally to keep the sb "more" in sync,
so that a readonly fsck of the device didn't flag this as an
error (as often), but apparently e2fsprogs deals with this differently
now, anyway.

Based on Ted's patch for ext4, which was in turn based on my
work on that bug and another preliminary patch...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
d965736b8c ext3: journal all modifications in ext3_xattr_set_handle
ext3_xattr_set_handle() was zeroing out an inode outside
of journaling constraints; this is one of the accesses that
was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Although ext3 doesn't have the crc issue, modifications
out of journal control are a Bad Thing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Jan Kara
c56818d7dc quota: Fix WARN_ON in lookup_one_len
We should hold i_mutex when looking up quota files for journaled quotas,
otherwise a WARN_ON in lookup_one_len triggers. The fact that we didn't
hold i_mutex previously probably could not lead to a real bug since the
filesystem is just being mounted / remounted read-write and thus the
root directory cannot change anyway but it's definitely cleaner with
i_mutex.

Reported-by: Bastien ROUCARIES <roucaries.bastien@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Alexey Dobriyan
1472da5fdc const: struct quota_format_ops
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Christoph Hellwig
5ced58f735 ubifs: remove manual O_SYNC handling
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate call to ubifs_sync_wbufs_by_inode which is already
covered by ubifs_fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Christoph Hellwig
027cf316af afs: remove manual O_SYNC handling
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate manual invocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Christoph Hellwig
94004ed726 kill wait_on_page_writeback_range
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Christoph Hellwig
6b2f3d1f76 vfs: Implement proper O_SYNC semantics
While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment.  This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics.  After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.

This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag.  To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.

This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition.  Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set.  The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.

We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.

Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op.  We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Jan Kara
59bc055211 zisofs: Implement reading of compressed files when PAGE_CACHE_SIZE > compress block size
Also split and cleanup zisofs_readpage() when we are changing it anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:49 +01:00
Boaz Harrosh
04dc1e88ad exofs: Multi-device mirror support
This patch changes on-disk format, it is accompanied with a parallel
patch to mkfs.exofs that enables multi-device capabilities.

After this patch, old exofs will refuse to mount a new formatted FS and
new exofs will refuse an old format. This is done by moving the magic
field offset inside the FSCB. A new FSCB *version* field was added. In
the future, exofs will refuse to mount unmatched FSCB version. To
up-grade or down-grade an exofs one must use mkfs.exofs --upgrade option
before mounting.

Introduced, a new object that contains a *device-table*. This object
contains the default *data-map* and a linear array of devices
information, which identifies the devices used in the filesystem. This
object is only written to offline by mkfs.exofs. This is why it is kept
separate from the FSCB, since the later is written to while mounted.

Same partition number, same object number is used on all devices only
the device varies.

* define the new format, then load the device table on mount time make
  sure every thing is supported.

* Change I/O engine to now support Mirror IO, .i.e write same data
  to multiple devices, read from a random device to spread the
  read-load from multiple clients (TODO: stripe read)

Implementation notes:
 A few points introduced in previous patch should be mentioned here:

* Special care was made so absolutlly all operation that have any chance
  of failing are done before any osd-request is executed. This is to
  minimize the need for a data consistency recovery, to only real IO
  errors.

* Each IO state has a kref. It starts at 1, any osd-request executed
  will increment the kref, finally when all are executed the first ref
  is dropped. At IO-done, each request completion decrements the kref,
  the last one to return executes the internal _last_io() routine.
  _last_io() will call the registered io_state_done. On sync mode a
  caller does not supply a done method, indicating a synchronous
  request, the caller is put to sleep and a special io_state_done is
  registered that will awaken the caller. Though also in sync mode all
  operations are executed in parallel.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:23 +02:00
Boaz Harrosh
06886a5a3d exofs: Move all operations to an io_engine
In anticipation for multi-device operations, we separate osd operations
into an abstract I/O API. Currently only one device is used but later
when adding more devices, we will drive all devices in parallel according
to a "data_map" that describes how data is arranged on multiple devices.
The file system level operates, like before, as if there is one object
(inode-number) and an i_size. The io engine will split this to the same
object-number but on multiple device.

At first we introduce Mirror (raid 1) layout. But at the final outcome
we intend to fully implement the pNFS-Objects data-map, including
raid 0,4,5,6 over mirrored devices, over multiple device-groups. And
more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12

* Define an io_state based API for accessing osd storage devices
  in an abstract way.
  Usage:
	First a caller allocates an io state with:
		exofs_get_io_state(struct exofs_sb_info *sbi,
				   struct exofs_io_state** ios);

	Then calles one of:
		exofs_sbi_create(struct exofs_io_state *ios);
		exofs_sbi_remove(struct exofs_io_state *ios);
		exofs_sbi_write(struct exofs_io_state *ios);
		exofs_sbi_read(struct exofs_io_state *ios);
		exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);

	And when done
		exofs_put_io_state(struct exofs_io_state *ios);

* Convert all source files to use this new API
* Convert from bio_alloc to bio_kmalloc
* In io engine we make use of the now fixed osd_req_decode_sense

There are no functional changes or on disk additions after this patch.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:22 +02:00
Boaz Harrosh
8ce9bdd1fb exofs: move osd.c to ios.c
If I do a "git mv" together with a massive code change
and commit in one patch, git looses the rename and
records a delete/new instead. This is bad because I want
a rename recorded so later rebased/cherry-picked patches
to the old name will work. Also the --follow is lost.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:21 +02:00