Commit Graph

4032 Commits

Author SHA1 Message Date
Matthew Wilcox
3dc2916107 dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax_clear_sectors() cannot handle poisoned blocks.  These must be
zeroed using the BIO interface instead.  Convert ext2 and XFS to use
only sb_issue_zerout().

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 12:16:56 -06:00
Linus Torvalds
e34df3344d Merge branch 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel lookup fixups from Al Viro:
 "Fix for xfs parallel readdir (turns out the cxfs exposure was not
  enough to catch all problems), and a reversion of btrfs back to
  ->iterate() until the fs/btrfs/delayed-inode.c gets fixed"

* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xfs: concurrent readdir hangs on data buffer locks
  Revert "btrfs: switch to ->iterate_shared()"
2016-05-18 10:28:45 -07:00
Dave Chinner
9f54180109 xfs: concurrent readdir hangs on data buffer locks
There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
       ilock(shared)
       map, lock and read buffer
       iunlock(shared)
       process entries in buffer
       .....
                                                readdir
                                                iolock(shared)
                                                  get_leaf_dents
                                                    iterate entries
                                                      ilock(shared)
                                                      map, lock buffer
                                                      <blocks>
                        finish ->iterate_shared
                        file_accessed()
                          ->update_time
                            start transaction
                            ilock(excl)
                            <blocks>
        .....
        finishes processing buffer
        get next buffer
          ilock(shared)
          <blocks>

And that's the deadlock.

Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 13:20:21 -04:00
Dave Chinner
ad438c4038 xfs: move reclaim tagging functions
Rearrange the inode tagging functions so that they are higher up in
xfs_cache.c and so there is no need for forward prototypes to be
defined. This is purely code movement, no other change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-05-18 14:20:08 +10:00
Dave Chinner
545c0889d2 xfs: simplify inode reclaim tagging interfaces
Inode radix tree tagging for reclaim passes a lot of unnecessary
variables around. Over time the xfs-perag has grown a xfs_mount
backpointer, and an internal agno so we don't need to pass other
variables into the tagging functions to supply this information.

Rework the functions to pass the minimal variable set required
and simplify the internal logic and flow.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:11:41 +10:00
Dave Chinner
194293631d xfs: rename variables in xfs_iflush_cluster for clarity
The cluster inode variable uses unconventional naming - iq - which
makes it hard to distinguish it between the inode passed into the
function - ip - and that is a vector for mistakes to be made.
Rename all the cluster inode variables to use a more conventional
prefixes to reduce potential future confusion (cilist, cilist_size,
cip).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:46 +10:00
Dave Chinner
5a90e53e81 xfs: xfs_iflush_cluster has range issues
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:13 +10:00
Dave Chinner
8a17d7dded xfs: mark reclaimed inodes invalid earlier
The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".

This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such not realising
that it is in the final stages of being freed.

As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.

For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
if the inode number if valid while we have these locks held we know
that it has not progressed through reclaim to the point where it is
clean and is about to be freed.

[bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
	  had already been zeroed.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:09:12 +10:00
Dave Chinner
1f2dcfe89e xfs: xfs_inode_free() isn't RCU safe
The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.

Unfortunately, this freeing can race with other accesses that are in
the RCU current grace period that have found the inode in the radix
tree with a valid state.  This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.

The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get spuriously incorrect clean state returned
from such checks. This can lead to use thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.

This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things it shouldn't with freed memory.

IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.

To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we are guarantee RCU protected
lookup code will always have the memory structures it needs
available during the RCU grace period that lookup races can occur
in.

Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:01:53 +10:00
Alex Lyakas
32b43ab6fb xfs: optimise xfs_iext_destroy
When unmounting XFS, we call:

xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy

This goes over the whole indirection array and calls
xfs_iext_irec_remove for each one of the erps (from the last one to
the first one). As a result, we keep shrinking (reallocating
actually) the indirection array until we shrink out all of its
elements. When we have files with huge numbers of extents, umount
takes 30-80 sec, depending on the amount of files that XFS loaded
and the amount of indirection entries of each file. The unmount
stack looks like:

[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17

Further, this reallocation prevents us from freeing the extent list
from a RCU callback as allocation can block. Hence if the extent
list is in indirect format, optimise the freeing of the extent list
to only use kmem_free calls by freeing entire extent buffer pages at
a time, rather than extent by extent.

[dchinner: simplified freeing loop based on Christoph's suggestion]

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 14:01:52 +10:00
Dave Chinner
7d3aa7fe97 xfs: skip stale inodes in xfs_iflush_cluster
We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.

cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:54:23 +10:00
Dave Chinner
51b07f30a7 xfs: fix inode validity check in xfs_iflush_cluster
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.

(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>

cc: <stable@vger.kernel.org> # 3.10.x-
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:54:22 +10:00
Dave Chinner
b1438f4779 xfs: xfs_iflush_cluster fails to abort on error
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.

Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.

cc: <stable@vger.kernel.org> # 3.10.x-
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:53:42 +10:00
Dave Chinner
8179c03629 xfs: remove xfs_fs_evict_inode()
Joe Lawrence reported a list_add corruption with 4.6-rc1 when
testing some custom md administration code that made it's own
block device nodes for the md array. The simple test loop of:

for i in {0..100}; do
	mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
	mdadm --detail --export $tmp/tmp_node > /dev/null
	rm -f $tmp/tmp_node
done


Would produce this warning in bd_acquire() when mdadm opened the
device node:

list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.

And then produce this from bd_forget from kdevtmpfs evicting a block
dev inode:

list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8

This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
to vfs inode"). The issue is that xfs_inactive() frees the
unlinked inode, and the above commit meant that this freeing zeroed
the mode in the struct inode. The problem is that after evict() has
called ->evict_inode, it expects the i_mode to be intact so that it
can call bd_forget() or cd_forget() to drop the reference to the
block device inode attached to the XFS inode.

In reality, the only thing we do in xfs_fs_evict_inode() that is not
generic is call xfs_inactive(). We can move the xfs_inactive() call
to xfs_fs_destroy_inode() without any problems at all, and this
will leave the VFS inode intact until it is completely done with it.

So, remove xfs_fs_evict_inode(), and do the work it used to do in
->destroy_inode instead.

cc: <stable@vger.kernel.org> # 4.6
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 13:52:42 +10:00
Carlos Maiolino
e6b3bb7896 xfs: add "fail at unmount" error handling configuration
If we take "retry forever" literally on metadata IO errors, we can
hang at unmount, once it retries those writes forever. This is the
default behavior, unfortunately.

Add an error configuration option for this behavior and default it
to "fail" so that an unmount will trigger actuall errors, a shutdown
and allow the unmount to succeed. It will be noisy, though, as it
will log the errors and shutdown that occurs.

To fix this, we need to mark the filesystem as being in the process
of unmounting. Do this with a mount flag that is added at the
appropriate time (i.e. before the blocking AIL sync). We also need
to add this flag if mount fails after the initial phase of log
recovery has been run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:11:27 +10:00
Carlos Maiolino
e0a431b3a3 xfs: add configuration handlers for specific errors
now most of the infrastructure is in place, we can start adding
support for configuring specific errors such as ENODEV, ENOSPC, EIO,
etc. Add these error configurations and configure them all to have
appropriate behaviours. That is, all will be configured to retry
forever by default, except for ENODEV, which is an unrecoverable
error, so it will be configured to not retry on error

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:09:28 +10:00
Carlos Maiolino
a5ea70d25d xfs: add configuration of error failure speed
On reception of an error, we can fail immediately, perform some
bound amount of retries or retry indefinitely. The current behaviour
we have is to retry forever.

However, we'd like the ability to choose how long the filesystem
should try after an error, it can either fail immediately, retry a
few times, or retry forever. This is implemented by using
max_retries sysfs attribute, to hold the amount of times we allow
the filesystem to retry after an error. Being -1 a special case
where the filesystem will retry indefinitely.

Add both a maximum retry count and a retry timeout so that we can
bound by time and/or physical IO attempts.

Finally, plumb these into xfs_buf_iodone error processing so that
the error behaviour follows the selected configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:08:15 +10:00
Carlos Maiolino
ef6a50fbb1 xfs: introduce table-based init for error behaviors
Before we start expanding the number of error classes and errors we
can configure behaviour for, we need a simple and clear way to
define the default behaviour that we initialized each mount with.
Introduce a table based method for keeping the initial configuration
in, and apply that to the existing initialization code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:06:44 +10:00
Carlos Maiolino
df3093907c xfs: add configurable error support to metadata buffers
With the error configuration handle for async metadata write errors
in place, we can now add initial support to the IO error processing
in xfs_buf_iodone_error().

Add an infrastructure function to look up the configuration handle,
and rearrange the error handling to prepare the way for different
error handling conigurations to be used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:05:33 +10:00
Carlos Maiolino
ffd40ef697 xfs: introduce metadata IO error class
Now we have the basic infrastructure, add the first error class so
we can build up the infrastructure in a meaningful way. Add the
metadata async write IO error class and sysfs entry, and introduce a
default configuration that matches the existing "retry forever"
behavior for async write metadata buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 11:01:00 +10:00
Carlos Maiolino
192852be8b xfs: configurable error behavior via sysfs
We need to be able to change the way XFS behaviours in error
conditions depending on the type of underlying storage. This is
necessary for handling non-traditional block devices with extended
error cases, such as thin provisioned devices that can return ENOSPC
as an IO error.

Introduce the basic sysfs infrastructure needed to define and
configure error behaviours. This is done to be generic enough to
extend to configuring behaviour in other error conditions, such as
ENOMEM, which also has different desired behaviours according to
machine configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 10:58:51 +10:00
Brian Foster
9bdd9bd69b xfs: buffer ->bi_end_io function requires irq-safe lock
Reports have surfaced of a lockdep splat complaining about an
irq-safe -> irq-unsafe locking order in the xfs_buf_bio_end_io() bio
completion handler. This only occurs when I/O errors are present
because bp->b_lock is only acquired in this context to protect
setting an error on the buffer. The problem is that this lock can be
acquired with the (request_queue) q->queue_lock held. See
scsi_end_request() or ata_qc_schedule_eh(), for example.

Replace the locked test/set of b_io_error with a cmpxchg() call.
This eliminates the need for the lock and thus the lock ordering
problem goes away.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 10:56:41 +10:00
Linus Torvalds
c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Toshi Kani
1e937cddd1 xfs: Add alignment check for DAX mount
When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:12 -06:00
Jan Kara
02fbd13975 dax: Remove complete_unwritten argument
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16 18:11:51 -06:00
Al Viro
3b0a3c1ac1 simple local filesystems: switch to ->iterate_shared()
no changes needed (XFS isn't simple, but it has the same parallelism
in the interesting parts exercised from CXFS).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:32 -04:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
13712713ca xfs: eliminate the pos variable in xfs_file_dio_aio_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Al Viro
b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Eryu Guan
6e3e6d55e5 xfs: mute some sparse warnings
These three warnings are fixed:

fs/xfs/xfs_inode.c:1033:44: warning: Using plain integer as NULL pointer
fs/xfs/xfs_inode_item.c:525:20: warning: context imbalance in 'xfs_inode_item_push' - unexpected unlock
fs/xfs/xfs_dquot.c:696:1: warning: symbol 'xfs_dq_get_next_id' was not declared. Should it be static?

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:47:21 +10:00
Christoph Hellwig
664b60f6ba xfs: improve kmem_realloc
Use krealloc to implement our realloc function.  This helps to avoid
new allocations if we are still in the slab bucket.  At least for the
bmap btree root that's actually the common case.

This also allows removing the now unused oldsize argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:47:01 +10:00
Carlos Maiolino
9f27889f3a xfs: Add caller function output to xfs_log_force tracepoint
I had sent this patch yesterday, but for some reason it didn't reach
xfs list, sending again.

Output the caller of xfs_log_force might be useful when tracing log
checkpoint problems without the need to build kernel with DEBUG.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:46:30 +10:00
Christoph Hellwig
710b1e2c29 xfs: remove transaction types
These aren't used for CIL-style logging and can be dropped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:20:36 +10:00
Christoph Hellwig
253f4911f2 xfs: better xfs_trans_alloc interface
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superflous since we stopped supporting the non-CIL logging mode.  The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:19:55 +10:00
Christoph Hellwig
0e51a8e191 xfs: optimize bio handling in the buffer writeback path
This patch implements two closely related changes:  First it embeds
a bio the ioend structure so that we don't have to allocate one
separately.  Second it uses the block layer bio chaining mechanism
to chain additional bios off this first one if needed instead of
manually accounting for multiple bio completions in the ioend
structure.  Together this removes a memory allocation per ioend and
greatly simplifies the ioend setup and I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:34:30 +10:00
Dave Chinner
37992c18bb xfs: don't release bios on completion immediately
Completion of an ioend requires us to walk the bufferhead list to
end writback on all the bufferheads. This, in turn, is needed so
that we can end writeback on all the pages we just did IO on.

To remove our dependency on bufferheads in writeback, we need to
turn this around the other way - we need to walk the pages we've
just completed IO on, and then walk the buffers attached to the
pages and complete their IO. In doing this, we remove the
requirement for the ioend to track bufferheads directly.

To enable IO completion to walk all the pages we've submitted IO on,
we need to keep the bios that we used for IO around until the ioend
has been completed. We can do this simply by chaining the bios to
the ioend at completion time, and then walking their pages directly
just before destroying the ioend.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: changed the xfs_finish_page_writeback calling convention]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:12:28 +10:00
Dave Chinner
bb18782aa4 xfs: build bios directly in xfs_add_to_ioend
Currently adding a buffer to the ioend and then building a bio from
the buffer list are two separate operations. We don't build the bios
and submit them until the ioend is submitted, and this places a
fixed dependency on bufferhead chaining in the ioend.

The first step to removing the bufferhead chaining in the ioend is
on the IO submission side. We can build the bio directly as we add
the buffers to the ioend chain, thereby removing the need for a
latter "buffer-to-bio" submission loop. This allows us to submit
bios on large ioends as soon as we cannot add more data to the bio.

These bios then get captured by the active plug, and hence will be
dispatched as soon as either the plug overflows or we schedule away
from the writeback context. This will reduce submission latency for
large IOs, but will also allow more timely request queue based
writeback blocking when the device becomes congested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: various small updates]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:11:25 +10:00
Eric Sandeen
3ab3ffcaca xfs: collapse cases in xfs_attr3_leaf_list_int
Consolidate the 2 calls to ->put_listent in
xfs_attr3_leaf_list_int(), by setting up name, namelen, and
valuelen for the local vs remote cases, then call ->put_listent
and do the error handling all in one spot.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2016-04-06 07:57:47 +10:00
Eric Sandeen
7af5ad28a6 xfs: remove put_value from attr ->put_listent context
The put_value context member is never set; remove it
and the conditional test in xfs_attr3_leaf_list_int().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:57:45 +10:00
Eric Sandeen
e5bd12bfea xfs: don't pass value into attr ->put_listent
The value is not used; only names and value lengths are
returned.  Remove the argument.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:57:32 +10:00
Eric Sandeen
2a6fba6d23 xfs: only return -errno or success from attr ->put_listent
Today, the put_listent formatters return either 1 or 0; if
they return 1, some callers treat this as an error and return
it up the stack, despite "1" not being a valid (negative)
error code.

The intent seems to be that if the input buffer is full,
we set seen_enough or set count = -1, and return 1;
but some callers check the return before checking the
seen_enough or count fields of the context.

Fix this by only returning non-zero for actual errors
encountered, and rely on the caller to first check the
return value, then check the values in the context to
decide what to do.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:57:18 +10:00
Christoph Hellwig
30ee052e12 xfs: optimize inline symlinks
By overallocating the in-core inode fork data buffer and zero
terminating the link target in xfs_init_local_fork we can avoid
the memory allocation in ->follow_link.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:53:29 +10:00
Christoph Hellwig
bfe8804d90 xfs: use ->readlink to implement the readlink_by_handle ioctl
Also drop the now unused readlink_copy export.

[dchinner: use d_inode(dentry) rather than dentry->d_inode]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:50:54 +10:00
Christoph Hellwig
2b3d1d41b4 xfs: set up inode operation vectors later
In the next patch we'll set up different inode operations for inline vs
out of line symlinks, for that we need to make sure the flags are already
set up properly.

[dchinner: added xfs_setup_iops() call to xfs_rename_alloc_whiteout()]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:48:27 +10:00
Christoph Hellwig
143f4aede7 xfs: factor out a helper to initialize a local format inode fork
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:41:43 +10:00
Eryu Guan
ce5c767db0 xfs: add missing break in xfs_parseargs()
Commit 2e74af0e11 ("xfs: convert mount option parsing to tokens")
missed a 'break;' in xfs_parseargs() which causes mount to fail with
"-o pqnoenforce" option when mounting a v4 filesystem. xfs/050
catches this failure:

XFS (vda6): Super block does not support project and group quota together

Fixes: 2e74af0e11 ("xfs: convert mount option parsing to tokens")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:19:40 +10:00
Dave Chinner
ad747e3b29 xfs: Don't wrap growfs AGFL indexes
Commit 96f859d ("libxfs: pack the agfl header structure so
XFS_AGFL_SIZE is correct") allowed the freelist to use the empty
slot at the end of the freelist on 64 bit systems that was not
being used due to sizeof() rounding up the structure size.

This has caused versions of xfs_repair prior to 4.5.0 (which also
has the fix) to report this as a corruption once the filesystem has
been grown. Older kernels can also have problems (seen from a whacky
container/vm management environment) mounting filesystems grown on a
system with a newer kernel than the vm/container it is deployed on.

To avoid this problem, change the initial free list indexes not to
wrap across the end of the AGFL, hence avoiding the initialisation
of agf_fllast to the last index in the AGFL.

cc: <stable@vger.kernel.org> # 4.4-4.5
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:06:20 +10:00
Eric Sandeen
d0a58e8339 xfs: disallow rw remount on fs with unknown ro-compat features
Today, a kernel which refuses to mount a filesystem read-write
due to unknown ro-compat features can still transition to read-write
via the remount path.  The old kernel is most likely none the wiser,
because it's unaware of the new feature, and isn't using it.  However,
writing to the filesystem may well corrupt metadata related to that
new feature, and moving to a newer kernel which understand the feature
will have problems.

Right now the only ro-compat feature we have is the free inode btree,
which showed up in v3.16.  It would be good to push this back to
all the active stable kernels, I think, so that if anyone is using
newer mkfs (which enables the finobt feature) with older kernel
releases, they'll be protected.

Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 07:05:41 +10:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Andreas Gruenbacher
b8a7a3a667 posix_acl: Inode acl caching fixes
When get_acl() is called for an inode whose ACL is not cached yet, the
get_acl inode operation is called to fetch the ACL from the filesystem.
The inode operation is responsible for updating the cached acl with
set_cached_acl().  This is done without locking at the VFS level, so
another task can call set_cached_acl() or forget_cached_acl() before the
get_acl inode operation gets to calling set_cached_acl(), and then
get_acl's call to set_cached_acl() results in caching an outdate ACL.

Prevent this from happening by setting the cached ACL pointer to a
task-specific sentinel value before calling the get_acl inode operation.
Move the responsibility for updating the cached ACL from the get_acl
inode operations to get_acl().  There, only set the cached ACL if the
sentinel value hasn't changed.

The sentinel values are chosen to have odd values.  Likewise, the value
of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
an even value (ACLs are aligned in memory).  This allows to distinguish
uncached ACLs values from ACL objects.

In addition, switch from guarding inode->i_acl and inode->i_default_acl
upates by the inode->i_lock spinlock to using xchg() and cmpxchg().

Filesystems that do not want ACLs returned from their get_acl inode
operations to be cached must call forget_cached_acl() to prevent the VFS
from doing so.

(Patch written by Al Viro and Andreas Gruenbacher.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-31 00:30:15 -04:00
Linus Torvalds
8b306a2e7c Various bugfixes, a RDMA update from Chuck Lever, and support for a new
pnfs layout type from Christoph Hellwig.  The new layout type is a
 variant of the block layout which uses SCSI features to offer improved
 fencing and device identification.
 
 Note this pull request also includes the client side of SCSI layout,
 with Trond's permission.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW9D0/AAoJECebzXlCjuG+fYcP/ibluAOSRrQ523gQcJNS+QSV
 3B7YY6diJkfQNkm4oAROwPd1KHT2qhoVAO3JHXA3SZnjVVYQxAHeh2wsZJ2jL6Ft
 uyZARxix+F9alJVT3S+uYLwagjh9LXLhb0MaRTMheaWGsPKLQTU4JtsLsjAIhCah
 R0EIIdQfWcb83XoVPmiflVO4Nl/TQWmfA5wHfoVtITJcL3AaC9gzCGNbc8dHLnFC
 HRjGVgHr3nSL3suvUEFfxSEo4QoNPWIX4kBaWXgqbVgOQqmbtQtaXdnd3gIRtkzj
 9Q/lxiwaArtDjdAQdyNtRRBUpkpWo+xWp/vpnNUxTXKoRtpSyqYQX5FaPCPRVAAp
 GYGw2qHrvWn2hSajtVtKyWwsQ3lYsDmbkxAkgScO9kQdS+kuxNyIzYIEvakdtFyJ
 txFsauJczkNNFeHKzLPDoGbuX7KB/+pUsjmX5nYtMhwRriXA5S8zcO4AvTrmTPDF
 vQrLM97mqI60LWmpQUO1OE8CEFPVx5DUZ0KdLMvFNKPZph8BTPJxJMmxJK4R6stV
 /TWglRTEO8IGUh0ww8+3PfMfxVG5XHnQc99+VGVZOS9hJ4GOXbWYAqZ0m+sRJ2Pi
 JPawILie5x2gH1FrVYbcTZsQzdmdn/BF9yePNzAkMucjuEUHXFTlf3MMfEhKpYTl
 0l8LBCv6ZvtGU+PUJxZn
 =MToz
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux

Pull more nfsd updates from Bruce Fields:
 "Apologies for the previous request, which omitted the top 8 commits
  from my for-next branch (including the SCSI layout commits).  Thanks
  to Trond for spotting my error!"

This actually includes the new layout types, so here's that part of
the pull message repeated:

 "Support for a new pnfs layout type from Christoph Hellwig.  The new
  layout type is a variant of the block layout which uses SCSI features
  to offer improved fencing and device identification.

  Note this pull request also includes the client side of SCSI layout,
  with Trond's permission"

* tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: use short read as well as i_size to set eof
  nfsd: better layoutupdate bounds-checking
  nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK
  nfsd: add SCSI layout support
  nfsd: move some blocklayout code
  nfsd: add a new config option for the block layout driver
  nfs/blocklayout: add SCSI layout support
  nfs4.h: add SCSI layout definitions
2016-03-24 19:50:32 -07:00
Linus Torvalds
53d2e6976b xfs: Changes for 4.6-rc1
Change summary:
 o error propagation for direct IO failures fixes for both XFS and ext4
 o new quota interfaces and XFS implementation for iterating all the quota IDs
   in the filesystem
 o locking fixes for real-time device extent allocation
 o reduction of duplicate information in the xfs and vfs inode, saving roughly
   100 bytes of memory per cached inode.
 o buffer flag cleanup
 o rework of the writepage code to use the generic write clustering mechanisms
 o several fixes for inode flag based DAX enablement
 o rework of remount option parsing
 o compile time verification of on-disk format structure sizes
 o delayed allocation reservation overrun fixes
 o lots of little error handling fixes
 o small memory leak fixes
 o enable xfsaild freezing again
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW71DQAAoJEK3oKUf0dfodyiwP/0Tou9f1huzLC0kd7kmEoKKC
 BWQmtJGEdo0iSpJNZhg/EJmjvRtbBiOB9CRcEyG8d71kqZ+MKW7t/4JjNvNG34aE
 vHjhwMBVVqkw/q6azi2LiEDsVcOe5bXxUrXNZi18/09OAl4pHm+X8VERLnnC5y+i
 QIHAOdB5R+36cXcceJm1HR6jTZedbNdQkT/ndhm5S60FGhvVI29cs9NwYwoi5aif
 O55r6krSWBj6U/X6MsLvr+lNb6+1Sd1hyE8dGTE7lOUX/crFIysaDPEuQmWvDjsO
 M1ulVfzKoBJHcyvpbdHwdBEyiBjzvETcrgndMRoWOjZiOLqNtWYsgIEiC+Nlidwd
 +T4XhkJJJg5UUQ4r6Hs85SQn/THanzR5KoN5nbTsFtFkCKw1DRkUSNuh2mXP2xVG
 JcNDCjDvvHG76EfQ1otlYf7ru79Ck+hjVs+szaEVPpOzAwz8yOtD+L7I8f73gQ6a
 ayP8W2oZQpYvQRv+smgvt+HwQA4fNJk9ZseY3QD5+z5snJz7JEhZogqW+ngFYkNQ
 dtA5Y7gpTkKfo3mKO0XmE5+3fcSXhGHGYQzmUgJFlgWTK7+E8fuDhn6D66wFcZSq
 QhyRk9J7Xb7ZWuP5PlOkxb9DLd4hnuyie2bYw/0hVtOatjE/Em4gRJ3Oq3ZANwZx
 OeMGj4Uyb3/MKAJwy3Gq
 =ZoiX
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There's quite a lot in this request, and there's some cross-over with
  ext4, dax and quota code due to the nature of the changes being made.

  As for the rest of the XFS changes, there are lots of little things
  all over the place, which add up to a lot of changes in the end.

  The major changes are that we've reduced the size of the struct
  xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
  >10%), the writepage code now only does a single set of mapping tree
  lockups so uses less CPU, delayed allocation reservations won't
  overrun under random write loads anymore, and we added compile time
  verification for on-disk structure sizes so we find out when a commit
  or platform/compiler change breaks the on disk structure as early as
  possible.

  Change summary:

   - error propagation for direct IO failures fixes for both XFS and
     ext4
   - new quota interfaces and XFS implementation for iterating all the
     quota IDs in the filesystem
   - locking fixes for real-time device extent allocation
   - reduction of duplicate information in the xfs and vfs inode, saving
     roughly 100 bytes of memory per cached inode.
   - buffer flag cleanup
   - rework of the writepage code to use the generic write clustering
     mechanisms
   - several fixes for inode flag based DAX enablement
   - rework of remount option parsing
   - compile time verification of on-disk format structure sizes
   - delayed allocation reservation overrun fixes
   - lots of little error handling fixes
   - small memory leak fixes
   - enable xfsaild freezing again"

* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
  xfs: always set rvalp in xfs_dir2_node_trim_free
  xfs: ensure committed is initialized in xfs_trans_roll
  xfs: borrow indirect blocks from freed extent when available
  xfs: refactor delalloc indlen reservation split into helper
  xfs: update freeblocks counter after extent deletion
  xfs: debug mode forced buffered write failure
  xfs: remove impossible condition
  xfs: check sizes of XFS on-disk structures at compile time
  xfs: ioends require logically contiguous file offsets
  xfs: use named array initializers for log item dumping
  xfs: fix computation of inode btree maxlevels
  xfs: reinitialise per-AG structures if geometry changes during recovery
  xfs: remove xfs_trans_get_block_res
  xfs: fix up inode32/64 (re)mount handling
  xfs: fix format specifier , should be %llx and not %llu
  xfs: sanitize remount options
  xfs: convert mount option parsing to tokens
  xfs: fix two memory leaks in xfs_attr_list.c error paths
  xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
  xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
  ...
2016-03-21 11:53:05 -07:00
Christoph Hellwig
f99d4fbdae nfsd: add SCSI layout support
This is a simple extension to the block layout driver to use SCSI
persistent reservations for access control and fencing, as well as
SCSI VPD pages for device identification.

For this we need to pass the nfs4_client to the proc_getdeviceinfo method
to generate the reservation key, and add a new fence_client method
to allow for fence actions in the layout driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-18 11:42:53 -04:00
Christoph Hellwig
81c3932901 nfsd: add a new config option for the block layout driver
Split the config symbols into a generic pNFS one, which is invisible
and gets selected by the layout drivers, and one for the block layout
driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-18 11:40:57 -04:00
Johannes Weiner
62cccb8c8e mm: simplify lock_page_memcg()
Now that migration doesn't clear page->mem_cgroup of live pages anymore,
it's safe to make lock_page_memcg() and the memcg stat functions take
pages, and spare the callers from memcg objects.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Johannes Weiner
81f8c3a461 mm: memcontrol: generalize locking for the page->mem_cgroup binding
These patches tag the page cache radix tree eviction entries with the
memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
work properly and be as adaptive to new cache workingsets as global
reclaim already is.

This should have been part of the original thrash detection patch
series, but was deferred due to the complexity of those patches.

This patch (of 5):

So far the only sites that needed to exclude charge migration to
stabilize page->mem_cgroup have been per-cgroup page statistics, hence
the name mem_cgroup_begin_page_stat().  But per-cgroup thrash detection
will add another site that needs to ensure page->mem_cgroup lifetime.

Rename these locking functions to the more generic lock_page_memcg() and
unlock_page_memcg().  Since charge migration is a cgroup1 feature only,
we might be able to delete it at some point, and these now easy to
identify locking sites along with it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Dave Chinner
2cdb958aba Merge branch 'xfs-misc-fixes-4.6-4' into for-next 2016-03-15 11:44:35 +11:00
Christoph Hellwig
355cced452 xfs: always set rvalp in xfs_dir2_node_trim_free
xfs_dir2_node_trim_free can return with setting the rvalp argument
pointer.  Initialize it to 0 at the beginning of the function and
only update it to 1 if we succeeded trimming a freespace block.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:44:18 +11:00
Eric Sandeen
cc07eed833 xfs: ensure committed is initialized in xfs_trans_roll
__xfs_trans_roll() can return without setting the
*committed argument; this was a problem for xfs_bmap_finish():

        int       committed;/* xact committed or not */
...
        error = __xfs_trans_roll(tp, ip, &committed);
        if (error) {
...
                if (committed) {

and we tested an uninitialized "committed" variable on the
error path.  No caller is preserving "committed" state across
calls to __xfs_trans_roll(), so just initialize committed inside
the function to avoid future errors like this.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:47 +11:00
Brian Foster
d34999c97a xfs: borrow indirect blocks from freed extent when available
xfs_bmap_del_extent() handles extent removal from the in-core and
on-disk extent lists. When removing a delalloc range, it updates the
indirect block reservation appropriately based on the removal. It
currently enforces that the new indirect block reservation is less than
or equal to the original. This is normally the case in all situations
except for in certain cases when the removed range creates a hole in a
single delalloc extent, thus splitting a single delalloc extent in two.

It is possible with small enough extents to split an indlen==1 extent
into two such slightly smaller extents. This leaves one extent with 0
indirect blocks and leads to assert failures in other areas (e.g.,
xfs_bunmapi() if the extent happens to be removed).

Update the indlen distribution code to steal blocks from the deleted
extent, if necessary, to satisfy the worst case total indirect
reservation for the new extents. This is safe as the caller does not
update the fdblocks counters until the extent is removed. Blocks stolen
in this manner simply remain accounted as allocated, having ownership
transferred from the data extent to an indirect reservation.

As a precaution, fall back to the original reservation algorithm if the
new indlen requirement is not met and warn if we end up with extents
without any reservation at all to detect this more easily in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:47 +11:00
Brian Foster
a9bd24ac2b xfs: refactor delalloc indlen reservation split into helper
The delayed allocation indirect reservation splitting code is not
sufficient in some cases where a delalloc extent is split in two. In
preparation for enhancements to this code, refactor the current indlen
distribution algorithm into a new helper function.

[dchinner: rename temp, temp2 variables]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:46 +11:00
Brian Foster
b2706a05ba xfs: update freeblocks counter after extent deletion
xfs_bunmapi() currently updates the fdblocks counter, unreserves quota,
etc. before the extent is deleted by xfs_bmap_del_extent(). The function
has problems dividing up the indirect reserved blocks for scenarios
where a single delalloc extent is split in two. Particularly, there
aren't always enough blocks reserved for multiple extents in a single
extent reservation.

The solution to this problem is to allow the extent removal code to
steal from the deleted extent to meet indirect reservation requirements.
Move the block of code in xfs_bmapi() that updates the fdblocks counter
to after the call to xfs_bmap_del_extent() to allow the codepath to
update the extent record before the free blocks are accounted. Also,
reshuffle the code slightly so the delalloc accounting occurs near the
xfs_bmap_del_extent() call to provide context for the comments.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:46 +11:00
Brian Foster
801cc4e17a xfs: debug mode forced buffered write failure
Add a DEBUG mode-only sysfs knob to enable forced buffered write
failure. An additional side effect of this mode is brute force killing
of delayed allocation blocks in the range of the write. The latter is
the prime motiviation behind this patch, as userspace test
infrastructure requires a reliable mechanism to create and split
delalloc extents without causing extent conversion.

Certain fallocate operations (i.e., zero range) were used for this in
the past, but the implementations have changed such that delalloc
extents are flushed and converted to real blocks, rendering the test
useless.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:44 +11:00
Linus Torvalds
2a62ec0af2 xfs: fixes for 4.5-rc7
Changes:
 
 o Only perform torn log write detection on dirty logs. This prevents
   failures being detected due to a clean filesystem being moved
   between machines or kernels of different architectures (e.g. 32
   -> 64 bit, BE -> LE, etc). This fixes a regression introduced by
   the torn log write detection in 4.5-rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW4fdHAAoJEK3oKUf0dfod/6EP/1Mi2K+z8t9FaevB0yiy+Yfs
 CzRe2Sim5EF67IFeh1CBChcJ4dpUtVxwn+vM6/tfOWM8jS0Oo1Chr5woRm2Xc1Ko
 O4xmLcoooIBeustVt12/3+lKR0ACY4tSq8V673wBp7tSFi4dj5cnpb2pDuQTio3q
 JCTFtHkG7s5d2XnDn0dYVdrm7/eKB1ZdQCaVxikVtqQvdwrnyZpo0Q5iu5/Ync4H
 ULOoMW1xrrJQ7bZcMq4uLM9GglUEB2/tPfT2jFtiUFaNo+420B7FzZR9e6P9giBV
 JB/t02uiqicN0+WN9xyu+ohYMtjUZ2wrysLaX8P9szy/Rmsn7gOUYs946KUhullD
 D5JFzB/IUrLnIhfY4il8bK6NoTLPCj9DlktaA7GikA7QAyZFLrRr3b1r/XbR2lDB
 8Sy3ij7yKh2fhThOk4D6fxyVkSgKpr9E2gz6LSl45imbrj69IjXCJwadD1i7yB8j
 VJj+Vr54DcjxFR0SnCrpGSG2i7fgkGk+8woIyVkPczPMpVlmQrpnmBbD0+fn4d31
 aRX4aDmv7OsT+OKEoy9Hu3wRmfUZSmaRmp+2QdJ0dT98LEFoUCmhsaiJLL+nVgv0
 tsApndnvAFxWHZZ9w5VPnJ/99YIvWpb3zzn6mKD3XfN/2Mf4sMcN2JTzxLgEdU9D
 2JX+S1/AUMZfL0Ghaww8
 =NDeH
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "This is a fix for a regression introduced in 4.5-rc1 by the new torn
  log write detection code.  The regression only affects people moving a
  clean filesystem between machines/kernels of different architecture
  (such as changing between 32 bit and 64 bit kernels), but this is the
  recommended (and only!) safe way to migrate a filesystem between
  architectures so we really need to ensure it works.

  The changes are larger than I'd prefer right at the end of the release
  cycle, but the majority of the change is just factoring code to enable
  the detection of a clean log at the correct time to avoid this issue.

  Changes:

   - Only perform torn log write detection on dirty logs.  This prevents
     failures being detected due to a clean filesystem being moved
     between machines or kernels of different architectures (e.g.  32 ->
     64 bit, BE -> LE, etc).  This fixes a regression introduced by the
     torn log write detection in 4.5-rc1"

* tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: only run torn log write detection on dirty logs
  xfs: refactor in-core log state update to helper
  xfs: refactor unmount record detection into helper
  xfs: separate log head record discovery from verification
2016-03-11 10:21:32 -08:00
Dave Chinner
ab9d1e4f7b Merge branch 'xfs-misc-fixes-4.6-3' into for-next 2016-03-09 08:18:30 +11:00
Luis de Bethencourt
a5fd276bdc xfs: remove impossible condition
bp_release is set to 0 just before the breakpoint of the for loop before
the conditional check (in line 458). The other breakpoint is a goto that
skips the dead code.

Addresses-Coverity-Id: 102338

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:17:56 +11:00
Darrick J. Wong
30cbc591c3 xfs: check sizes of XFS on-disk structures at compile time
Check the sizes of XFS on-disk structures when compiling the kernel.
Use this to catch inadvertent changes in structure size due to padding
and alignment issues, etc.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:15:14 +11:00
Dave Chinner
3c1a79f5ff Merge branch 'xfs-misc-fixes-4.6-2' into for-next 2016-03-07 09:34:54 +11:00
Dave Chinner
85a9f38d38 Merge branch 'xfs-dax-fixes-4.6' into for-next 2016-03-07 09:34:31 +11:00
Dave Chinner
3d93ec0364 Merge branch 'xfs-writepage-rework-4.6' into for-next 2016-03-07 09:34:02 +11:00
Darrick J. Wong
0df61da8ac xfs: ioends require logically contiguous file offsets
We need to create a new ioend if the current writepage call isn't
logically contiguous with the range contained in the previous ioend.
Hopefully writepage gets called in order of increasing file offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 09:32:14 +11:00
Dave Chinner
7f0ed5461a Merge branch 'xfs-buf-macro-cleanup-4.6' into for-next 2016-03-07 09:31:00 +11:00
Dave Chinner
a2bbcb60ff Merge branch 'xfs-gut-icdinode-4.6' into for-next 2016-03-07 09:30:32 +11:00
Dave Chinner
6d247d47fb Merge branch 'xfs-misc-fixes-4.6' into for-next 2016-03-07 09:30:12 +11:00
Dave Chinner
acb3e26fc3 Merge branch 'xfs-dio-fix-4.6' into for-next 2016-03-07 09:29:48 +11:00
Dave Chinner
1b186d25b0 Merge branch 'xfs-get-next-dquot-4.6' into for-next 2016-03-07 09:29:25 +11:00
Dave Chinner
c53473be45 Merge branch 'xfs-rt-fixes-4.6' into for-next 2016-03-07 09:29:04 +11:00
Dave Chinner
9deed09554 Merge branch 'xfs-torn-log-fixes-4.5' into for-next 2016-03-07 09:28:36 +11:00
Darrick J. Wong
5110cd82ca xfs: use named array initializers for log item dumping
Use named array initializers for the string arrays used to dump log
items, rather than depending on the order being maintained correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:40:03 +11:00
Darrick J. Wong
49ca9118e6 xfs: fix computation of inode btree maxlevels
Commit 88740da18[1] introduced a function to compute the maximum
height of the inode btree back in 1994.  Back then, apparently, the
freespace and inode btrees shared the same geometry; however, it has
long since been the case that the inode and freespace btrees have
different record and key sizes.  Therefore, we must use m_inobt_mnr if
we want a correct calculation/log reservation/etc.

(Yes, this bug has been around for 21 years and ten months.)

(Yes, I was in middle school when this bug was committed.)

[1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd

Historical-research-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:39:56 +11:00
Dave Chinner
a798011c8f xfs: reinitialise per-AG structures if geometry changes during recovery
If a crash occurs immediately after a filesystem grow operation, the
updated superblock geometry is found only in the log. After we
recover the log, the superblock is reread and re-initialised and so
has the new geometry in memory. If the new geometry has more AGs
than prior to the grow operation, then the new AGs will not have
in-memory xfs_perag structurea associated with them.

This will result in an oops when the first metadata buffer from a
new AG is looked up in the buffer cache, as the block lies within
the new geometry but then fails to find a perag structure on lookup.
This is easily fixed by simply re-initialising the perag structure
after re-reading the superblock at the conclusion of the first pahse
of log recovery.

This, however, does not fix the case of log recovery requiring
access to metadata in the newly grown space. Fortunately for us,
because the in-core superblock has not been updated, this will
result in detection of access beyond the end of the filesystem
and so recovery will fail at that point. If this proves to be
a problem, then we can address it separately to the current
reported issue.

Reported-by: Alex Lyakas <alex@zadarastorage.com>
Tested-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-03-07 08:39:36 +11:00
Brian Foster
7f6aff3a29 xfs: only run torn log write detection on dirty logs
XFS uses CRC verification over a sub-range of the head of the log to
detect and handle torn writes. This torn log write detection currently
runs unconditionally at mount time, regardless of whether the log is
dirty or clean. This is problematic in cases where a filesystem might
end up being moved across different, incompatible (i.e., opposite
byte-endianness) architectures.

The problem lies in the fact that log data is not necessarily written in
an architecture independent format. For example, certain bits of data
are written in native endian format. Further, the size of certain log
data structures differs (i.e., struct xlog_rec_header) depending on the
word size of the cpu. This leads to false positive crc verification
errors and ultimately failed mounts when a cleanly unmounted filesystem
is mounted on a system with an incompatible architecture from data that
was written near the head of the log.

Update the log head/tail discovery code to run torn write detection only
when the log is not clean. This means something other than an unmount
record resides at the head of the log and log recovery is imminent. It
is a requirement to run log recovery on the same type of host that had
written the content of the dirty log and therefore CRC failures are
legitimate corruptions in that scenario.

Reported-by: Jan Beulich <JBeulich@suse.com>
Tested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
717bc0ebca xfs: refactor in-core log state update to helper
Once the record at the head of the log is identified and verified, the
in-core log state is updated based on the record. This includes
information such as the current head block and cycle, the start block of
the last record written to the log, the tail lsn, etc.

Once torn write detection is conditional, this logic will need to be
reused. Factor the code to update the in-core log data structures into a
new helper function. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
65b99a08b3 xfs: refactor unmount record detection into helper
Once the mount sequence has identified the head and tail blocks of the
physical log, the record at the head of the log is located and examined
for an unmount record to determine if the log is clean. This currently
occurs after torn write verification of the head region of the log.

This must ultimately be separated from torn write verification and may
need to be called again if the log head is walked back due to a torn
write (to determine whether the new head record is an unmount record).
Separate this logic into a new helper function. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
82ff6cc26e xfs: separate log head record discovery from verification
The code that locates the log record at the head of the log is buried in
the log head verification function. This is fine when torn write
verification occurs unconditionally, but this behavior is problematic
for filesystems that might be moved across systems with different
architectures.

In preparation for separating examination of the log head for unmount
records from torn write detection, lift the record location logic out of
the log verification function and into the caller. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Christoph Hellwig
a7e5d03ba8 xfs: remove xfs_trans_get_block_res
Just use the t_blk_res field directly instead of obsfucating the reference
by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:21 +11:00
Eric Sandeen
12c3f05c7b xfs: fix up inode32/64 (re)mount handling
inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.

The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist.  Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.

Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size.  Today that does not happen.

This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:09 +11:00
Colin Ian King
5d518bd6ce xfs: fix format specifier , should be %llx and not %llu
busyp->bno is printed with a %llu format specifier when the
intention is to print a hexadecimal value. Trivial fix to
use %llx instead.  Found with smatch static analysis:

fs/xfs/xfs_discard.c:229 xfs_discard_extents() warn: '0x'
  prefix is confusing together with '%llu' specifier

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:57:04 +11:00
Eric Sandeen
a08ee40a79 xfs: sanitize remount options
Perform basic sanitization of remount options by
passing the option string and a dummy mount structure
through xfs_parseargs and returning the result.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:56:31 +11:00
Eric Sandeen
2e74af0e11 xfs: convert mount option parsing to tokens
This should be a no-op change, just switch to token parsing
like every other respectable filesystem does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:55:38 +11:00
Mateusz Guzik
2e83b79b2d xfs: fix two memory leaks in xfs_attr_list.c error paths
This plugs 2 trivial leaks in xfs_attr_shortform_list and
xfs_attr3_leaf_list_int.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:51:09 +11:00
Dave Chinner
6448543735 xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
If the block size of a filesystem is not at least PAGE_SIZEd, then
at this point in time DAX cannot be used due to the fact we can't
guarantee extents are page sized or aligned without further work.
Hence disallow setting the DAX flag on an inode if the block size is
too small. Also, be defensive and check the block size when reading
an inode in off disk.

In future, we want to allow DAX to work on any filesystem, so this
is temporary while we sort of the correct conbination of extent size
hints and allocation alignment configurations needed to guarantee
page sized and aligned extent allocation for DAX enabled files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
3a6a854a82 xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
When we set or clear the XFS_DIFLAG2_DAX flag, we should also
set/clear the S_DAX flag in the VFS inode. To do this, we need to
ensure that we first flush and remove any cached entries in the
radix tree to ensure the correct data access method is used when we
next try to read or write data. We ahve to be especially careful
here to lock out page faults so they don't race with the flush and
invalidation before we change the access mode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
db10c697b4 xfs: S_DAX is only for regular files
Only regular files can use DAX for data operations, so we should
restrict setting it on the VFS inode to regular files. Setting it on
metadata inodes may cause the VFS to do the wrong thing for such
inodes, so avoid potential problems by restricting the scope of the
flag to what we know is supported.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
e889752905 xfs: XFS_DIFLAG_DAX is only for regular files or directories
Only file data can use DAX, so we should onyl be able to set this
flag on regular files. However, the flag also serves as an "inherit"
flag at file create time when set on directories, so limit the
FS_IOC_FSSETXATTR ioctl to only set this flag on regular files and
directories.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Ross Zwisler
7f6d5b529b dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
20a90f5899 dax: give DAX clearing code correct bdev
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases.  This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block.  This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Dave Chinner
e10de3723c xfs: don't chain ioends during writepage submission
Currently we can build a long ioend chain during ->writepages that
gets attached to the writepage context. IO submission only then
occurs when we finish all the writepage processing. This means we
can have many ioends allocated and pending, and this violates the
mempool guarantees that we need to give about forwards progress.
i.e. we really should only have one ioend being built at a time,
otherwise we may drain the mempool trying to allocate a new ioend
and that blocks submission, completion and freeing of ioends that
are already in progress.

To prevent this situation from happening, we need to submit ioends
for IO as soon as they are ready for dispatch rather than queuing
them for later submission. This means the ioends have bios built
immediately and they get queued on any plug that is current active.
Hence if we schedule away from writeback, the ioends that have been
built will make forwards progress due to the plug flushing on
context switch. This will also prevent context switches from
creating unnecessary IO submission latency.

We can't completely avoid having nested IO allocation - when we have
a block size smaller than a page size, we still need to hold the
ioend submission until after we have marked the current page dirty.
Hence we may need multiple ioends to be held while the current page
is completely mapped and made ready for IO dispatch. We cannot avoid
this problem - the current code already has this ioend chaining
within a page so we can mostly ignore that it occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:23:12 +11:00
Dave Chinner
bfce7d2e2d xfs: factor mapping out of xfs_do_writepage
Separate out the bufferhead based mapping from the writepage code so
that we have a clear separation of the page operations and the
bufferhead state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:37 +11:00
Dave Chinner
ad68972acb xfs: xfs_cluster_write is redundant
xfs_cluster_write() is not necessary now that xfs_vm_writepages()
aggregates writepage calls across a single mapping. This means we no
longer need to do page lookups in xfs_cluster_write, so writeback
only needs to look up th epage cache once per page being written.
This also removes a large amount of mostly duplicate code between
xfs_do_writepage() and xfs_convert_page().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:31 +11:00
Dave Chinner
fbcc025613 xfs: Introduce writeback context for writepages
xfs_vm_writepages() calls generic_writepages to writeback a range of
a file, but then xfs_vm_writepage() clusters pages itself as it does
not have any context it can pass between->writepage calls from
__write_cache_pages().

Introduce a writeback context for xfs_vm_writepages() and call
__write_cache_pages directly with our own writepage callback so that
we can pass that context to each writepage invocation. This
encapsulates the current mapping, whether it is valid or not, the
current ioend and it's IO type and the ioend chain being built.

This requires us to move the ioend submission up to the level where
the writepage context is declared. This does mean we do not submit
IO until we packaged the entire writeback range, but with the block
plugging in the writepages call this is the way IO is submitted,
anyway.

It also means that we need to handle discontiguous page ranges.  If
the pages sent down by write_cache_pages to the writepage callback
are discontiguous, we need to detect this and put each discontiguous
page range into individual ioends. This is needed to ensure that the
ioend accurately represents the range of the file that it covers so
that file size updates during IO completion set the size correctly.
Failure to take into account the discontiguous ranges results in
files being too small when writeback patterns are non-sequential.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:19 +11:00
Dave Chinner
150d5be09c xfs: remove xfs_cancel_ioend
We currently have code to cancel ioends being built because we
change bufferhead state as we build the ioend. On error, this needs
to be unwound and so we have cancelling code that walks the buffers
on the ioend chain and undoes these state changes.

However, the IO submission path already handles state changes for
buffers when a submission error occurs, so we don't really need a
separate cancel function to do this - we can simply submit the
ioend chain with the specific error and it will be cancelled rather
than submitted.

Hence we can remove the explicit cancel code and just rely on
submission to deal with the error correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:12 +11:00
Dave Chinner
988ef92792 xfs: remove nonblocking mode from xfs_vm_writepage
Remove the nonblocking optimisation done for mapping lookups during
writeback. It's not clear that leaving a hole in the writeback range
just because we couldn't get a lock is really a win, as it makes us
do another small random IO later on rather than a large sequential
IO now.

As this gets in the way of sane error handling later on, just remove
for the moment and we can re-introduce an equivalent optimisation in
future if we see problems due to extent map lock contention.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:20:50 +11:00
Dave Chinner
12877da584 xfs: remove XFS_BUF_ZEROFLAGS macro
The places where we use this macro already clear unnecessary IO
flags (e.g. through xfs_bwrite()) or never have unexpected IO flags
set on them in the first place (e.g. iclog buffers). Remove the
macro from these locations, and where necessary clear only the
specific flags that are conditional in the current buffer context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:30 +11:00
Dave Chinner
5cfd28b6ab xfs: remove XBF_STALE flag wrapper macros
They only set/clear/check a flag, no need for obfuscating this
with a macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:11 +11:00
Dave Chinner
b68c08219a xfs: remove XBF_WRITE flag wrapper macros
They only set/clear/check a flag, no need for obfuscating this
with a macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:11 +11:00
Dave Chinner
0cac682ff6 xfs: remove XBF_READ flag wrapper macros
They only set/clear/check a flag, no need for obfuscating this
with a macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:11 +11:00
Dave Chinner
1157b32c73 xfs: remove XBF_ASYNC flag wrapper macros
They only set/clear/check a flag, no need for obfuscating this
with a macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:11 +11:00
Dave Chinner
b0388bf108 xfs: remove XBF_DONE flag wrapper macros
They only set/clear/check a flag, no need for obfuscating this
with a macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-10 15:01:11 +11:00
Dave Chinner
c19b3b05ae xfs: mode di_mode to vfs inode
Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
in the structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
83e06f21b4 xfs: move di_changecount to VFS inode
We can store the di_changecount in the i_version field of the VFS
inode and remove another 8 bytes from the xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
9e9a2674e4 xfs: move inode generation count to VFS inode
Pull another 4 bytes out of the xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
54d7b5c1d0 xfs: use vfs inode nlink field everywhere
The VFS tracks the inode nlink just like the xfs_icdinode. We can
remove the variable from the icdinode and use the VFS inode variable
everywhere, reducing the size of the xfs_icdinode by a further 4
bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
50997470ef xfs: reinitialise recycled VFS inode correctly
We are going to keep certain on-disk information in the VFS inode
rather than in a separate XFS specific stucture, so we have to be
careful of the VFS code clearing that information when we
re-initialise reclaimable cached inodes during lookup. If we don't
do this, then we lose critical information from the inode and that
results in corruption being detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
faeb4e4715 xfs: move v1 inode conversion to xfs_inode_from_disk
So we don't have to carry an di_onlink variable around anymore, move
the inode conversion from v1 inode format to v2 inode format into
xfs_inode_from_disk(). This means we can remove the di_onlink fields
from the struct xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
93f958f9c4 xfs: cull unnecessary icdinode fields
Now that the struct xfs_icdinode is not directly related to the
on-disk format, we can cull things in it we really don't need to
store:

	- magic number never changes
	- padding is not necessary
	- next_unlinked is never used
	- inode number is redundant
	- uuid is redundant
	- lsn is accessed directly from dinode
	- inode CRC is only accessed directly from dinode

Hence we can remove these from the struct xfs_icdinode and redirect
the code that uses them to the xfs_dinode appripriately.  This
reduces the size of the struct icdinode from 152 bytes to 88 bytes,
and removes a fair chunk of unnecessary code, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
3987848c7c xfs: remove timestamps from incore inode
The struct xfs_inode has two copies of the current timestamps in it,
one in the vfs inode and one in the struct xfs_icdinode. Now that we
no longer log the struct xfs_icdinode directly, we don't need to
keep the timestamps in this structure. instead we can copy them
straight out of the VFS inode when formatting the inode log item or
the on-disk inode.

This reduces the struct xfs_inode in size by 24 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
f8d55aa052 xfs: introduce inode log format object
We currently carry around and log an entire inode core in the
struct xfs_inode. A lot of the information in the inode core is
duplicated in the VFS inode, but we cannot remove this duplication
of infomration because the inode core is logged directly in
xfs_inode_item_format().

Add a new function xfs_inode_item_format_core() that copies the
inode core data into a struct xfs_icdinode that is pulled directly
from the log vector buffer. This means we no longer directly
copy the inode core, but copy the structures one member at a time.
This will be slightly less efficient than copying, but will allow us
to remove duplicate and unnecessary items from the struct xfs_inode.

To enable us to do this, call the new structure a xfs_log_dinode,
so that we know it's different to the physical xfs_dinode and the
in-core xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Dave Chinner
bf85e0998a xfs: RT bitmap and summary buffers need verifiers
Buffers without verifiers issue runtime warnings on XFS. We don't
have anything we can actually verify in the RT buffers (no CRCs, not
magic numbers, etc), but we still need verifiers to avoid the
warnings.

Add a set of dummy verifier operations for the realtime buffers and
apply them in the appropriate places.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:41:45 +11:00
Dave Chinner
f67ca6eca8 xfs: RT bitmap and summary buffers are not typed
When logging buffers, we attach a type to them that follows the
buffer all the way into the log and is used to identify the buffer
contents in log recovery. Both the realtime summary buffers and the
bitmap buffers do not have types defined or set, so when we try to
log them we see assert failure:

XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 294

Fix this by adding buffer log format types for these buffers, and
add identification support into log recovery for them. Only build the log
recovery support if CONFIG_XFS_RT=y - we can't get into log recovery for real
time filesystems if support is not built into the kernel, and this avoids
potential build problems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:41:31 +11:00
Brian Foster
af055e37a9 xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown
If the filesystem has shut down, xfs_end_io() currently sets an
error on the ioend and proceeds to ioend destruction. The ioend
might contain a truncate transaction if the I/O extended the size of
the file. This transaction is only cleaned up in
xfs_setfilesize_ioend(), however, which is skipped in this case.
This results in an xfs_log_ticket leak message when the associate
cache slab is destroyed (e.g., on rmmod).

This was originally reproduced by xfs/141 on a distro kernel. The
problem is reproducible on an upstream kernel, but not easily
detected in current upstream if the xfs_log_ticket cache happens to
be merged with another cache. This can be reproduced more
deterministically with the 'slab_nomerge' kernel boot option.

Update xfs_end_io() to proceed with normal end I/O processing after
an error is set on an ioend due to fs shutdown. The I/O type-based
processing is already designed to handle an I/O error and ensure
that the ioend is cleaned up correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 15:00:02 +11:00
Brian Foster
60630fe66e xfs: clean up unwritten buffers on write failure
The xfs_vm_write_failed() handler is currently responsible for cleaning
up any delalloc blocks over the range of a failed write beyond EOF.
Failure to do so results in warning messages and other inconsistencies
between buffer and extent state. The ->releasepage() handler currently
warns in the event of a page being released with either unwritten or
delalloc buffers, as neither is ever expected by the time a page is
released.

As has been reproduced by generic/083 on a -bsize=1k fs, it is currently
possible to trigger the ->releasepage() warning for a page with
unwritten buffers when a filesystem is near ENOSPC. This is reproduced
by the following sequence:

  $ mkfs.xfs -f -b size=1k -d size=100m <dev>
  $ mount <dev> /mnt/
  $
  $ xfs_io -fc "falloc -k 0 1k" /mnt/file
  $ dd if=/dev/zero of=/mnt/enospc conv=notrunc oflag=append
  $
  $ xfs_io -c "pwrite 512 1k" /mnt/file
  $ xfs_io -d -c "pwrite 16k 1k" /mnt/file

The first pwrite command attempts a block unaligned write across an
unwritten block and a hole. The delalloc for the hole fails with ENOSPC
and the subsequent error handling does not clean up the unwritten buffer
that was instantiated during the first ->get_block() call.

The second pwrite triggers a warning as part of the inode mapping
invalidation that occurs prior to direct I/O. The releasepage() handler
detects the unwritten buffer at this time, warns and prevents the
release of the page.

To deal with this problem, update xfs_vm_write_failed() to clean up
unwritten as well as delalloc buffers that are beyond EOF and within the
range of the failed write.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 15:00:02 +11:00
Darrick J. Wong
244efeafb6 xfs: move struct xfs_attr_shortform to xfs_da_format.h
Move the shortform attr structure definition to the same place as the
other attribute structure definitions for consistency and also so that
xfs/122 verifies the structure size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 15:00:01 +11:00
Michal Hocko
18f1df4e00 xfs: Make xfsaild freezeable again
Hendik has reported suspend failures due to xfsaild blocking the freezer
to settle down.
Jan 17 19:59:56 linux-6380 kernel: PM: Syncing filesystems ... done.
Jan 17 19:59:56 linux-6380 kernel: PM: Preparing system for sleep (mem)
Jan 17 19:59:56 linux-6380 kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Jan 17 19:59:56 linux-6380 kernel: Freezing remaining freezable tasks ...
Jan 17 19:59:56 linux-6380 kernel: Freezing of tasks failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):
Jan 17 19:59:56 linux-6380 kernel: xfsaild/dm-5    S 00000000     0  1293      2 0x00000080
Jan 17 19:59:56 linux-6380 kernel:  f0ef5f00 00000046 00000200 00000000 ffff9022 c02d3800 00000000 00000032
Jan 17 19:59:56 linux-6380 kernel:  ee0b2400 00000032 f71e0d00 f36fabc0 f0ef2d00 f0ef6000 f0ef2d00 f12f90c0
Jan 17 19:59:56 linux-6380 kernel:  f0ef5f0c c0844e44 00000000 f0ef5f6c f811e0be 00000000 00000000 f0ef2d00
Jan 17 19:59:56 linux-6380 kernel: Call Trace:
Jan 17 19:59:56 linux-6380 kernel:  [<c0844e44>] schedule+0x34/0x90
Jan 17 19:59:56 linux-6380 kernel:  [<f811e0be>] xfsaild+0x5de/0x600 [xfs]
Jan 17 19:59:56 linux-6380 kernel:  [<c0286cbb>] kthread+0x9b/0xb0
Jan 17 19:59:56 linux-6380 kernel:  [<c0848a79>] ret_from_kernel_thread+0x21/0x38

The issue has been there for quite some time but it has been made
visible by only by 24ba16bb3d ("xfs: clear PF_NOFREEZE for xfsaild
kthread") because the suspend started seeing xfsaild.

The above commit has missed that the !xfs_ail_min branch might call
schedule with TASK_INTERRUPTIBLE without calling try_to_freeze so the pm
suspend would wake up the kernel thread over and over again without any
progress. What we want here is to use freezable_schedule instead to hide
the thread from the suspend.

While we are here also change schedule_timeout to freezable variant to
prevent from spurious wakeups by suspend.

[dchinner: re-add set_freezeable call so the freezer will account properly
 for this kthread. ]

Reported-by: Hendrik Woltersdorf <hendrikw@arcor.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:59:07 +11:00
Eric Sandeen
de0b85a8cf xfs: remove unused function definitions
Old leftovers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:58:07 +11:00
Christoph Hellwig
edfd9dd549 xfs: move buffer invalidation to xfs_btree_free_block
... instead of leaving it in the methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:58:07 +11:00
Christoph Hellwig
c46ee8ad78 xfs: factor btree block freeing into a helper
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:58:07 +11:00
Christoph Hellwig
196328ec97 xfs: handle errors from ->free_blocks in xfs_btree_kill_iroot
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:58:07 +11:00
Christoph Hellwig
c19b104a67 xfs: fold xfs_vm_do_dio into xfs_vm_direct_IO
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
Christoph Hellwig
273dda76f7 xfs: don't use ioends for direct write completions
We only need to communicate two bits of information to the direct I/O
completion handler:

 (1) do we need to convert any unwritten extents in the range
 (2) do we need to check if we need to update the inode size based
     on the range passed to the completion handler

We can use the private data passed to the get_block handler and the
completion handler as a simple bitmask to communicate this information
instead of the current complicated infrastructure reusing the ioends
from the buffer I/O path, and thus avoiding a memory allocation and
a context switch for any non-trivial direct write.  As a nice side
effect we also decouple the direct I/O path implementation from that
of the buffered I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2016-02-08 14:40:51 +11:00
Christoph Hellwig
187372a3b9 direct-io: always call ->end_io if non-NULL
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
Carlos Maiolino
be6079461a xfs: Split default quota limits by quota type
Default quotas are globally set due historical reasons. IRIX only
supported user and project quotas, and default quota was only
applied to user quotas.

In Linux, when a default quota is set, all different quota types
inherits the same default value.

An user with a quota limit larger than the default quota value, will
still be limited to the default value because the group quotas also
inherits the default quotas. Unless the group which the user belongs
to have a custom quota limit set.

This patch aims to split the default quota value by quota type.
Allowing each quota type having different default values.

Default time limits are still set globally. XFS does not set a
per-user/group timer, but a single global timer. For changing this
behavior, some changes should be made in user-space tools another
bugs being fixed.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:27:55 +11:00
Eric Sandeen
296c24e26e xfs: wire up Q_XGETNEXTQUOTA / get_nextdqblk
Add code to allow the Q_XGETNEXTQUOTA quotactl to quickly find
all active quotas by examining the quota inode, and skipping
over unallocated or uninitialized regions.

Userspace can then use this interface rather than i.e. a
getpwent() loop when asked to report all active quotas.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:27:38 +11:00
Eric Sandeen
8aa7d37ebf xfs: Factor xfs_seek_hole_data into helper
Factor xfs_seek_hole_data into an unlocked helper which takes
an xfs inode rather than a file for internal use.

Also allow specification of "end" - the vfs lseek interface is
defined such that any offset past eof/i_size shall return -ENXIO,
but we will use this for quota code which does not maintain i_size,
and we want to be able to SEEK_DATA past i_size as well.  So the
lseek path can send in i_size, and the quota code can determine
its own ending offset.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:25:16 +11:00
Eric Sandeen
4d4d9523b4 xfs: get quota inode from mp & flags rather than dqp
Allow us to get the appropriate quota inode from any
mp & quota flags, not necessarily associated with a
particular dqp.  Needed for when we are searching for
the next active ID with quotas and we want to examine
the quota inode.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:23:23 +11:00
Eric Sandeen
a484bcdd13 xfs: don't overflow quota ID when initializing dqblk
Quota IDs are unsigned, and so we can pass in values up
to 2^32-1.  But if we try to initialize a block containing
values over MAX_INT, curid will overflow and assert.

curid holds a quota ID, so give it the proper
xfs_dqid_t type (and remove the now-impossible ASSERT).

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:22:58 +11:00
Darrick J. Wong
8e0bd4925b xfs: fix endianness error when checking log block crc on big endian platforms
Since the checksum function and the field are both __le32, don't
perform endian conversion when comparing the two.  This fixes mount
failures on ppc64.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:03:58 +11:00
Dave Chinner
4b680afb42 xfs: lock rt summary inode on allocation
RT allocation can fail on a debug kernel with:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 4039

When modifying the summary inode during allocation. This occurs
because the summary inode is never locked, and xfs_bmapi_*
operations expect it to be locked. The summary inode is effectively
protected byt he lock on the bitmap inode, so this really is only a
debug kernel issue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 10:46:51 +11:00
Linus Torvalds
cc673757e2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull final vfs updates from Al Viro:

 - The ->i_mutex wrappers (with small prereq in lustre)

 - a fix for too early freeing of symlink bodies on shmem (they need to
   be RCU-delayed) (-stable fodder)

 - followup to dedupe stuff merged this cycle

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: abort dedupe loop if fatal signals are pending
  make sure that freeing shmem fast symlinks is RCU-delayed
  wrappers for ->i_mutex access
  lustre: remove unused declaration
2016-01-23 12:24:56 -08:00
Ross Zwisler
5eb88dca9c xfs: call dax_pfn_mkwrite() for DAX fsync/msync
To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22 17:02:18 -08:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
d5ffdf8b4a xfs: Update 2 for 4.5-rc1
This update contains:
 
 o promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
   it can be shared with other filesystems. The ext4 project quota
   functionality is the first target for this. The commits in this
   series have not been updated with review or final SOB tags because
   the branch they were originally published in was needed by ext4.
   Those tags are:
 
   Reviewed-by: Theodore Ts'o <tytso@mit.edu>
   Signed-off-by: Dave Chinner <david@fromrobit.com>
 
 o Revert a change that is causing suspend failures.
 o Fix a use-after-free that can occur on log mount failures. Been
   around forever, but now exposed by other changes to log recovery
   made in the first 4.5 merge.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJWoV0hAAoJEK3oKUf0dfodSCQP/RXlZp6TQhv2DQ2MW4AeZRzs
 kzp3zvWUN1udB0fgAARMUDbHHeqEp5gUB6Fj8GOjgh69VGac1pjR2GOvEA9UbnhL
 uLQwaRggVB/BJV6+hDUw283kENXE1H8JcDiEIFratwdiZ6KrhniMptzrbUnG22LO
 cBLzHOCFI0x4ib2fdTvrVV8bNaAaLYViYUxuVwzblzhoODN4Nmv5HZ5BlMHDFJsd
 E47Yw/0tdYFVRDuujN22ylYsKsySXBxPaWyUvDDlW/ryeKSfwn3V8Y7BSDZU4vUZ
 CFstsqlzEySGrNNCfor5bFn9EO3i882M+DU60UhZAKRgvAzANAsxjJ97B8Of5KA+
 /0OQarl0ZNJ93g6mZJ2bhuVpRCIGWJ3rBl9+GK8JdtsjF0mPOvrusKTQKoz1frK7
 B8h52P+jxfqrrqeqpNigMWfDKYkXCfUUMAJm57+QILAoTNRupAzgFyXZnSgAermE
 jaDfvnkaSZxfaLtTOlkkpGukhbFubhAWTk3TksVxICPXztZelQLmmbqjZnTYFCT/
 dKieKbwop58DBTycFuzCrWiSjXjodAq/+IfpAQcvJ5xZPLtgfjHxQaHD6zsOVKzQ
 lWosgYOnIaN/PYPOpAzo0sRDf80d5KFjwcdSjrWZVZ5lGfAsx8iYErh3v0Xv3rkE
 YuKQw2AjVVtD64SfHvIn
 =wEy8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull more xfs updates from Dave Chinner:
 "This is the second update for XFS that I mentioned in the original
  pull request last week.

  It contains a revert for a suspend regression in 4.4 and a fix for a
  long standing log recovery issue that has been further exposed by all
  the log recovery changes made in the original 4.5 merge.

  There is one more thing in this pull request - one that I forgot to
  merge into the origin.  That is, pulling the XFS_IOC_FS[GS]ETXATTR
  ioctl up to the VFS level so that other filesystems can also use it
  for modifying project quota IDs

  Summary:

   - promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
     it can be shared with other filesystems.  The ext4 project quota
     functionality is the first target for this.  The commits in this
     series have not been updated with review or final SOB tags because
     the branch they were originally published in was needed by ext4.
     Those tags are:

        Reviewed-by: Theodore Ts'o <tytso@mit.edu>
        Signed-off-by: Dave Chinner <david@fromrobit.com>

   - Revert a change that is causing suspend failures.

   - Fix a use-after-free that can occur on log mount failures.  Been
     around forever, but now exposed by other changes to log recovery
     made in the first 4.5 merge"

* tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: log mount failures don't wait for buffers to be released
  Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
  xfs: introduce per-inode DAX enablement
  xfs: use FS_XFLAG definitions directly
  fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion
2016-01-22 10:54:13 -08:00
Dave Chinner
ee3804d9f9 Merge branch 'xfs-misc-fixes-for-4.5-3' into for-next 2016-01-19 08:28:36 +11:00
Dave Chinner
85bec5460a xfs: log mount failures don't wait for buffers to be released
Recently I've been seeing xfs/051 fail on 1k block size filesystems.
Trying to trace the events during the test lead to the problem going
away, indicating that it was a race condition that lead to this
ASSERT failure:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
.....
[<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
[<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
[<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
[<ffffffff811d8800>] mount_bdev+0x180/0x1b0
[<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
[<ffffffff811d90a8>] mount_fs+0x38/0x170
[<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
[<ffffffff811f7018>] do_mount+0x218/0xd60
[<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0

When I finally caught it with tracing enabled, I saw that AG 2 had
an elevated reference count and a buffer was responsible for it. I
tracked down the specific buffer, and found that it was missing the
final reference count release that would put it back on the LRU and
hence be found by xfs_wait_buftarg() calls in the log mount failure
handling.

The last four traces for the buffer before the assert were (trimmed
for relevance)

kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC

This is an async write that is completing, so there's nobody waiting
for it directly.  Hence we call xfs_buf_relse() once all the
processing is complete. That does:

static inline void xfs_buf_relse(xfs_buf_t *bp)
{
	xfs_buf_unlock(bp);
	xfs_buf_rele(bp);
}

Now, it's clear that mount is waiting on the buffer lock, and that
it has been released by xfs_buf_relse() and gained by mount. This is
expected, because at this point the mount process is in
xfs_buf_delwri_submit() waiting for all the IO it submitted to
complete.

The mount process, however, is waiting on the lock for the buffer
because it is in xfs_buf_delwri_submit(). This waits for IO
completion, but it doesn't wait for the buffer reference owned by
the IO to go away. The mount process collects all the completions,
fails the log recovery, and the higher level code then calls
xfs_wait_buftarg() to free all the remaining buffers in the
filesystem.

The issue is that on unlocking the buffer, the scheduler has decided
that the mount process has higher priority than the the kworker
thread that is running the IO completion, and so immediately
switched contexts to the mount process from the semaphore unlock
code, hence preventing the kworker thread from finishing the IO
completion and releasing the IO reference to the buffer.

Hence by the time that xfs_wait_buftarg() is run, the buffer still
has an active reference and so isn't on the LRU list that the
function walks to free the remaining buffers. Hence we miss that
buffer and continue onwards to tear down the mount structures,
at which time we get find a stray reference count on the perag
structure. On a non-debug kernel, this will be ignored and the
structure torn down and freed. Hence when the kworker thread is then
rescheduled and the buffer released and freed, it will access a
freed perag structure.

The problem here is that when the log mount fails, we still need to
quiesce the log to ensure that the IO workqueues have returned to
idle before we run xfs_wait_buftarg(). By synchronising the
workqueues, we ensure that all IO completions are fully processed,
not just to the point where buffers have been unlocked. This ensures
we don't end up in the situation above.

cc: <stable@vger.kernel.org> # 3.18
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-19 08:28:10 +11:00
Dave Chinner
3e85286e75 Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
This reverts commit 24ba16bb3d as it
prevents machines from suspending. This regression occurs when the
xfsaild is idle on entry to suspend, and so there s no activity to
wake it from it's idle sleep and hence see that it is supposed to
freeze. Hence the freezer times out waiting for it and suspend is
cancelled.

There is no obvious fix for this short of freezing the filesystem
properly, so revert this change for now.

cc: <stable@vger.kernel.org> # 4.4
Signed-off-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-19 08:21:46 +11:00
Dave Chinner
4c931f770d Merge branch 'xfs-setxattr-promotion' into for-next 2016-01-19 08:16:08 +11:00