35845 Commits

Author SHA1 Message Date
Goldwyn Rodrigues
8ed6b23709 ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock
The following patches are reverted in this patch because these patches
caused performance regression in the remote unlink() calls.

  ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
  f7b1aa69be13 - ocfs2: Fix deadlock on umount
  5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

Previous patches in this series removed the possible deadlocks from
downconvert thread so the above patches shouldn't be needed anymore.

The regression is caused because these patches delay the iput() in case
of dentry unlocks.  This also delays the unlocking of the open lockres.
The open lockresource is required to test if the inode can be wiped from
disk or not.  When the deleting node does not get the open lock, it
marks it as orphan (even though it is not in use by another
node/process) and causes a journal checkpoint.  This delays operations
following the inode eviction.  This also moves the inode to the orphaned
inode which further causes more I/O and a lot of unneccessary orphans.

The following script can be used to generate the load causing issues:

  declare -a create
  declare -a remove
  declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
  unique="`mktemp -u XXXXX`"
  script="/tmp/idontknow-${unique}.sh"
  cat <<EOF > "${script}"
  for n in {1..8}; do mkdir -p test/dir\${n}
    eval touch test/dir\${n}/foo{1.."\$1"}
  done
  EOF
  chmod 700 "${script}"

  function fcreate ()
  {
    exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
  }

  function fremove ()
  {
    exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
  }

  function fcp ()
  {
    exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
  }

  echo -------------------------------------------------
  echo "| # files | create #s | copy #s | remove #s |"
  echo -------------------------------------------------
  for ((x=0; x < ${#iterations[*]} ; x++)) do
    create[$x]="`fcreate ${iterations[$x]}`"
    copy[$x]="`fcp ${iterations[$x]}`"
    remove[$x]="`fremove`"
    printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
  done
  rm "${script}"
  echo "------------------------"

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Jan Kara
84d86f83f9 ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread
If we are dropping last inode reference from downconvert thread, we will
end up calling ocfs2_mark_lockres_freeing() which can block if the lock
we are freeing is queued thus creating an A-A deadlock.  Luckily, since
we are the downconvert thread, we can immediately dequeue the lock and
thus avoid waiting in this case.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:55 -07:00
Jan Kara
e3a767b60f ocfs2: implement delayed dropping of last dquot reference
We cannot drop last dquot reference from downconvert thread as that
creates the following deadlock:

NODE 1                                  NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                        dquot_initialize(bar)
                                          ocfs2_dquot_acquire()
                                            ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                            ...
downconvert thread (triggered from another
node or a different process from NODE2)
  ocfs2_dentry_post_unlock()
    ...
    iput(foo)
      ocfs2_evict_inode(foo)
        ocfs2_clear_inode(foo)
          dquot_drop(inode)
            ...
	    ocfs2_dquot_release()
              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
               - blocks
                                            finds we need more space in
                                            quota file
                                            ...
                                            ocfs2_extend_no_holes()
                                              ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                - deadlocks waiting for
                                                  downconvert thread

We solve the problem by postponing dropping of the last dquot reference to
a workqueue if it happens from the downconvert thread.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
9f985cb6c4 quota: provide function to grab quota structure reference
Provide dqgrab() function to get quota structure reference when we are
sure it already has at least one active reference.  Make use of this
function inside quota code.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
bd62ad7aeb ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later
Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
verify inode is actually a sane one to delete.  We certainly don't want
to initialize quota for system inodes etc.  This also avoids calling
into quota code from downconvert thread.

Add more details into the comment why bailing out from
ocfs2_delete_inode() when we are in downconvert thread is OK.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Jan Kara
7bf619c142 ocfs2: remove OCFS2_INODE_SKIP_DELETE flag
The flag was never set, delete it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Goldwyn Rodrigues
765aabbbc7 ocfs2: add dlm_recover_callback_support in sysfs
This is a part of the nocontrold feature which was incorporated sometime
back.

This is required for backward compatibility of the tools, specifically
the scenario where the tools with recovery callback is used with a
kernel not using the recovery callbacks (older kernel + newer tools).
The tools look for this file to understand if the kernel supports DLM
recovery callbacks.

For kernels which support recovery callbacks but will miss this patch,
ocfs2 will continue to use the older API and would still be able to
mount the filesystem.

[akpm@linux-foundation.org: simplify]
[sfr@canb.auug.org.au: VERIFY_OCTAL_PERMISSIONS fix up]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Junxiao Bi
ded2cf7141 ocfs2: dlm: fix recovery hung
There is a race window in dlm_do_recovery() between dlm_remaster_locks()
and dlm_reset_recovery() when the recovery master nearly finish the
recovery process for a dead node.  After the master sends FINALIZE_RECO
message in dlm_remaster_locks(), another node may become the recovery
master for another dead node, and then send the BEGIN_RECO message to
all the nodes included the old master, in the handler of this message
dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
dlm->reco.new_master will be set to the second dead node and the new
master, then in dlm_reset_recovery(), these two variables will be reset
to default value.  This will cause new recovery master can not finish
the recovery process and hung, at last the whole cluster will hung for
recovery.

old recovery master:                                 new recovery master:
dlm_remaster_locks()
                                                  become recovery master for
                                                  another dead node.
                                                  dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
 if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
  return -EAGAIN;
 }
 dlm_set_reco_master(dlm, br->node_idx);
 dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
 dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
 dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
                                                  will hang in dlm_remaster_locks() for
                                                  request dlm locks info

Before send FINALIZE_RECO message, recovery master should set
DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
this can break the race windows as the BEGIN_RECO messages will not be
handled before DLM_RECO_STATE_FINALIZE flag is cleared.

A similar race may happen between new recovery master and normal node
which is in dlm_finalize_reco_handler(), also fix it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Junxiao Bi
34aa8dac48 ocfs2: dlm: fix lock migration crash
This issue was introduced by commit 800deef3f6f8 ("ocfs2: use
list_for_each_entry where benefical") in 2007 where it replaced
list_for_each with list_for_each_entry.  The variable "lock" will point
to invalid data if "tmpq" list is empty and a panic will be triggered
due to this.  Sunil advised reverting it back, but the old version was
also not right.  At the end of the outer for loop, that
list_for_each_entry will also set "lock" to an invalid data, then in the
next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
data and cause the panic.  So reverting the list_for_each back and reset
"lock" to NULL to fix this issue.

Another concern is that this seemes can not happen because the "tmpq"
list should not be empty.  Let me describe how.

old lock resource owner(node 1):                                  migratation target(node 2):
image there's lockres with a EX lock from node 2 in
granted list, a NR lock from node x with convert_type
EX in converting list.
dlm_empty_lockres() {
 dlm_pick_migration_target() {
   pick node 2 as target as its lock is the first one
   in granted list.
 }
 dlm_migrate_lockres() {
   dlm_mark_lockres_migrating() {
     res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
	 //after the above code, we can not dirty lockres any more,
     // so dlm_thread shuffle list will not run
                                                                   downconvert lock from EX to NR
                                                                   upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.

	 // will set lockres RES_MIGRATING flag, the following
	 // lock/unlock can not run
     dlm_lockres_release_ast(dlm, res);
   }

   dlm_send_one_lockres()
                                                                 dlm_process_recovery_data()
                                                                   for (i=0; i<mres->num_locks; i++)
                                                                     if (ml->node == dlm->node_num)
                                                                       for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                        list_for_each_entry(lock, tmpq, list)
                                                                        if (lock) break; <<< lock is invalid as grant list is empty.
                                                                       }
                                                                       if (lock->ml.node != ml->node)
                                                                         BUG() >>> crash here
 }

I see the above locks status from a vmcore of our internal bug.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:54 -07:00
Darrick J. Wong
2931cdcb49 ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete.  This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.

Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit.  Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.

This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file().  aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
joyce.xue
a75fe48cad ocfs2: remove unused variable uuid_net_key in ocfs2_initialize_super
Variable uuid_net_key in ocfs2_initialize_super() is not used.  Clean it
up.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Wengang Wang
c18ceab012 ocfs2: change ip_unaligned_aio to of type mutex from atomit_t
There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.

The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid.  Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Zongxun Wang
181a9a043b ocfs2: fix null pointer dereference when access dlm_state before launching dlm thread
When mounting an ocfs2 volume, it will firstly generate a file
/sys/kernel/debug/o2dlm/<uuid>/dlm_state, and then launch the dlm thread.
So the following situation will cause a null pointer dereference.
dlm_debug_init -> access file dlm_state which will call dlm_state_print ->
dlm_launch_thread

Move dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread
can fix this issue.

Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:53 -07:00
Jan Kara
d507816b58 fanotify: move unrelated handling from copy_event_to_user()
Move code moving event structure to access_list from copy_event_to_user()
to fanotify_read() where it is more logical (so that we can immediately
see in the main loop that we either move the event to a different list
or free it).  Also move special error handling for permission events
from copy_event_to_user() to the main loop to have it in one place with
error handling for normal events.  This makes copy_event_to_user()
really only copy the event to user without any side effects.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
d8aaab4f61 fanotify: reorganize loop in fanotify_read()
Swap the error / "read ok" branches in the main loop of fanotify_read().
We will grow the "read ok" part in the next patch and this makes the
indentation easier.  Also it is more common to have error conditions
inside an 'if' instead of the fast path.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
9573f79355 fanotify: convert access_mutex to spinlock
access_mutex is used only to guard operations on access_list.  There's
no need for sleeping within this lock so just make a spinlock out of it.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
f083441ba8 fanotify: use fanotify event structure for permission response processing
Currently, fanotify creates new structure to track the fact that
permission event has been reported to userspace and someone is waiting
for a response to it.  As event structures are now completely in the
hands of each notification framework, we can use the event structure for
this tracking instead of allocating a new structure.

Since this makes the event structures for normal events and permission
events even more different and the structures have different lifetime
rules, we split them into two separate structures (where permission
event structure contains the structure for a normal event).  This makes
normal events 8 bytes smaller and the code a tad bit cleaner.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Jan Kara
3298cf37be fanotify: remove useless bypass_perm check
The prepare_for_access_response() function checks whether
group->fanotify_data.bypass_perm is set.  However this test can never be
true because prepare_for_access_response() is called only from
fanotify_read() which means fanotify group is alive with an active fd
while bypass_perm is set from fanotify_release() when all file
descriptors pointing to the group are closed and the group is going
away.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:51 -07:00
Fabian Frederick
ddae82d8e6 fs/freevxfs/vxfs_lookup.c: update function comment
nameidata was replaced by flags in commit 00cd8dd3bf95 ("stop passing
nameidata to ->lookup()").

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Fabian Frederick
9ee108b2c6 fs/cifs/cifsfs.c: add __init to cifs_init_inodecache()
cifs_init_inodecache is only called by __init init_cifs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Jan Kara
5acda9d12d bdi: avoid oops on device removal
After commit 839a8e8660b6 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Derek Basehore
6ca738d60c backing_dev: fix hung task on sync
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660b6 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zento.linux.org.uk>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benson Leung <bleung@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Luigi Semenzato <semenzato@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Dave Chinner
a6cf33bc56 Merge branch 'xfs-bug-fixes-for-3.15-3' into for-next 2014-04-04 08:07:35 +11:00
Mark Tinguely
c88547a811 xfs: fix directory hash ordering bug
Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown.  xfs_repair will fix the directory hash
ordering without data loss occuring.

[dchinner: wrote useful a commit message]

cc: <stable@vger.kernel.org>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 07:10:49 +11:00
Linus Torvalds
32d01dc7be Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot updates for cgroup:

   - The biggest one is cgroup's conversion to kernfs.  cgroup took
     after the long abandoned vfs-entangled sysfs implementation and
     made it even more convoluted over time.  cgroup's internal objects
     were fused with vfs objects which also brought in vfs locking and
     object lifetime rules.  Naturally, there are places where vfs rules
     don't fit and nasty hacks, such as credential switching or lock
     dance interleaving inode mutex and cgroup_mutex with object serial
     number comparison thrown in to decide whether the operation is
     actually necessary, needed to be employed.

     After conversion to kernfs, internal object lifetime and locking
     rules are mostly isolated from vfs interactions allowing shedding
     of several nasty hacks and overall simplification.  This will also
     allow implmentation of operations which may affect multiple cgroups
     which weren't possible before as it would have required nesting
     i_mutexes.

   - Various simplifications including dropping of module support,
     easier cgroup name/path handling, simplified cgroup file type
     handling and task_cg_lists optimization.

   - Prepatory changes for the planned unified hierarchy, which is still
     a patchset away from being actually operational.  The dummy
     hierarchy is updated to serve as the default unified hierarchy.
     Controllers which aren't claimed by other hierarchies are
     associated with it, which BTW was what the dummy hierarchy was for
     anyway.

   - Various fixes from Li and others.  This pull request includes some
     patches to add missing slab.h to various subsystems.  This was
     triggered xattr.h include removal from cgroup.h.  cgroup.h
     indirectly got included a lot of files which brought in xattr.h
     which brought in slab.h.

  There are several merge commits - one to pull in kernfs updates
  necessary for converting cgroup (already in upstream through
  driver-core), others for interfering changes in the fixes branch"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
  cgroup: remove useless argument from cgroup_exit()
  cgroup: fix spurious lockdep warning in cgroup_exit()
  cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
  cgroup: break kernfs active_ref protection in cgroup directory operations
  cgroup: fix cgroup_taskset walking order
  cgroup: implement CFTYPE_ONLY_ON_DFL
  cgroup: make cgrp_dfl_root mountable
  cgroup: drop const from @buffer of cftype->write_string()
  cgroup: rename cgroup_dummy_root and related names
  cgroup: move ->subsys_mask from cgroupfs_root to cgroup
  cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
  cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
  cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
  cgroup: reorganize cgroup bootstrapping
  cgroup: relocate setting of CGRP_DEAD
  cpuset: use rcu_read_lock() to protect task_cs()
  cgroup_freezer: document freezer_fork() subtleties
  cgroup: update cgroup_transfer_tasks() to either succeed or fail
  cgroup: drop task_lock() protection around task->cgroups
  cgroup: update how a newly forked task gets associated with css_set
  ...
2014-04-03 13:05:42 -07:00
Dan Carpenter
805eeb8e04 xfs: extra semi-colon breaks a condition
There were some extra semi-colons here which mean that we return true
unintentionally.

Fixes: a49935f200e2 ('xfs: xfs_check_page_type buffer checks need help')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 06:56:30 +11:00
Rashika Kheria
0961f02a37 fs: Mark functions as static in exofs/ore_raid.c
Mark functions as static in exofs/ore_raid.c because they are not used
outside this file.

This also eliminates the following warning in exofs/ore_raid.c:
fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes]
fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-04-03 11:40:55 +03:00
Rashika Kheria
af8d0cc0d2 fs: Mark function as static in exofs/super.c
Mark function as static in exofs/super.c because it is not used outside
this file.

This also eliminates the following warning in exofs/super.c:
 fs/exofs/super.c:546:5: warning: no previous prototype \
 for __alloc_dev_table[-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-04-03 11:36:42 +03:00
Yan, Zheng
c137a32a40 ceph: print inode number for LOOKUPINO request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:54 +08:00
Yan, Zheng
19913b4eac ceph: add get_name() NFS export callback
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
8996f4f23d ceph: fix ceph_fh_to_parent()
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
9017c2ec78 ceph: add get_parent() NFS export callback
The callback uses LOOKUPPARENT MDS request to find parent.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
4f32b42dca ceph: simplify ceph_fh_to_dentry()
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
f1fc4fee3b ceph: fscache: Wait for completion of object initialization
The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
32d3e148dd ceph: fscache: Update object store limit after file writing
Synchronize object->store_limit[_l] with new inode->i_size after file writing.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
020c4bddc0 ceph: fscache: add an interface to synchronize object store limit
Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Sage Weil
4b58c9b19b ceph: do not set r_old_dentry_dir on link()
This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
844d87c332 ceph: do not assume r_old_dentry[_dir] always set together
Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true.  Separate out the ref cleanup and make the debugs dump behave when
it is NULL.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
752c8bdcfe ceph: do not chain inode updates to parent fsync
The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Sage Weil
180061a58c ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()
This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
15289dc85b ceph: let MDS adjust readdir 'frag'
If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.

Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
dcd3cc05e5 ceph: fix reset_readdir()
When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Yan, Zheng
f049420607 ceph: fix ceph_dir_llseek()
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Linus Torvalds
159d8133d0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science -- mostly documentation and comment updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  sparse: fix comment
  doc: fix double words
  isdn: capi: fix "CAPI_VERSION" comment
  doc: DocBook: Fix typos in xml and template file
  Bluetooth: add module name for btwilink
  driver core: unexport static function create_syslog_header
  mmc: core: typo fix in printk specifier
  ARM: spear: clean up editing mistake
  net-sysfs: fix comment typo 'CONFIG_SYFS'
  doc: Insert MODULE_ in module-signing macros
  Documentation: update URL to hfsplus Technote 1150
  gpio: update path to documentation
  ixgbe: Fix format string in ixgbe_fcoe.
  Kconfig: Remove useless "default N" lines
  user_namespace.c: Remove duplicated word in comment
  CREDITS: fix formatting
  treewide: Fix typo in Documentation/DocBook
  mm: Fix warning on make htmldocs caused by slab.c
  ata: ata-samsung_cf: cleanup in header file
  idr: remove unused prototype of idr_free()
2014-04-02 16:23:38 -07:00
Jeff Mahoney
01d8885785 reiserfs: fix race in readdir
jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2)

The -ENOENT is due to readdir calling dir_emit on the same entry twice.

If the dir_emit callback sleeps and the tree is changed underneath us,
we won't be able to trust deh_offset(deh) anymore. We need to save
next_pos before we might sleep so we can find the next entry.

CC: stable@vger.kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-04-03 00:06:14 +02:00
Linus Torvalds
b9f2b21a32 Devicetree changes for v3.15
Updates to devicetree core code. This branch contains the following notable changes:
 * Add reserved memory binding
 * Make struct device_node a kobject and remove legacy /proc/device-tree
 * ePAPR conformance fixes
 * Update in-kernel DTC copy to version v1.4.0
 * Preparation changes for dynamic device tree overlays
 * minor bug fixes and documentation changes
 
 The most significant change in this branch is the conversion of struct
 device_node to be a kobject that is exposed via sysfs and removal of the
 old /proc/device-tree code. This simplifies the device tree handling
 code and tightens up the lifecycle on device tree nodes.
 
 [updated: added fix for dangling select PROC_DEVICETREE]
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOyNwAAoJEMWQL496c2LNZY0QAIreUrpo3/hKRau61EDPXkOA
 UFRyPUHD0k/dNXWWDbTfvKH/nAfzdVwejhePqEWiODiFOFkq7JyQlMKPA+CZuZj0
 ygN4215A1yj/hDf6JRD5Zn4WGpawDt9InlbZSps6P5dd8voV5t5dz6uzz+Y7uqaK
 CAjTDlBSmxEen5vRHiHQgKv74au/+b9yfSURjPQVWg46+wl3WJwjsdzerphm4unW
 tpEr8zkIsm51mqqAx4penIuiovh7+L2J5v4BFeg8o+kaZEuZpVxLHJPOuBd5hdom
 zeqEIj3AqHTh5suYIHe4aAbZ2wMP3kYGgkPGwfWLnwLyULxalcCtGZeaCi9nwTFj
 Fdj+7f17ocrt5mif0f5Deufi1LqJsDjhY6G9p7HuV7Y9hsMILpJIUoGENPji+TWj
 BA4L45eaPmNYdKJytEtFD7F2WnXeHZ6fDtYho/39DWW+Bt16IFX85T199irhxGG4
 byN6LRaahk2UeycSXkQHAlWOQHqzBcJJAkQLN2iahzyYRr9Dy+VI2E9clm53m49O
 YQYcONdUlMYrtfRwJpbB9XHM0HgZUvg0LT5z/iHQs9uJtoo33Oj+zxFixyZLQ9Dq
 qyLqQWEpV9gFLAo9tpf56gffkLiJRsHkX4UJ6oTtj4DY1WWU9H81jjCvv/7flzp/
 8ZyyZzANQf1DZ9kqO2v+
 =lyA5
 -----END PGP SIGNATURE-----

Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux

Pull devicetree changes from Grant Likely:
 "Updates to devicetree core code.  This branch contains the following
  notable changes:

   - add reserved memory binding
   - make struct device_node a kobject and remove legacy
     /proc/device-tree
   - ePAPR conformance fixes
   - update in-kernel DTC copy to version v1.4.0
   - preparatory changes for dynamic device tree overlays
   - minor bug fixes and documentation changes

  The most significant change in this branch is the conversion of struct
  device_node to be a kobject that is exposed via sysfs and removal of
  the old /proc/device-tree code.  This simplifies the device tree
  handling code and tightens up the lifecycle on device tree nodes.

  [updated: added fix for dangling select PROC_DEVICETREE]"

* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
  dt: Remove dangling "select PROC_DEVICETREE"
  of: Add support for ePAPR "stdout-path" property
  of: device_node kobject lifecycle fixes
  of: only scan for reserved mem when fdt present
  powerpc: add support for reserved memory defined by device tree
  arm64: add support for reserved memory defined by device tree
  of: add missing major vendors
  of: add vendor prefix for SMSC
  of: remove /proc/device-tree
  of/selftest: Add self tests for manipulation of properties
  of: Make device nodes kobjects so they show up in sysfs
  arm: add support for reserved memory defined by device tree
  drivers: of: add support for custom reserved memory drivers
  drivers: of: add initialization code for dynamic reserved memory
  drivers: of: add initialization code for static reserved memory
  of: document bindings for reserved-memory nodes
  Revert "of: fix of_update_property()"
  kbuild: dtbs_install: new make target
  ARM: mvebu: Allows to get the SoC ID even without PCI enabled
  of: Allows to use the PCI translator without the PCI core
  ...
2014-04-02 14:27:15 -07:00
Linus Torvalds
7125764c5d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull compat time conversion changes from Peter Anvin:
 "Despite the branch name this is really neither an x86 nor an
  x32-specific patchset, although it the implementation of the
  discussions that followed the x32 security hole a few months ago.

  This removes get/put_compat_timespec/val() and replaces them with
  compat_get/put_timespec/val() which are savvy as to the current status
  of COMPAT_USE_64BIT_TIME.

  It removes several unused and/or incorrect/misleading functions (like
  compat_put_timeval_convert which doesn't in fact do any conversion)
  and also replaces several open-coded implementations what is now
  called compat_convert_timespec() with that function"

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  compat: Fix sparse address space warnings
  compat: Get rid of (get|put)_compat_time(val|spec)
2014-04-02 12:51:41 -07:00
Rajat Jain
f3846266f5 fuse: fix "uninitialized variable" warning
Fix the following warning:

In file included from include/linux/fs.h:16:0,
                 from fs/fuse/fuse_i.h:13,
                 from fs/fuse/file.c:9:
fs/fuse/file.c: In function 'fuse_file_poll':
include/linux/rbtree.h:82:28: warning: 'parent' may be used
uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/file.c:2592:27: note: 'parent' was declared here

Signed-off-by: Rajat Jain <rajatxjain@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:51 +02:00
Pavel Emelyanov
4d99ff8f12 fuse: Turn writeback cache on
Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this and
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:50 +02:00
Pavel Emelyanov
ea8cd33390 fuse: Fix O_DIRECT operations vs cached writeback misorder
The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When direct op occurs the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operation.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02 15:38:50 +02:00