52702 Commits

Author SHA1 Message Date
Nikolay Borisov
952bd3db0d btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
Running generic/019 with qgroups on the scratch device enabled is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM, in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails nothing critical happens except the quota
accounting can be considered wrong. In such case just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ error message adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:25:14 +01:00
Liu Bo
900c998168 Btrfs: fix unexpected -EEXIST when creating new inode
The highest objectid, which is assigned to new inode, is decided at
the time of initializing fs roots.  However, in cases where log replay
gets processed, the btree which fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
new inode would end up with -EEXIST.

cc: <stable@vger.kernel.org> v4.4-rc6+
Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:53 +01:00
Liu Bo
1a932ef4e4 Btrfs: fix use-after-free on root->orphan_block_rsv
I got these from running generic/475,

WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs]
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs]
Call Trace:
  btrfs_orphan_release_metadata+0x9f/0x200 [btrfs]
  btrfs_orphan_del+0x10d/0x170 [btrfs]
  btrfs_setattr+0x500/0x640 [btrfs]
  notify_change+0x7ae/0x870
  do_truncate+0xca/0x130
  vfs_truncate+0x2ee/0x3d0
  do_sys_truncate+0xaf/0xf0
  SyS_truncate+0xe/0x10
  entry_SYSCALL_64_fastpath+0x1f/0x96

The race is between btrfs_orphan_commit_root and btrfs_orphan_del,
        t1                                        t2
btrfs_orphan_commit_root                     btrfs_orphan_del
   spin_lock
   check (&root->orphan_inodes)
   root->orphan_block_rsv = NULL;
   spin_unlock
                                             atomic_dec(&root->orphan_inodes);
                                             access root->orphan_block_rsv

Accessing root->orphan_block_rsv must be done before decreasing
root->orphan_inodes.

cc: <stable@vger.kernel.org> v3.12+
Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:40 +01:00
Liu Bo
e8f1bc1493 Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
This regression is introduced in
commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction").

There are two problems,

a) it is ->destroy_inode() that does the final free on inode, not
   ->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.

This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().

Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:35 +01:00
Liu Bo
55237a5f24 Btrfs: fix extent state leak from tree log
It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() which convert extent state's state bit into
EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
and EXTENT_NEW are searched by free_log_tree() so that those extent states
with EXTENT_NEED_WAIT lead to memory leak.

cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:30 +01:00
Liu Bo
1846430c24 Btrfs: fix crash due to not cleaning up tree log block's dirty bits
In cases that the whole fs flips into readonly status due to failures in
critical sections, then log tree's blocks are still dirty, and this leads
to a crash during umount time, the crash is about use-after-free,

umount
 -> close_ctree
    -> stop workers
    -> iput(btree_inode)
       -> iput_final
          -> write_inode_now
	     -> ...
	       -> queue job on stop'd workers

cc: <stable@vger.kernel.org> v3.12+
Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:24 +01:00
Liu Bo
e89166990f Btrfs: fix deadlock in run_delalloc_nocow
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.

cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:19 +01:00
Darrick J. Wong
76883f7988 xfs: remove experimental tag for reverse mapping
Reverse mapping has had a while to soak, so remove the experimental tag.
Now that we've landed space metadata cross-referencing in scrub, the
feature actually has a purpose.

Reject rmap filesystems with an rt device until the code to support it
is actually implemented.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-02-01 21:07:26 -08:00
Darrick J. Wong
c14632ddac xfs: don't allow reflink + realtime filesystems
We don't support realtime filesystems with reflink either, so fail
those mounts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-02-01 21:06:16 -08:00
Darrick J. Wong
b6e03c10bf xfs: don't allow DAX on reflink filesystems
Now that reflink is no longer experimental, reject attempts to mount
with DAX until that whole mess gets sorted out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-02-01 21:06:15 -08:00
Eric Sandeen
494370ccaa xfs: add scrub to XFS_BUILD_OPTIONS
Advertise this config option along with the others.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-01 21:06:15 -08:00
Linus Torvalds
ab486bc9a5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:

 - Add a console_msg_format command line option:

     The value "default" keeps the old "[time stamp] text\n" format. The
     value "syslog" allows to see the syslog-like "<log
     level>[timestamp] text" format.

     This feature was requested by people doing regression tests, for
     example, 0day robot. They want to have both filtered and full logs
     at hands.

 - Reduce the risk of softlockup:

     Pass the console owner in a busy loop.

     This is a new approach to the old problem. It was first proposed by
     Steven Rostedt on Kernel Summit 2017. It marks a context in which
     the console_lock owner calls console drivers and could not sleep.
     On the other side, printk() callers could detect this state and use
     a busy wait instead of a simple console_trylock(). Finally, the
     console_lock owner checks if there is a busy waiter at the end of
     the special context and eventually passes the console_lock to the
     waiter.

     The hand-off works surprisingly well and helps in many situations.
     Well, there is still a possibility of the softlockup, for example,
     when the flood of messages stops and the last owner still has too
     much to flush.

     There is increasing number of people having problems with
     printk-related softlockups. We might eventually need to get better
     solution. Anyway, this looks like a good start and promising
     direction.

 - Do not allow to schedule in console_unlock() called from printk():

     This reverts an older controversial commit. The reschedule helped
     to avoid softlockups. But it also slowed down the console output.
     This patch is obsoleted by the new console waiter logic described
     above. In fact, the reschedule made the hand-off less effective.

 - Deprecate "%pf" and "%pF" format specifier:

     It was needed on ia64, ppc64 and parisc64 to dereference function
     descriptors and show the real function address. It is done
     transparently by "%ps" and "pS" format specifier now.

     Sergey Senozhatsky found that all the function descriptors were in
     a special elf section and could be easily detected.

 - Remove printk_symbol() API:

     It has been obsoleted by "%pS" format specifier, and this change
     helped to remove few continuous lines and a less intuitive old API.

 - Remove redundant memsets:

     Sergey removed unnecessary memset when processing printk.devkmsg
     command line option.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits)
  printk: drop redundant devkmsg_log_str memsets
  printk: Never set console_may_schedule in console_trylock()
  printk: Hide console waiter logic into helpers
  printk: Add console owner and waiter logic to load balance console writes
  kallsyms: remove print_symbol() function
  checkpatch: add pF/pf deprecation warning
  symbol lookup: introduce dereference_symbol_descriptor()
  parisc64: Add .opd based function descriptor dereference
  powerpc64: Add .opd based function descriptor dereference
  ia64: Add .opd based function descriptor dereference
  sections: split dereference_function_descriptor()
  openrisc: Fix conflicting types for _exext and _stext
  lib: do not use print_symbol()
  irq debug: do not use print_symbol()
  sysfs: do not use print_symbol()
  drivers: do not use print_symbol()
  x86: do not use print_symbol()
  unicore32: do not use print_symbol()
  sh: do not use print_symbol()
  mn10300: do not use print_symbol()
  ...
2018-02-01 13:36:15 -08:00
Al Viro
d85e2aa2e3 annotate ep_scan_ready_list()
make it always return __poll_t and have its callbacks do the same

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-01 16:30:06 -05:00
Al Viro
d7ebbe46f4 ep_send_events_proc(): return result via esed->res
preparations for not mixing __poll_t and int in ep_scan_ready_list()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-01 16:29:49 -05:00
Al Viro
cfe39442ab use linux/poll.h instead of asm/poll.h
The only place that has any business including asm/poll.h
is linux/poll.h.  Fortunately, asm/poll.h had only been
included in 3 places beyond that one, and all of them
are trivial to switch to using linux/poll.h.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-01 16:23:11 -05:00
Linus Torvalds
8e44e6600c Merge branch 'KASAN-read_word_at_a_time'
Merge KASAN word-at-a-time fixups from Andrey Ryabinin.

The word-at-a-time optimizations have caused headaches for KASAN, since
the whole point is that we access byte streams in bigger chunks, and
KASAN can be unhappy about the potential extra access at the end of the
string.

We used to have a horrible hack in dcache, and then people got
complaints from the strscpy() case.  This fixes it all up properly, by
adding an explicit helper for the "access byte stream one word at a
time" case.

* emailed patches from Andrey Ryabinin <aryabinin@virtuozzo.com>:
  fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports"
  fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()
  lib/strscpy: Shut up KASAN false-positives in strscpy()
  compiler.h: Add read_word_at_a_time() function.
  compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
2018-02-01 12:20:53 -08:00
Andrey Ryabinin
babcbbc7c4 fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports"
This reverts commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd.

It's no longer needed since dentry_string_cmp() now uses
read_word_at_a_time() to avoid kasan's reports.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 12:20:21 -08:00
Andrey Ryabinin
bfe7aa6c39 fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()
dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may
read slightly more than it was requested in kmallac().  Normally this
would make KASAN to report out-of-bounds access, but this was
workarounded by commit df4c0e36f1b1 ("fs: dcache: manually unpoison
dname after allocation to shut up kasan's reports").

This workaround is not perfect, since it allows out-of-bounds access to
dentry's name for all the code, not just in dentry_string_cmp().

So it would be better to use read_word_at_a_time() instead and revert
commit df4c0e36f1b1.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 12:20:21 -08:00
Andreas Gruenbacher
7ac07fdaf8 gfs2: Glock dump performance regression fix
Restore an optimization removed in commit 7f19449553 "Fix debugfs glocks
dump": keep the glock hash table iterator active while the glock dump
file is held open.  This avoids having to rescan the hash table from the
start for each read, with quadratically rising runtime.

In addition, use rhastable_walk_peek for resuming a glock dump at the
current position: when a glock doesn't fit in the provided buffer
anymore, the next read must revisit the same glock.

Finally, also restart the dump from the first entry when we notice that
the hash table has been resized in gfs2_glock_seq_start.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-02-01 11:27:11 -07:00
Andreas Gruenbacher
dcb2cd55cf gfs2: Fix the crc32c dependency
Depend on LIBCRC32C which uses the crypto API to select the appropriate
crc32c implementation.  With the CRYPTO and CRYPTO_CRC32C dependencies,
gfs2 would still need to use the crypto API directly like ext4 and btrfs
do, which isn't necessary.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-02-01 11:25:31 -07:00
Linus Torvalds
47fcc0360c Driver Core updates for 4.16-rc1
Here is the set of "big" driver core patches for 4.16-rc1.
 
 The majority of the work here is in the firmware subsystem, with reworks
 to try to attempt to make the code easier to handle in the long run, but
 no functional change.  There's also some tree-wide sysfs attribute
 fixups with lots of acks from the various subsystem maintainers, as well
 as a handful of other normal fixes and changes.
 
 And finally, some license cleanups for the driver core and sysfs code.
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLvPw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynNzACgkzjPoBytJWbpWFt6SR6L33/u4kEAnRFvVCGL
 s6ygQPQhZIjKk2Lxa2hC
 =Zihy
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of "big" driver core patches for 4.16-rc1.

  The majority of the work here is in the firmware subsystem, with
  reworks to try to attempt to make the code easier to handle in the
  long run, but no functional change. There's also some tree-wide sysfs
  attribute fixups with lots of acks from the various subsystem
  maintainers, as well as a handful of other normal fixes and changes.

  And finally, some license cleanups for the driver core and sysfs code.

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (48 commits)
  device property: Define type of PROPERTY_ENRTY_*() macros
  device property: Reuse property_entry_free_data()
  device property: Move property_entry_free_data() upper
  firmware: Fix up docs referring to FIRMWARE_IN_KERNEL
  firmware: Drop FIRMWARE_IN_KERNEL Kconfig option
  USB: serial: keyspan: Drop firmware Kconfig options
  sysfs: remove DEBUG defines
  sysfs: use SPDX identifiers
  drivers: base: add coredump driver ops
  sysfs: add attribute specification for /sysfs/devices/.../coredump
  test_firmware: fix missing unlock on error in config_num_requests_store()
  test_firmware: make local symbol test_fw_config static
  sysfs: turn WARN() into pr_warn()
  firmware: Fix a typo in fallback-mechanisms.rst
  treewide: Use DEVICE_ATTR_WO
  treewide: Use DEVICE_ATTR_RO
  treewide: Use DEVICE_ATTR_RW
  sysfs.h: Use octal permissions
  component: add debugfs support
  bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate
  ...
2018-02-01 10:00:28 -08:00
Linus Torvalds
5d8515bc23 Staging/IIO patches for 4.16-rc1
Here is the big Staging and IIO driver patches for 4.16-rc1.
 
 There is the normal amount of new IIO drivers added, like all releases.
 
 The networking IPX and the ncpfs filesystem are moved into the staging
 tree, as they are on their way out of the kernel due to lack of use
 anymore.
 
 The visorbus subsystem finall has started moving out of the staging tree
 to the "real" part of the kernel, and the most and fsl-mc codebases are
 almost ready to move out, that will probably happen for 4.17-rc1 if all
 goes well.
 
 Other than that, there is a bunch of license header cleanups in the
 tree, along with the normal amount of coding style churn that we all
 know and love for this codebase.  I also got frustrated at the
 Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
 huge chunks of it that were never even being used.
 
 Full details of everything is in the shortlog.
 
 All of these patches have been in linux-next for a while with no
 reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLxoA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk4vgCgjeMlwhtar65DIticIRj626EFxiQAnjGmH8Kd
 d9Xz2Piq8X47uSsC/6AE
 =xxMT
 -----END PGP SIGNATURE-----

Merge tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO updates from Greg KH:
 "Here is the big Staging and IIO driver patches for 4.16-rc1.

  There is the normal amount of new IIO drivers added, like all
  releases.

  The networking IPX and the ncpfs filesystem are moved into the staging
  tree, as they are on their way out of the kernel due to lack of use
  anymore.

  The visorbus subsystem finall has started moving out of the staging
  tree to the "real" part of the kernel, and the most and fsl-mc
  codebases are almost ready to move out, that will probably happen for
  4.17-rc1 if all goes well.

  Other than that, there is a bunch of license header cleanups in the
  tree, along with the normal amount of coding style churn that we all
  know and love for this codebase. I also got frustrated at the
  Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
  huge chunks of it that were never even being used.

  Full details of everything is in the shortlog.

  All of these patches have been in linux-next for a while with no
  reported issues"

* tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
  staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
  staging: rtl8723bs: remove a couple of redundant initializations
  staging: comedi: reformat lines to 80 chars or less
  staging: lustre: separate a connection destroy from free struct kib_conn
  Staging: rtl8723bs: Use !x instead of NULL comparison
  Staging: rtl8723bs: Remove dead code
  Staging: rtl8723bs: Change names to conform to the kernel code
  staging: ccree: Fix missing blank line after declaration
  staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
  staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
  staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
  staging: comedi: dt2811: remove redundant initialization of 'ns'
  staging: wilc1000: fix alignments to match open parenthesis
  staging: wilc1000: removed unnecessary defined enums typedef
  staging: wilc1000: remove unnecessary use of parentheses
  staging: rtl8192u: remove redundant initialization of 'timeout'
  staging: sm750fb: fix CamelCase for dispSet var
  staging: lustre: lnet/selftest: fix compile error on UP build
  staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
  staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
  ...
2018-02-01 09:51:57 -08:00
Eric Biggers
0b1dfa4cc6 fscrypt: fix build with pre-4.6 gcc versions
gcc versions prior to 4.6 require an extra level of braces when using a
designated initializer for a member in an anonymous struct or union.
This caused a compile error with the 'struct qstr' initialization in
__fscrypt_encrypt_symlink().

Fix it by using QSTR_INIT().

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Fixes: 76e81d6d5048 ("fscrypt: new helper functions for ->symlink()")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-02-01 10:51:18 -05:00
Goffredo Baroncelli
c472c07bfe iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}
The function inode_cmp_iversion{+raw} is counter-intuitive, because it
returns true when the counters are different and false when these are equal.

Rename it to inode_eq_iversion{+raw}, which will returns true when
the counters are equal and false otherwise.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-02-01 08:15:25 -05:00
Darrick J. Wong
131fa58d39 xfs: fix u32 type usage in sb validation function
Don't use u32, use uint32_t, because this won't work in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-01-31 20:39:20 -08:00
Linus Torvalds
255442c938 Documentation updates for 4.16. New stuff includes refcount_t
documentation, errseq documentation, kernel-doc support for nested
 structure definitions, the removal of lots of crufty kernel-doc support for
 unused formats, SPDX tag documentation, the beginnings of a manual for
 subsystem maintainers, and lots of fixes and updates.
 
 As usual, some of the changesets reach outside of Documentation/ to effect
 kerneldoc comment fixes.  It also adds the new LICENSES directory, of which
 Thomas promises I do not need to be the maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
 Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
 p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
 HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
 bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
 WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
 QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
 vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
 p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
 puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
 XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
 XeQ7sbd+S+jzVVjFi0W5
 =Z15Z
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.16' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Documentation updates for 4.16.

  New stuff includes refcount_t documentation, errseq documentation,
  kernel-doc support for nested structure definitions, the removal of
  lots of crufty kernel-doc support for unused formats, SPDX tag
  documentation, the beginnings of a manual for subsystem maintainers,
  and lots of fixes and updates.

  As usual, some of the changesets reach outside of Documentation/ to
  effect kerneldoc comment fixes. It also adds the new LICENSES
  directory, of which Thomas promises I do not need to be the
  maintainer"

* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
  linux-next: docs-rst: Fix typos in kfigure.py
  linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
  Documentation: Fix misconversion of #if
  docs: add index entry for networking/msg_zerocopy
  Documentation: security/credentials.rst: explain need to sort group_list
  LICENSES: Add MPL-1.1 license
  LICENSES: Add the GPL 1.0 license
  LICENSES: Add Linux syscall note exception
  LICENSES: Add the MIT license
  LICENSES: Add the BSD-3-clause "Clear" license
  LICENSES: Add the BSD 3-clause "New" or "Revised" License
  LICENSES: Add the BSD 2-clause "Simplified" license
  LICENSES: Add the LGPL-2.1 license
  LICENSES: Add the LGPL 2.0 license
  LICENSES: Add the GPL 2.0 license
  Documentation: Add license-rules.rst to describe how to properly identify file licenses
  scripts: kernel_doc: better handle show warnings logic
  fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
  doc: md: Fix a file name to md-fault.c in fault-injection.txt
  errseq: Add to documentation tree
  ...
2018-01-31 19:25:25 -08:00
Linus Torvalds
dc1efc3cfa Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
 "Neil Brown's d_move()/d_path() race fix"

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: close race between getcwd() and d_move()
2018-01-31 19:15:23 -08:00
Linus Torvalds
73da9e1a9f Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - misc fixes

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  mm: remove PG_highmem description
  tools, vm: new option to specify kpageflags file
  mm/swap.c: make functions and their kernel-doc agree
  mm, memory_hotplug: fix memmap initialization
  mm: correct comments regarding do_fault_around()
  mm: numa: do not trap faults on shared data section pages.
  hugetlb, mbind: fall back to default policy if vma is NULL
  hugetlb, mempolicy: fix the mbind hugetlb migration
  mm, hugetlb: further simplify hugetlb allocation API
  mm, hugetlb: get rid of surplus page accounting tricks
  mm, hugetlb: do not rely on overcommit limit during migration
  mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
  mm, hugetlb: unify core page allocation accounting and initialization
  mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
  mm/memcontrol.c: make local symbol static
  mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
  include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
  mm/compaction.c: fix comment for try_to_compact_pages()
  mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
  zsmalloc: use U suffix for negative literals being shifted
  ...
2018-01-31 18:46:22 -08:00
Eric Biggers
284cd241a1 userfaultfd: convert to use anon_inode_getfd()
Nothing actually calls userfaultfd_file_create() besides the
userfaultfd() system call itself.  So simplify things by folding it into
the system call and using anon_inode_getfd() instead of
anon_inode_getfile().  Do the same in resolve_userfault_fork() as well.

This removes over 50 lines with no change in functionality.

Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau
ff62a34210 hugetlb: implement memfd sealing
Implements memfd sealing, similar to shmem:
 - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
   memfd_add_seals(). write() doesn't exist for hugetlbfs.
 - SHRINK: added similar check as shmem_setattr()
 - GROW: added similar check as shmem_setattr() & shmem_fallocate()

Except write() operation that doesn't exist with hugetlbfs, that should
make sealing as close as it can be to shmem support.

Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau
da14c1e524 hugetlb: expose hugetlbfs_inode_info in header
hugetlbfs inode information will need to be accessed by code in
mm/shmem.c for file sealing operations.  Move inode information
definition from .c file to header for needed access.

Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau
5aadc431a5 shmem: rename functions that are memfd-related
Those functions are called for memfd files, backed by shmem or hugetlb
(the next patches will handle hugetlb).

Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Kirill A. Shutemov
a3cf988fcb mm: use updated pmdp_invalidate() interface to track dirty/accessed bits
Use the modifed pmdp_invalidate() that returns the previous value of pmd
to transfer dirty and accessed bits.

Link: http://lkml.kernel.org/r/20171213105756.69879-12-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:38 -08:00
Matthew Wilcox
977fbdcd59 mm: add unmap_mapping_pages()
Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes.  Unfortuately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that.  That's a sign of a bad interface.

Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.

Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:37 -08:00
Huang Ying
a365ac09d3 mm, userfaultfd, THP: avoid waiting when PMD under THP migration
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,

  do_page_fault()
    __do_huge_pmd_anonymous_page()
     handle_userfault()
       userfault_msg()
         /* a huge page is allocated and mapped at fault address */
         /* the huge page is under migration, leaves migration entry
            in page table */
       userfaultfd_must_wait()
         /* return true because !pmd_present() */
       /* may wait in loop until fatal signal */

That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present().  In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.

This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().

This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked.  mremap() will
write-lock mm->mmap_sem.  And UFFDIO_COPY doesn't support to copy THP
mapping.  But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.

Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:37 -08:00
Konstantin Khlebnikov
8526d84f81 fs/proc/task_mmu.c: do not show VmExe bigger than total executable virtual memory
If start_code / end_code pointers are screwed then "VmExe" could be
bigger than total executable virtual memory and "VmLib" becomes
negative:

  VmExe:	  294320 kB
  VmLib:	18446744073709327564 kB

VmExe and VmLib documented as text segment and shared library code size.

Now their sum will be always equal to mm->exec_vm which sums size of
executable and not writable and not stack areas.

I've seen this for huge (>2Gb) statically linked binary which has whole
world inside.  For it start_code ..  end_code range also covers one of
rodata sections.  Probably this is bug in customized linker, elf loader
or both.

Anyway CONFIG_CHECKPOINT_RESTORE allows to change these pointers, thus
we cannot trust them without validation.

Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:37 -08:00
piaojun
d984187e3a ocfs2: return error when we attempt to access a dirty bh in jbd2
We should not reuse the dirty bh in jbd2 directly due to the following
situation:

1. When removing extent rec, we will dirty the bhs of extent rec and
   truncate log at the same time, and hand them over to jbd2.

2. The bhs are submitted to jbd2 area successfully.

3. The write-back thread of device help flush the bhs to disk but
   encounter write error due to abnormal storage link.

4. After a while the storage link become normal. Truncate log flush
   worker triggered by the next space reclaiming found the dirty bh of
   truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
   __ocfs2_journal_access():

   ocfs2_truncate_log_worker
     ocfs2_flush_truncate_log
       __ocfs2_flush_truncate_log
         ocfs2_replay_truncate_records
           ocfs2_journal_access_di
             __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.

5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
   extent rec is still in error state, and unfortunately nobody will
   take care of it.

6. At last the space of extent rec was not reduced, but truncate log
   flush worker have given it back to globalalloc. That will cause
   duplicate cluster problem which could be identified by fsck.ocfs2.

Sadly we can hardly revert this but set fs read-only in case of ruining
atomicity and consistency of space reclaim.

Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Changwei Ge
e75ed71be4 ocfs2: unlock bh_state if bg check fails
We should unlock bh_stat if bg->bg_free_bits_count > bg->bg_bits

Link: http://lkml.kernel.org/r/1516843095-23680-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
c4c2416ab0 ocfs2: nowait aio support
Return EAGAIN if any of the following checks fail for direct I/O:

 - Cannot get the related locks immediately

 - Blocks are not allocated at the write location, it will trigger block
   allocation and block IO operations.

[ghe@suse.com: v4]
  Link: http://lkml.kernel.org/r/1516007283-29932-4-git-send-email-ghe@suse.com
[ghe@suse.com: v2]
  Link: http://lkml.kernel.org/r/1511944612-9629-4-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-4-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
ac604d3cdb ocfs2: add ocfs2_overwrite_io()
Add ocfs2_overwrite_io function, which is used to judge if overwrite
allocated blocks, otherwise, the write will bring extra block allocation
overhead.

[ghe@suse.com: v3]
  Link: http://lkml.kernel.org/r/1514455665-16325-3-git-send-email-ghe@suse.com
[ghe@suse.com: v2]
  Link: http://lkml.kernel.org/r/1511944612-9629-3-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-3-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
06e7f13d19 ocfs2: add ocfs2_try_rw_lock() and ocfs2_try_inode_lock()
Patch series "ocfs2: add nowait aio support", v4.

VFS layer has introduced the non-blocking aio flag IOCB_NOWAIT, which
tells the kernel to bail out if an AIO request will block for reasons
such as file allocations, or writeback triggering, or would block while
allocating requests while performing direct I/O.

Subsequently, pwritev2/preadv2 also can leverage this part of kernel
code.  So far, ext4/xfs/btrfs have supported this feature.  Add the
related code for the ocfs2 file system.

This patch (of 3):

Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which will be
used in non-blocking IO scenarios.

[ghe@suse.com: v2]
  Link: http://lkml.kernel.org/r/1511944612-9629-2-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1511775987-841-2-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Acked-by: alex chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
637dd20c49 ocfs2: add trimfs lock to avoid duplicated trims in cluster
ocfs2 supports trimming the underlying disk via the fstrim command.  But
there is a problem, ocfs2 is a shared disk cluster file system, if the
user configures a scheduled fstrim job on each file system node, this
will trigger multiple nodes trimming a shared disk simultaneously, which
is very wasteful for CPU and IO consumption.  This also might negatively
affect the lifetime of poor-quality SSD devices.

So we introduce a trimfs dlm lock to communicate with each other in this
case, which will make only one fstrim command to do the trimming on a
shared disk among the cluster.  The fstrim commands from the other nodes
should wait for the first fstrim to finish and return success directly,
to avoid running the same trim on the shared disk again.

Link: http://lkml.kernel.org/r/1513228484-2084-2-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
4882abebcc ocfs2: add trimfs dlm lock resource
Introduce a new dlm lock resource, which will be used to communicate
during fstrimming of an ocfs2 device from cluster nodes.

Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Changwei Ge
71a3694404 ocfs2: try to reuse extent block in dealloc without meta_alloc
A crash issue was reported by John Lightsey with a call trace as follows:

  ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
  ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
  ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
  ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
  dio_complete+0x19a/0x1a0
  do_blockdev_direct_IO+0x19dd/0x1eb0
  __blockdev_direct_IO+0x43/0x50
  ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
  generic_file_direct_write+0xb2/0x170
  __generic_file_write_iter+0xc3/0x1b0
  ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
  __vfs_write+0xae/0xf0
  vfs_write+0xb8/0x1b0
  SyS_write+0x4f/0xb0
  system_call_fastpath+0x16/0x75

The BUG code told that extent tree wants to grow but no metadata was
reserved ahead of time.  From my investigation into this issue, the root
cause it that although enough metadata is not reserved, there should be
enough for following use.  Rightmost extent is merged into its left one
due to a certain times of marking extent written.  Because during
marking extent written, we got many physically continuous extents.  At
last, an empty extent showed up and the rightmost path is removed from
extent tree.

Add a new mechanism to reuse extent block cached in dealloc which were
just unlinked from extent tree to solve this crash issue.

Criteria is that during marking extents *written*, if extent rotation
and merging results in unlinking extent with growing extent tree later
without any metadata reserved ahead of time, try to reuse those extents
in dealloc in which deleted extents are cached.

Also, this patch addresses the issue John reported that ::dw_zero_count
is not calculated properly.

After applying this patch, the issue John reported was gone.  Thanks for
the reproducer provided by John.  And this patch has passed
ocfs2-test(29 cases) suite running by New H3C Group.

[ge.changwei@h3c.com: fix static checker warnning]
  Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F29196AE@H3CMLB12-EX.srv.huawei-3com.com
[akpm@linux-foundation.org: brelse(NULL) is legal]
Link: http://lkml.kernel.org/r/1515479070-32653-2-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: John Lightsey <john@nixnuts.net>
Tested-by: John Lightsey <john@nixnuts.net>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Changwei Ge
63de8bd932 ocfs2: make metadata estimation accurate and clear
Current code assume that ::w_unwritten_list always has only one item on.
This is not right and hard to get understood.  So improve how to count
unwritten item.

Link: http://lkml.kernel.org/r/1515479070-32653-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: John Lightsey <john@nixnuts.net>
Tested-by: John Lightsey <john@nixnuts.net>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
piaojun
16c8d569f5 ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
The race between *set_acl and *get_acl will cause getting incomplete
xattr data as below:

  processA                                    processB

  ocfs2_set_acl
    ocfs2_xattr_set
      __ocfs2_xattr_set_handle

                                              ocfs2_get_acl_nolock
                                                ocfs2_xattr_get_nolock:

processB may get incomplete xattr data if processA hasn't set_acl done.

So we should use 'ip_xattr_sem' to protect getting extended attribute in
ocfs2_get_acl_nolock(), as other processes could be changing it
concurrently.

Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Changwei Ge
d22aa61549 ocfs2: clean up dead code in alloc.c
Some stack variables are no longer used but still assigned.  Trim them.

Link: http://lkml.kernel.org/r/1516105069-12643-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
piaojun
c0a1a6d769 ocfs2/xattr: assign errno to 'ret' in ocfs2_calc_xattr_init()
We need catch the errno returned by ocfs2_xattr_get_nolock() and assign
it to 'ret' for printing and noticing upper callers.

Link: http://lkml.kernel.org/r/5A571CAF.8050709@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Gang He <ghe@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
Gang He
ff26cc10ae ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
If we can't get inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is configured with CONFIG_PREEMPT is not set.  The method is to
get a blocking lock and immediately unlock before returning, this can
avoid CPU resource waste due to lots of retries, and benefits fairness
in getting lock among multiple nodes, increase efficiency in case
modifying the same file frequently from multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    <IRQ>
    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0
    </IRQ>
   RIP: 0010:unlock_page+0x17/0x30
   RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
   RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
   RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
   RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
   R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
   R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
   RIP: 0033:0x7fa75ded638e
   RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
   RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
   RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
   RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
   R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
   R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

About performance improvement, we can see the testing time is reduced,
and CPU utilization decreases, the detailed data is as follows.  I ran
multi_mmap test case in ocfs2-test package in a three nodes cluster.

Before applying this patch:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
   1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
      5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
     95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
   2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
   2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
   2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun

  ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 14:44:52 CST 2017
  multi_mmap..................................................Passed.
  Runtime 783 seconds.

After apply this patch:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
    155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
     95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
   2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
      5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
   2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
    299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
    335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
    535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
   1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync

  ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
  Tests with "-b 4096 -C 32768"
  Thu Dec 28 15:04:12 CST 2017
  multi_mmap..................................................Passed.
  Runtime 487 seconds.

Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Eric Ren <zren@suse.com>
Acked-by: alex chen <alex.chen@huawei.com>
Acked-by: piaojun <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00
piaojun
025bcbde36 ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
If metadata is corrupted such as 'invalid inode block', we will get
failed by calling 'mount()' and then set filesystem readonly as below:

  ocfs2_mount
    ocfs2_initialize_super
      ocfs2_init_global_system_inodes
        ocfs2_iget
          ocfs2_read_locked_inode
            ocfs2_validate_inode_block
	      ocfs2_error
	        ocfs2_handle_error
	          ocfs2_set_ro_flag(osb, 0);  // set readonly

In this situation we need return -EROFS to 'mount.ocfs2', so that user
can fix it by fsck.  And then mount again.  In addition, 'mount.ocfs2'
should be updated correspondingly as it only return 1 for all errno.
And I will post a patch for 'mount.ocfs2' too.

Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:35 -08:00