Commit Graph

50826 Commits

Author SHA1 Message Date
Dave Chinner
793d7dbe6d xfs: cancel dirty pages on invalidation
Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".

In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().

To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().

Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-16 12:11:56 -07:00
Eryu Guan
7e86600606 fs/binfmt_misc.c: node could be NULL when evicting inode
inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode.  e.g.  mount binfmt_misc filesystem then
umount it immediately:

  mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
  umount /proc/sys/fs/binfmt_misc

will result in

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
  IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
  ...
  Call Trace:
   evict+0xd3/0x1a0
   iput+0x17d/0x1d0
   dentry_unlink_inode+0xb9/0xf0
   __dentry_kill+0xc7/0x170
   shrink_dentry_list+0x122/0x280
   shrink_dcache_parent+0x39/0x90
   do_one_tree+0x12/0x40
   shrink_dcache_for_umount+0x2d/0x90
   generic_shutdown_super+0x1f/0x120
   kill_litter_super+0x29/0x40
   deactivate_locked_super+0x43/0x70
   deactivate_super+0x45/0x60
   cleanup_mnt+0x3f/0x70
   __cleanup_mnt+0x12/0x20
   task_work_run+0x86/0xa0
   exit_to_usermode_loop+0x6d/0x99
   syscall_return_slowpath+0xba/0xf0
   entry_SYSCALL_64_fastpath+0xa3/0xa

Fix it by making sure Node (e) is not NULL.

Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f918274e ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13 16:18:33 -07:00
Matthew Wilcox
f892760aa6 fs/mpage.c: fix mpage_writepage() for pages with buffers
When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
call clean_buffers() after unlocking the page we've written.  Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().

[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13 16:18:33 -07:00
Linus Torvalds
8ff0b97cf2 Changes since last update:
- Fix a stale kernel memory exposure when logging inodes.
 - Fix some build problems with CONFIG_XFS_RT=n
 - Don't change inode mode if the acl write fails, leaving the file totally
   inaccessible.
 - Fix a dangling pointer problem when removing an attr fork under memory
   pressure.
 - Don't crash while trying to invalidate a null buffer associated with a
   corrupt metadata pointer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZ3lPiAAoJEPh/dxk0SrTrfuMP/Axy7VSX71tE/eXPOmzxCVZD
 w4/usqO+OsQj+q8o+rwwuX9hz0VGF8kWZJOdgGdXpYT7pWqPmcf88wbThheTetLF
 fjevusqva0Ds+U4AE7DCNWSKQQRhu2jDgnhQXTv1hdYhWIF59qGwioIijbEvb72I
 0QW+/uV9yXmODjWL6KfRh9zRT9N4npMtszukScONwJr9t0/5ub8H03H/ktv8T9oi
 C3ljEWwyMk5lEYH8p6tpta8EbY0mrIZgo+kj33PU5s9rHvcrTGtyPNqidREUm1fL
 X3+STMytcDQFAcZdBBXHN0nFMwa8ADTrVvKmEgaR8OsXmOmrlcPn7HfVVlWrY31w
 X3awJ0b0+IXUrsbbQOPeqgTo5hIkMDkMOga5AP/rqpx1yCCOrlMHaRPXB2NxNcVw
 dyTj6IpKybhsQ4GkcqmFcgnxPPaogNpYlp6SXV5Dm+8zEJdIQNUuci/EGsNz7UcV
 msxNlJJkxczXOew6JzCyw45wTnJCxduX7Y1xrOTLaDfa9pkWO2zQBXukCJNIqVIq
 35Q4P4JVYtmwQr8XkkX9tiqU0gBWTCTG9KjmTCMm5MYkutEYM0uTNR5Jvyiobl7L
 Nn+RydssVw7ssnNfgsLhzQHPElUivRdYoYFSBa2DQp6ViILrefqQegd5INAjK63W
 7vnHVZyJMHPM0YFoiX8w
 =6Yvh
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix a stale kernel memory exposure when logging inodes.

 - Fix some build problems with CONFIG_XFS_RT=n

 - Don't change inode mode if the acl write fails, leaving the file
   totally inaccessible.

 - Fix a dangling pointer problem when removing an attr fork under
   memory pressure.

 - Don't crash while trying to invalidate a null buffer associated with
   a corrupt metadata pointer.

* tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: handle error if xfs_btree_get_bufs fails
  xfs: reinit btree pointer on attr tree inactivation walk
  xfs: Fix bool initialization/comparison
  xfs: don't change inode mode if ACL update fails
  xfs: move more RT specific code under CONFIG_XFS_RT
  xfs: Don't log uninitialised fields in inode structures
2017-10-12 14:51:13 -07:00
Linus Torvalds
3206e7d5e2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fix from Jan Kara:
 "A fix for a regression in handling of quota grace times and warnings"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
2017-10-12 10:56:06 -07:00
Eric Sandeen
93e8befc17 xfs: handle error if xfs_btree_get_bufs fails
Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:

XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000

_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8

We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.

Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:07 -07:00
Brian Foster
f35c5e10c6 xfs: reinit btree pointer on attr tree inactivation walk
xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.

The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.

Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.

To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:07 -07:00
Thomas Meyer
749f24f33e xfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:06 -07:00
Dave Chinner
67f2ffe31d xfs: don't change inode mode if ACL update fails
If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.

Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.

This fixes xfstests generic/449.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:06 -07:00
Dave Chinner
bb9c2e5433 xfs: move more RT specific code under CONFIG_XFS_RT
Various utility functions and interfaces that iterate internal
devices try to reference the realtime device even when RT support is
not compiled into the kernel.

Make sure this code is excluded from the CONFIG_XFS_RT=n build,
and where appropriate stub functions to return fatal errors if
they ever get called when RT support is not present.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:06 -07:00
Dave Chinner
20413e37d7 xfs: Don't log uninitialised fields in inode structures
Prevent kmemcheck from throwing warnings about reading uninitialised
memory when formatting inodes into the incore log buffer. There are
several issues here - we don't always log all the fields in the
inode log format item, and we never log the inode the
di_next_unlinked field.

In the case of the inode log format item, this is exacerbated
by the old xfs_inode_log_format structure padding issue. Hence make
the padded, 64 bit aligned version of the structure the one we always
use for formatting the log and get rid of the 64 bit variant. This
means we'll always log the 64-bit version and so recovery only needs
to convert from the unpadded 32 bit version from older 32 bit
kernels.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:06 -07:00
Alexander Levin
56ae414e9d 9p: set page uptodate when required in write_end()
Commit 77469c3f57 prevented setting the page as uptodate when we wrote
the right amount of data, fix that.

Fixes: 77469c3f57 ("9p: saner ->write_end() on failing copy into non-uptodate page")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-11 09:30:08 -07:00
Linus Torvalds
ce3861819a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Fairly old DIO bug caught by Andreas (3.10+) and several slightly
  younger blk_rq_map_user_iov() bugs, both on map and copy codepaths
  (Vitaly and me)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bio_copy_user_iov(): don't ignore ->iov_offset
  more bio_map_user_iov() leak fixes
  fix unbalanced page refcounting in bio_map_user_iov
  direct-io: Prevent NULL pointer access in submit_page_section
2017-10-11 09:00:22 -07:00
Andreas Gruenbacher
899f0429c7 direct-io: Prevent NULL pointer access in submit_page_section
In the code added to function submit_page_section by commit b1058b981,
sdio->bio can currently be NULL when calling dio_bio_submit.  This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.

Fixes xfstest generic/250 on gfs2.

Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10 23:10:02 -04:00
Linus Torvalds
f953d2481e One fix for a 4.14 regression, and one minor fix to the MAINTAINERs
file. (I was weirdly flattered by the idea that lots of random people
 suddenly seemed to think Jeff and I were VFS experts.  Turns out it was
 just a typo.)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZ3RyyAAoJECebzXlCjuG+JvQP/RkwFqMZJHDjhSDhj/cr/t2o
 ciK5Xche1A4E5vaaPVV17w6OwIYTNhnQawwBtNw88GaqDUEELVyFZFzNtRm44Bv1
 27RLOahPTT6bmHl/cd+uNpgpXs9svuNF6x4C5SUmKTm4kFdLBP7khjdcnFhwFi2y
 OerDFj4XmPsUDqW8dv7a7XktRf1klMvhbRh80r9TR5JW+h4IYQIYNevue9CABpUm
 4vvv4kAyxo8oodslCMQ5OyWpG4NDDsFADtlLn++9tzUl7y5j6TQyIYfeYDH3XOru
 5Ara5pkuxloS1Fu4EtEInF3iLAjMZkJD+QgHFhf2/mLMzQhZZzpbnFYPhrgyQONv
 wR3u7DaH2t/JbYtlSnKQpLEG0hv2hSBQ33G4ysKUHXrhnF5DC9N59epcA2X34++B
 DSwyc2wgxNfr8OGPyaNNw/kcBJyahNvsxlpTxZfTnvc0p4M1dzr1mxl/zsGC2b3v
 Ei1Y+u5JU2d/jmzeTOLCGtc59UyAoswdVzNa8SNYad1Tu5eAr81uooCPUvj77lTj
 GWQa9wYSOxt+Ld295dtzagqx+hQFdVKa+QTzfaZuPHeuUWmhQLGgalWXCxlVKtuF
 SGfAfutikQ4zbfAEz9PuNoThywfppiWbE74pfHRDkteL5+o2JQBLOSo6V6Ow0xV6
 O4cOvwV5X/RExbOoZlx1
 =yj7E
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fix from Bruce Fields:
 "One fix for a 4.14 regression, and one minor fix to the MAINTAINERs
  file. (I was weirdly flattered by the idea that lots of random people
  suddenly seemed to think Jeff and I were VFS experts. Turns out it was
  just a typo)"

* tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux:
  nfsd4: define nfsd4_secinfo_no_name_release()
  MAINTAINERS: associate linux/fs.h with VFS instead of file locking
2017-10-10 13:01:51 -07:00
Linus Torvalds
7056964a85 f2fs-for-4.14-rc5
This contains one bug fix which causes a kernel panic during fstrim introduced
 in 4.14-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlnc+awACgkQQBSofoJI
 UNIQLg/9HB/NikmBxVtkDtwrTKpVEPK5AYRHOvoa9k6twGkU6pB8FE0cd2PstwlZ
 tAwRstyt8W9nGzF5BPY+WAyVs9ybc26wIqNo13cnzwXbc0/cc4pTy8lzeiFQdQrK
 JIzz2lHNt0b5euCsEEAsnwK+rTb5DPUMKm8JkBUQ8f94oxIHLWvg7Um9FBppTw7s
 JNOJ8/ymzQVNlWu7VxFaVwfUPbEhK7gtpSWjO65fiprQ0JjwXLEr65356XU2XW8x
 lhQkByPMfMv1ZyGSNr3m4Hih0M6250slNHzwrZDxTdH7NDJmy1DfcPiM+epMWZMa
 4uT+2hsxhTCqDQbIEvP9jv+KVHV7AG9ldCD04a0RD+XoNKDVLKlzSMFWVcWE/d0H
 jSaDrMZj+taseF72x/efP8P/RrTbzqYsqBoAkoByibOXvBf7U8vsLK4NuG7agoL4
 EUXDMuVJDB5d8LJRSYt0lPI5R+lhRVlVuint7a9T09yiLyCeR0wGf+eoH9C9Y4V8
 t/mEM9azBi9l7T0yraVfqnh+SPzwwlxYOLQeZTi0bf3uqmBOeKb0OvfOiwboOnaZ
 5Rl6jYD/hgZAowXpbohRjqPJhMoLMabsTJ4kHj6uJcQDhvTqDpamm9g9Afsiyr6z
 xPYo09iHHlWA/iSiV7VSnbZu8hr59bchVt86r77fy/4YH3DXOcM=
 =fAsG
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fix from Jaegeuk Kim:
 "This contains one bug fix which causes a kernel panic during fstrim
  introduced in 4.14-rc1"

* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix potential panic during fstrim
2017-10-10 11:04:00 -07:00
Jan Kara
ac3d79392f quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
Eryu has reported that since commit 7b9ca4c61b "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.

Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c61b
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-10 17:24:46 +02:00
Linus Torvalds
68ebe3cbe7 NFS client bugfixes for Linux 4.14
Hightlights include:
 
 stable fixes:
 - nfs/filelayout: fix oops when freeing filelayout segment
 - NFS: Fix uninitialized rpc_wait_queue
 
 bugfixes:
 - NFSv4/pnfs: Fix an infinite layoutget loop
 - nfs: RPC_MAX_AUTH_SIZE is in bytes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZ27KKAAoJEGcL54qWCgDybIIP/Ai9g9AQ52B7Id0VhcB40fZM
 Bn8I8nYbSzkOivL+w5DHW5eTg0spJ2+iEBjOucPkDWuK0hmeu7kDaIIfauaBTmcM
 dg2eQMVEaU8PnB0Bf9xMF1hR4Jf3laPVaW3Dnpl01+eJu0feQVf3EDJOzwDll5e6
 GDt8wuKXjfXZmHEVuvMvD/YSbzlLgKIyp62VRWXWMM73VUHL9YNc0VDaX6LTHzkM
 fYK+jWEgoq93/xuC2cP98+PyoziL82AYl7em0mcHTeffHm6FlB2KXrQq6dsW3UqI
 QMHQdqn6j+CWAv/PyJP+AifT/pTlvnor9ia4TVXlleWwrMSllUDCEttWi0jaBJxv
 OhaQgaQQEIGb6TLo7qbmHIX/VXxC1UMfjkx1Eqr4vu/Ps8y9t1Wy6V+pd86+QbzG
 qo/+jtFVHTMWIU9JBlowKoAJkeyeMfhL4cfSqcgdsSj9JJ2O/F/a/BFNh3bgui69
 TeSFLMoS0FCw9T2h2QeMCSwXvETmFDZR2pUXdsoULxYH0jZ4oPr7Fr9GflsSITwA
 oCITgkpt1oOoB5V/PrLPWfjq0JzcA69VAgmD1WJn5eNz1AvQErYYNU+VDf51T4rm
 zEAxk26WB7+KBBYMEyRCBeatnAAx0a28MFyYI7ittwovOkXIXOv/dw2bFZbSNyoc
 vpe4ZMGP442znvyy5Myh
 =QOH4
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include:

  stable fixes:
   - nfs/filelayout: fix oops when freeing filelayout segment
   - NFS: Fix uninitialized rpc_wait_queue

  bugfixes:
   - NFSv4/pnfs: Fix an infinite layoutget loop
   - nfs: RPC_MAX_AUTH_SIZE is in bytes"

* tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4/pnfs: Fix an infinite layoutget loop
  nfs/filelayout: fix oops when freeing filelayout segment
  sunrpc: remove redundant initialization of sock
  NFS: Fix uninitialized rpc_wait_queue
  NFS: Cleanup error handling in nfs_idmap_request_key()
  nfs: RPC_MAX_AUTH_SIZE is in bytes
2017-10-09 10:55:37 -07:00
Linus Torvalds
eab26ad197 Changes since last update:
- fix a race between overlapping copy on write aio
 - fix cow fork swapping when we defragment reflinked files
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZ1/WPAAoJEPh/dxk0SrTrk8AP/0rV3Cb6tknRTwNPHWC2KG+v
 UPP2KmN9tGPrqbrDTzMYdQC4/UNE4Je7+hMevF+A61Q7rug/4xofGP3Bl+vxWV22
 Y2lDA2jGHDnA20tvHvNUNJ+8aWbiHXXkzYCbohrlTHteDMaB+diHLp7jtePPrgzu
 ++qBM2X2noXhC3B6MB/GzEDUyTwHgEySsfx2IJDHs7LkQR5qV9UF8f1SSLbr9o7u
 N7JJ6CXUW5Dfb6Sxk8WJGEBHxTzf14vdPeTOmnsx1OwW9FFidVtcr8/YdY6Cv1F+
 LjpDuR/pWwJM0Ig1BB03jIcKNoG6Q6V1AJjNdZkq0hoEYc4Z8mNdyHPPSyvgMqqS
 733eMJI7q1Cu546XBP2NKmzUBJr4wVNPxTVbxZnbqrL1ybODTzKuDRkgpkoE8Hrg
 gSKXi4gnXJkR4/N5DPN+dP3cLMRl81QJ6widiZdpvxWzJGaOM1Ynu/o9mmo0yj7K
 rlHQ6tgex2TyuTys+jyPgRb489rf6eKnNTxu2I4F4nNbHsNOiNa8eVUc7FLP1SxL
 SfL2PUmUgcI1FcLl3yMZ2wZ3zP+PMV005aZB2q9KW08COF/ASXOX87efsQ91WaUy
 rEzOZoBxZvfc0DA0G5Tmlb+MbGtlfdjDfidPygmDeBrSRPJpUyxTh7xoRk/an8wL
 B4QtpX77Pj/qQNbuThkv
 =oDmt
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - fix a race between overlapping copy on write aio

 - fix cow fork swapping when we defragment reflinked files

* tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: handle racy AIO in xfs_reflink_end_cow
  xfs: always swap the cow forks when swapping extents
2017-10-06 15:53:36 -07:00
Linus Torvalds
bf2db0b9f5 Merge branch 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Two more fixes for bugs introduced in 4.13.

  The sector_t problem with 32bit architecture and !LBDAF config seems
  serious but the number of affected deployments is hopefully low.

  The clashing status bits could lead to a confusing in-memory state of
  the whole-filesystem operations if used with the quota override sysfs
  knob"

* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix overlap of fs_info::flags values
  btrfs: avoid overflow when sector_t is 32 bit
2017-10-06 09:03:08 -07:00
Linus Torvalds
b77779b93d Two fixups for CephFS snapshot-handling patches in -rc1.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZ142+AAoJEEp/3jgCEfOLPdQH/0wFtTLG7sKhEBVndsDUG8u0
 RUtLBE4dXFJU7IlLQOuAkD4GvC4XqttLIJs7bkUwSUu7Vk3+2OKk0JvUq2qKFl03
 tM5sWIqX5FkL9nenivV28YI6rOPHyghVXttVw/4xy5QYLJ1G3OoJpGPJOE44v5v9
 w96guw+EEaPWyn8+/SBhEkfpVAR2fRXe4UDKiLzGYLqYNYiGSSd90j/7F8I4uaNG
 hpQ6aJVJOzNoTQtfmsGyZ0DHuBD8/CSQOIumXdICegDk7stEVGaxSlkBX2ZwwR2q
 jwxIRj6ItM+jDORSgaVAhQ6NJktCxs+scfNFgu8MlQ+RaTOSnEkcvigA7DIVMrw=
 =h2CQ
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two fixups for CephFS snapshot-handling patches in -rc1"

* tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix __choose_mds() for LSSNAP request
  ceph: properly queue cap snap for newly created snap realm
2017-10-06 09:01:45 -07:00
Linus Torvalds
8d4ef4e15e Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix a regression in 4.14 and one in 4.13. The latter is a case when
  Docker is doing something it really shouldn't and gets away with it.
  We now print a warning instead of erroring out.

  There are also fixes to several error paths"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix regression caused by exclusive upper/work dir protection
  ovl: fix missing unlock_rename() in ovl_do_copy_up()
  ovl: fix dentry leak in ovl_indexdir_cleanup()
  ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
  ovl: fix error value printed in ovl_lookup_index()
  ovl: fix may_write_real() for overlayfs directories
2017-10-06 08:52:53 -07:00
Eryu Guan
ec572b9e81 nfsd4: define nfsd4_secinfo_no_name_release()
Commit 34b1744c91 ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.

But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.

I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:

 # mount a local-fs device at /mnt/test, and export it via NFS with
 # fsid=0 export option (this is required)
 mount /dev/sda5 /mnt/test
 echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
 service nfs restart

 # locally mount the nfs export with all default, note that I have
 # nfsv4.1 configured as the default nfs version, because of the
 # fsid export option, v4 mount would fail and fall back to v3
 mount localhost:/mnt/test /mnt/nfs

 # try to umount the underlying device, but got EBUSY
 umount /mnt/nfs
 service nfs stop
 umount /mnt/test <=== EBUSY here

Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.

Fixes: 34b1744c91 ("nfsd4: define ->op_release for compound ops")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-10-05 14:45:25 -04:00
Amir Goldstein
85fdee1eef ovl: fix regression caused by exclusive upper/work dir protection
Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.

Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:

Terminal 1:

  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/

Terminal 2:

  unshare -m

Terminal 1:

  umount merged
  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/
  mount: /root/overlay-testing/merged: none already mounted or mount point
         busy

To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.

Documentation was updated to reflect this change.

Fixes: 2cac0c00a6 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Amir Goldstein
5820dc0888 ovl: fix missing unlock_rename() in ovl_do_copy_up()
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Amir Goldstein
dc7ab6773e ovl: fix dentry leak in ovl_indexdir_cleanup()
index dentry was not released when breaking out of the loop
due to index verification error.

Fixes: 415543d5c6 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Amir Goldstein
9f4ec904db ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Amir Goldstein
e0082a0f04 ovl: fix error value printed in ovl_lookup_index()
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Amir Goldstein
954c736f86 ovl: fix may_write_real() for overlayfs directories
Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3c9 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Trond Myklebust
e8fa33a6f6 NFSv4/pnfs: Fix an infinite layoutget loop
Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.

This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.

Fixes: 70d2f7b1ea ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-04 14:06:54 -04:00
Linus Torvalds
b7e1416441 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "A lot of stuff, sorry about that. A week on a beach, then a bunch of
  time catching up then more time letting it bake in -next. Shan't do
  that again!"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
  include/linux/fs.h: fix comment about struct address_space
  checkpatch: fix ignoring cover-letter logic
  m32r: fix build failure
  lib/ratelimit.c: use deferred printk() version
  kernel/params.c: improve STANDARD_PARAM_DEF readability
  kernel/params.c: fix an overflow in param_attr_show
  kernel/params.c: fix the maximum length in param_get_string
  mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
  mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
  kernel/kcmp.c: drop branch leftover typo
  memremap: add scheduling point to devm_memremap_pages
  mm, page_alloc: add scheduling point to memmap_init_zone
  mm, memory_hotplug: add scheduling point to __add_pages
  lib/idr.c: fix comment for idr_replace()
  mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
  kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
  include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
  lib/lz4: make arrays static const, reduces object code size
  exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
  exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
  ...
2017-10-04 09:30:50 -07:00
Tsutomu Itoh
69ad59767d Btrfs: fix overlap of fs_info::flags values
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

  commit 171938e528 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

  commit f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.

Fixes: f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:44:18 +02:00
Goffredo Baroncelli
2d8ce70a08 btrfs: avoid overflow when sector_t is 32 bit
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:22:56 +02:00
Casey Schaufler
57e7ba04d4 lsm: fix smack_inode_removexattr and xattr_getsecurity memleak
security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.

The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.

Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-10-04 18:03:15 +11:00
Christoph Hellwig
e12199f85d xfs: handle racy AIO in xfs_reflink_end_cow
If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert.  Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-03 21:27:55 -07:00
Darrick J. Wong
52bfcdd7ad xfs: always swap the cow forks when swapping extents
Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext.  We also need to
swap the extent counts and reset the cowblocks tags.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-03 21:27:55 -07:00
Oleg Nesterov
50097f7493 exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.

Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Oleg Nesterov
43a4f26190 exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.

Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.

Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: <tdhooge@llnl.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Oleg Nesterov
eb23aa0317 exec: binfmt_misc: remove the confusing e->interp_file != NULL checks
If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.

Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Oleg Nesterov
83f918274e exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()
To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.

The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid.  Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

And I can't understand why do we use filp_close(), not fput().

Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Oleg Nesterov
baba1b2973 exec: binfmt_misc: don't nullify Node->dentry in kill_node()
kill_node() nullifies/checks Node->dentry to avoid double free.  This
complicates the next changes and this is very confusing:

 - we do not need to check dentry != NULL under entries_lock,
   kill_node() is always called under inode_lock(d_inode(root)) and we
   rely on this inode_lock() anyway, without this lock the
   MISC_FMT_OPEN_FILE cleanup could race with itself.

 - if kill_inode() was already called and ->dentry == NULL we should not
   even try to close e->interp_file.

We can change bm_entry_write() to simply check !list_empty(list) before
kill_node.  Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().

Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Oleg Nesterov
c2315c187f exec: load_script: kill the onstack interp[BINPRM_BUF_SIZE] array
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".

It looks like this code was always wrong, then commit 948b701a60
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.

This patch (of 6):

load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().

The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.

While at it, do whitespace/style cleanups.

NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.

Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Andrea Arcangeli
384632e67e userfaultfd: non-cooperative: fix fork use after free
When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.

If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.

We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.

Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.

This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd).  If
it succeeded it was put back correctly.

Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released.  Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().

If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.

Fixes: 893e26e61d ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Chao Yu
638164a271 f2fs: fix potential panic during fstrim
As Ju Hyung Park reported:

"When 'fstrim' is called for manual trim, a BUG() can be triggered
randomly with this patch.

I'm seeing this issue on both x86 Desktop and arm64 Android phone.

On x86 Desktop, this was caused during Ubuntu boot-up. I have a
cronjob installed which calls 'fstrim -v /' during boot. On arm64
Android, this was caused during GC looping with 1ms gc_min_sleep_time
& gc_max_sleep_time."

Root cause of this issue is that f2fs_wait_discard_bios can only be
used by f2fs_put_super, because during put_super there must be no
other referrers, so it can ignore discard entry's reference count
when removing the entry, otherwise in other caller we will hit bug_on
in __remove_discard_cmd as there may be other issuer added reference
count in discard entry.

Thread A				Thread B
					- issue_discard_thread
- f2fs_ioc_fitrim
 - f2fs_trim_fs
  - f2fs_wait_discard_bios
   - __issue_discard_cmd
    - __submit_discard_cmd
					 - __wait_discard_cmd
					  - dc->ref++
					  - __wait_one_discard_bio
   - __wait_discard_cmd
    - __remove_discard_cmd
     - f2fs_bug_on(sbi, dc->ref)

Fixes: 969d1b180d
Reported-by: Ju Hyung Park <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-03 08:06:05 -07:00
Yan, Zheng
38f340ccdf ceph: fix __choose_mds() for LSSNAP request
previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
of directory inode" is buggy. It makes __choose_mds() choose mds
base on hash of '.snap' dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-10-02 16:18:16 +02:00
Yan, Zheng
9f4057fc93 ceph: properly queue cap snap for newly created snap realm
commit 3ae0bebc "ceph: queue cap snap only when snap realm's
context changes" introduced a regression: we may not call
queue_realm_cap_snaps() for newly created snap realm. This
regression allows unflushed snapshot data to be overwritten.

Link: http://tracker.ceph.com/issues/21483
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-10-02 16:18:01 +02:00
Scott Mayhew
0a47df11bf nfs/filelayout: fix oops when freeing filelayout segment
Check for a NULL dsaddr in filelayout_free_lseg() before calling
nfs4_fl_put_deviceid().  This fixes the following oops:

[ 1967.645207] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 1967.646010] IP: [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.646010] PGD c08bc067 PUD 915d3067 PMD 0
[ 1967.753036] Oops: 0000 [#1] SMP
[ 1967.753036] Modules linked in: nfs_layout_nfsv41_files ext4 mbcache jbd2 loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache amd64_edac_mod ipmi_ssif edac_mce_amd edac_core kvm_amd sg kvm ipmi_si ipmi_devintf irqbypass pcspkr k8temp ipmi_msghandler i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mptsas ttm scsi_transport_sas mptscsih drm mptbase serio_raw i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod
[ 1967.790031] CPU: 2 PID: 1370 Comm: ls Not tainted 3.10.0-709.el7.test.bz1463784.x86_64 #1
[ 1967.790031] Hardware name: IBM BladeCenter LS21 -[7971AC1]-/Server Blade, BIOS -[BAE155AUS-1.10]- 06/03/2009
[ 1967.790031] task: ffff8800c42a3f40 ti: ffff8800c4064000 task.ti: ffff8800c4064000
[ 1967.790031] RIP: 0010:[<ffffffffc06d6aea>]  [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.790031] RSP: 0000:ffff8800c4067978  EFLAGS: 00010246
[ 1967.790031] RAX: ffffffffc062f000 RBX: ffff8801d468a540 RCX: dead000000000200
[ 1967.790031] RDX: ffff8800c40679f8 RSI: ffff8800c4067a0c RDI: 0000000000000000
[ 1967.790031] RBP: ffff8800c4067980 R08: ffff8801d468a540 R09: 0000000000000000
[ 1967.790031] R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8801d468a540
[ 1967.790031] R13: ffff8800c40679f8 R14: ffff8801d5645300 R15: ffff880126f15ff0
[ 1967.790031] FS:  00007f11053c9800(0000) GS:ffff88012bd00000(0000) knlGS:0000000000000000
[ 1967.790031] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1967.790031] CR2: 0000000000000030 CR3: 0000000094b55000 CR4: 00000000000007e0
[ 1967.790031] Stack:
[ 1967.790031]  ffff8801d468a540 ffff8800c4067990 ffffffffc062d2fe ffff8800c40679b0
[ 1967.790031]  ffffffffc062b5b4 ffff8800c40679f8 ffff8801d468a540 ffff8800c40679d8
[ 1967.790031]  ffffffffc06d39af ffff8800c40679f8 ffff880126f16078 0000000000000001
[ 1967.790031] Call Trace:
[ 1967.790031]  [<ffffffffc062d2fe>] nfs4_fl_put_deviceid+0xe/0x10 [nfs_layout_nfsv41_files]
[ 1967.790031]  [<ffffffffc062b5b4>] filelayout_free_lseg+0x24/0x90 [nfs_layout_nfsv41_files]
[ 1967.790031]  [<ffffffffc06d39af>] pnfs_free_lseg_list+0x5f/0x80 [nfsv4]
[ 1967.790031]  [<ffffffffc06d5a67>] _pnfs_return_layout+0x157/0x270 [nfsv4]
[ 1967.790031]  [<ffffffffc06c17dd>] nfs4_evict_inode+0x4d/0x70 [nfsv4]
[ 1967.790031]  [<ffffffff8121de19>] evict+0xa9/0x180
[ 1967.790031]  [<ffffffff8121e729>] iput+0xf9/0x190
[ 1967.790031]  [<ffffffffc0652cea>] nfs_dentry_iput+0x3a/0x50 [nfs]
[ 1967.790031]  [<ffffffff8121ab4f>] shrink_dentry_list+0x20f/0x490
[ 1967.790031]  [<ffffffff8121b018>] d_invalidate+0xd8/0x150
[ 1967.790031]  [<ffffffffc065446b>] nfs_readdir_page_filler+0x40b/0x600 [nfs]
[ 1967.790031]  [<ffffffffc0654bbd>] nfs_readdir_xdr_to_array+0x20d/0x3b0 [nfs]
[ 1967.790031]  [<ffffffff811f3482>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
[ 1967.790031]  [<ffffffff81183208>] ? __add_to_page_cache_locked+0x48/0x170
[ 1967.790031]  [<ffffffffc0654d60>] ? nfs_readdir_xdr_to_array+0x3b0/0x3b0 [nfs]
[ 1967.790031]  [<ffffffffc0654d82>] nfs_readdir_filler+0x22/0x90 [nfs]
[ 1967.790031]  [<ffffffff8118351f>] do_read_cache_page+0x7f/0x190
[ 1967.790031]  [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
[ 1967.790031]  [<ffffffff8118366c>] read_cache_page+0x1c/0x30
[ 1967.790031]  [<ffffffffc0654f9b>] nfs_readdir+0x1ab/0x6b0 [nfs]
[ 1967.790031]  [<ffffffffc06bd1c0>] ? nfs4_xdr_dec_layoutget+0x270/0x270 [nfsv4]
[ 1967.790031]  [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
[ 1967.790031]  [<ffffffff81215c20>] vfs_readdir+0xb0/0xe0
[ 1967.790031]  [<ffffffff81216045>] SyS_getdents+0x95/0x120
[ 1967.790031]  [<ffffffff816b9449>] system_call_fastpath+0x16/0x1b
[ 1967.790031] Code: 90 31 d2 48 89 d0 5d c3 85 f6 74 f5 8d 4e 01 89 f0 f0 0f b1 0f 39 f0 74 e2 89 c6 eb eb 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 <48> 8b 47 30 48 89 fb a8 04 74 3b 8b 57 60 83 fa 02 74 19 8d 4a
[ 1967.790031] RIP  [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.790031]  RSP <ffff8800c4067978>
[ 1967.790031] CR2: 0000000000000030

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 1ebf980127 ("NFS/filelayout: Fix racy setting of fl->dsaddr...")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01 18:51:30 -04:00
Benjamin Coddington
68ebf8fe3b NFS: Fix uninitialized rpc_wait_queue
Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when
CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been
initialized.  Move the initialization of the queue out of the CONFIG_NFS_V4
conditional setion.

Fixes: 7d6ddf88c4 ("NFS: Add an iocounter wait function for async RPC tasks")
Cc: stable@vger.kernel.org # 4.11+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01 18:51:30 -04:00
Dan Carpenter
cdb2e53fd6 NFS: Cleanup error handling in nfs_idmap_request_key()
nfs_idmap_get_desc() can't actually return zero.  But if it did then
we would return ERR_PTR(0) which is NULL and the caller,
nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer
dereference.

I've cleaned this up by changing the "<=" to "<" so it's more clear that
we don't return ERR_PTR(0).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01 18:51:30 -04:00
J. Bruce Fields
35c036ef4a nfs: RPC_MAX_AUTH_SIZE is in bytes
The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words.  This causes
the client to request a larger-than-necessary session replay slot size.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01 18:51:30 -04:00