61974 Commits

Author SHA1 Message Date
Linus Torvalds
0aecba6173 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
 "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
  Basically, the problem is that negative pinned dentries require
  careful treatment - unless ->d_lock is locked or parent is held at
  least shared, another thread can make them positive right under us.

  Most of the uses turned out to be safe - the main surprises as far as
  filesystems are concerned were

   - race in dget_parent() fastpath, that might end up with the caller
     observing the returned dentry _negative_, due to insufficient
     barriers. It is positive in memory, but we could end up seeing the
     wrong value of ->d_inode in CPU cache. Fixed.

   - manual checks that result of lookup_one_len_unlocked() is positive
     (and rejection of negatives). Again, insufficient barriers (we
     might end up with inconsistent observed values of ->d_inode and
     ->d_flags). Fixed by switching to a new primitive that does the
     checks itself and returns ERR_PTR(-ENOENT) instead of a negative
     dentry. That way we get rid of boilerplate converting negatives
     into ERR_PTR(-ENOENT) in the callers and have a single place to
     deal with the barrier-related mess - inside fs/namei.c rather than
     in every caller out there.

  The guts of pathname resolution *do* need to be careful - the race
  found by Ritesh is real, as well as several similar races.
  Fortunately, it turns out that we can take care of that with fairly
  local changes in there.

  The tree-wide audit had not been fun, and I hate the idea of repeating
  it. I think the right approach would be to annotate the places where
  we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
  catch regressions. But I'm still not sure what would be the least
  invasive way of doing that and it's clearly the next cycle fodder"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namei.c: fix missing barriers when checking positivity
  fix dget_parent() fastpath race
  new helper: lookup_positive_unlocked()
  fs/namei.c: pull positivity check into follow_managed()
2019-12-06 09:06:58 -08:00
Linus Torvalds
b0d4beaa5a Merge branch 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull autofs updates from Al Viro:
 "autofs misuses checks for ->d_subdirs emptiness; the cursors are in
  the same lists, resulting in false negatives. It's not needed anyway,
  since autofs maintains counter in struct autofs_info, containing 0 for
  removed ones, 1 for live symlinks and 1 + number of children for live
  directories, which is precisely what we need for those checks.

  This series switches to use of that counter and untangles the crap
  around its uses (it needs not be atomic and there's a bunch of
  completely pointless "defensive" checks).

  This fell out of dcache_readdir work; the main point is to get rid of
  ->d_subdirs abuses in there. I've more followup cleanups, but I hadn't
  run those by Ian yet, so they can go next cycle"

* 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  autofs: don't bother with atomics for ino->count
  autofs_dir_rmdir(): check ino->count for deciding whether it's empty...
  autofs: get rid of pointless checks around ->count handling
  autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
2019-12-05 17:11:48 -08:00
Linus Torvalds
da73fcd8cf Merge branch 'pipe-rework' (patches from David Howells)
Merge two fixes for the pipe rework from David Howells:
 "Here are a couple of patches to fix bugs syzbot found in the pipe
  changes:

   - An assertion check will sometimes trip when polling a pipe because
     the ring size and indices used are approximate and may be being
     changed simultaneously.

     An equivalent approximate calculation was done previously, but
     without the assertion check, so I've just dropped the check. To
     make it accurate, the pipe mutex would need to be taken or the spin
     lock could be used - but usage of the spinlock would need to be
     rolled out into splice, iov_iter and other places for that.

   - The index mask and the max_usage values cannot be cached across
     pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
     This can cause pipe_write() to break"

* pipe-rework:
  pipe: Fix missing mask update after pipe_wait()
  pipe: Remove assertion from pipe_poll()
2019-12-05 16:35:53 -08:00
David Howells
8f868d68d3 pipe: Fix missing mask update after pipe_wait()
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:

  BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
  Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
  ...
  CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
  ...
  Call Trace:
    pipe_write+0xc25/0xe10 fs/pipe.c:481
    call_write_iter include/linux/fs.h:1895 [inline]
    new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
    __vfs_write+0x94/0x110 fs/read_write.c:496
    vfs_write+0x18a/0x520 fs/read_write.c:558
    ksys_write+0x105/0x220 fs/read_write.c:611
    __do_sys_write fs/read_write.c:623 [inline]
    __se_sys_write fs/read_write.c:620 [inline]
    __x64_sys_write+0x6e/0xb0 fs/read_write.c:620
    do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.

Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@kernel.org>
[ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 15:56:20 -08:00
David Howells
8c7b8c34ae pipe: Remove assertion from pipe_poll()
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size.  However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.

Fix this by simply removing the assertion and accepting that the
calculation might be approximate.

Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.

Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.

Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 15:33:50 -08:00
Linus Torvalds
3f1266ec70 GFS2 changes for this merge window:
Bob's extensive filesystem withdrawal and recovery testing:
 - Don't write log headers after file system withdraw
 - clean up iopen glock mess in gfs2_create_inode
 - Close timing window with GLF_INVALIDATE_IN_PROGRESS
 - Abort gfs2_freeze if io error is seen
 - Don't loop forever in gfs2_freeze if withdrawn
 - fix infinite loop in gfs2_ail1_flush on io error
 - Introduce function gfs2_withdrawn
 - fix glock reference problem in gfs2_trans_remove_revoke
 
 Filesystems with a block size smaller than the page size:
 - Fix end-of-file handling in gfs2_page_mkwrite
 - Improve mmap write vs. punch_hole consistency
 
 Other:
 - Remove active journal side effect from gfs2_write_log_header
 - Multi-block allocations in gfs2_page_mkwrite
 
 Minor cleanups and coding style fixes:
 - Remove duplicate call from gfs2_create_inode
 - make gfs2_log_shutdown static
 - make gfs2_fs_parameters static
 - Some whitespace cleanups
 - removed unnecessary semicolon
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJd6VQ1AAoJENW/n+sDE2U6tUsP/2Zd64dVA+2KzpfvklAUNpx/
 TENbvRqGVvhofxuScY0oAvcg0KZN03K/L6ZCa71e7RI7T3VM/FG3rKNCkm084a9d
 r+Rxcz+eSkce5qKVNva6CtQiczGAj27iNiho0Y9IgtzEZ5KEVJuY3cV8Z2qzHrQM
 Rel2aVpUtaLhbFOj3jLGMt/HHSw8RTTzNqtJJCwRys/tVF1WPVqyNg4PD9a7zMT7
 z96tqrlQQxFT4SGiZVJwQHFGuQZEnbr2ahNRmivmGtnNnawLxpEdFuFrSAsC73UB
 wHO0Dq+7vYVTyQ7HugrqdkXxqyQr5ta06Pep7uj8ZvhoLWZvPuBf2SccpIn9Ufvo
 9iLFY5Z9cHg6wpsW+YMG75Mz0A6WbPRIScVog0fxaKW+z0vMZOmsT6hT6llAlCzn
 oj1igqVEOIBTS+4uMDIOJKvMixo4NTdgsLFQyftUxNHiCw5iGbqkV7ux31YjqX22
 A830zv8lba44BsixGtuPEy/0Dnka7rMfRp0cflCzKESSLIdXtSjUSEtS6g0fJISS
 qKNmnnkHpvjBGMG4lOqJihJdQ+2IiIMLWdNgxkvWgt6F7xjl3gcdREjJF0dmFsd+
 GfDUkUZ/70T9UPsaTR2V0GBvEleq4PglALcet9Eela7tKOEziNf3L+prRjMr3909
 y4uJIMH9/no1knKBxtkJ
 =b1m/
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Andreas Gruenbacher:
 "Bob's extensive filesystem withdrawal and recovery testing:
   - don't write log headers after file system withdraw
   - clean up iopen glock mess in gfs2_create_inode
   - close timing window with GLF_INVALIDATE_IN_PROGRESS
   - abort gfs2_freeze if io error is seen
   - don't loop forever in gfs2_freeze if withdrawn
   - fix infinite loop in gfs2_ail1_flush on io error
   - introduce function gfs2_withdrawn
   - fix glock reference problem in gfs2_trans_remove_revoke

  Filesystems with a block size smaller than the page size:
   - fix end-of-file handling in gfs2_page_mkwrite
   - improve mmap write vs. punch_hole consistency

  Other:
   - remove active journal side effect from gfs2_write_log_header
   - multi-block allocations in gfs2_page_mkwrite

  Minor cleanups and coding style fixes:
   - remove duplicate call from gfs2_create_inode
   - make gfs2_log_shutdown static
   - make gfs2_fs_parameters static
   - some whitespace cleanups
   - removed unnecessary semicolon"

* tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't write log headers after file system withdraw
  gfs2: Remove duplicate call from gfs2_create_inode
  gfs2: clean up iopen glock mess in gfs2_create_inode
  gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
  gfs2: Abort gfs2_freeze if io error is seen
  gfs2: Don't loop forever in gfs2_freeze if withdrawn
  gfs2: fix infinite loop in gfs2_ail1_flush on io error
  gfs2: Introduce function gfs2_withdrawn
  gfs2: fix glock reference problem in gfs2_trans_remove_revoke
  gfs2: make gfs2_log_shutdown static
  gfs2: Remove active journal side effect from gfs2_write_log_header
  gfs2: Fix end-of-file handling in gfs2_page_mkwrite
  gfs2: Multi-block allocations in gfs2_page_mkwrite
  gfs2: Improve mmap write vs. punch_hole consistency
  gfs2: make gfs2_fs_parameters static
  gfs2: Some whitespace cleanups
  gfs2: removed unnecessary semicolon
2019-12-05 13:20:11 -08:00
Linus Torvalds
a231582359 The two highlights are a set of improvements to how rbd read-only
mappings are handled and a conversion to the new mount API (slightly
 complicated by the fact that we had a common option parsing framework
 that called out into rbd and the filesystem instead of them calling
 into it).  Also included a few scattered fixes and a MAINTAINERS update
 for rbd, adding Dongsheng as a reviewer.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3oDqwTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8/OCACAhPEoSkG8J0XOgP0NlFQGRvikugKq
 wlUfpNhkJOVmyM1t9LgiHHirTa7/kA76wPo/iHtnvjIZuZoaX3+NoZX5DwgKVCo1
 SCQdXR4ohVPiYxUpK+z/fDXxpYHhaO2SAww+RRHSDxnlN5CHqFBcBhRBPfhraZT5
 dwiQt7++UOnp/hfk1Dqg5EogmSdLxqWyjClKf2lliZkzbU9YXmGapqQsur6sBk+e
 cLRmRBmMw4cDAKLL1taCympN0AxNMcePs1njvdwQ7XabNWrT061yFyt1ZNwAV/Nu
 0nCyh/9IwQcsR0EvK7FCdUEJPy88Reufd+GleS4nkEZpbxQBzo0aGow0
 =Egtk
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The two highlights are a set of improvements to how rbd read-only
  mappings are handled and a conversion to the new mount API (slightly
  complicated by the fact that we had a common option parsing framework
  that called out into rbd and the filesystem instead of them calling
  into it).

  Also included a few scattered fixes and a MAINTAINERS update for rbd,
  adding Dongsheng as a reviewer"

* tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client:
  libceph, rbd, ceph: convert to use the new mount API
  rbd: ask for a weaker incompat mask for read-only mappings
  rbd: don't query snapshot features
  rbd: remove snapshot existence validation code
  rbd: don't establish watch for read-only mappings
  rbd: don't acquire exclusive lock for read-only mappings
  rbd: disallow read-write partitions on images mapped read-only
  rbd: treat images mapped read-only seriously
  rbd: introduce RBD_DEV_FLAG_READONLY
  rbd: introduce rbd_is_snap()
  ceph: don't leave ino field in ceph_mds_request_head uninitialized
  ceph: tone down loglevel on ceph_mdsc_build_path warning
  rbd: update MAINTAINERS info
  ceph: fix geting random mds from mdsmap
  rbd: fix spelling mistake "requeueing" -> "requeuing"
  ceph: make several helper accessors take const pointers
  libceph: drop unnecessary check from dispatch() in mon_client.c
2019-12-05 13:06:51 -08:00
Linus Torvalds
7ce4fab819 fuse update for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXedyjQAKCRDh3BK/laaZ
 PCR7AQCf+bb3/so1bygFeblBTT4UZbYZRXz2nZNSA5tgJvafWwD+I4MlqR+tEixb
 gZEusbtAVrtm3hJrBc+1fA1wacGhmAg=
 =Jg/0
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:

 - Fix a regression introduced in the last release

 - Fix a number of issues with validating data coming from userspace

 - Some cleanups in virtiofs

* tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix Kconfig indentation
  fuse: fix leak of fuse_io_priv
  virtiofs: Use completions while waiting for queue to be drained
  virtiofs: Do not send forget request "struct list_head" element
  virtiofs: Use a common function to send forget
  virtiofs: Fix old-style declaration
  fuse: verify nlink
  fuse: verify write return
  fuse: verify attributes
2019-12-05 12:44:22 -08:00
Zorro Lang
c275779ff2 iomap: stop using ioend after it's been freed in iomap_finish_ioend()
This patch fixes the following KASAN report. The @ioend has been
freed by dio_put(), but the iomap_finish_ioend() still trys to access
its data.

[20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
[20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184

[20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
[20563.653887] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
[20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
[20563.669547] Call trace:
[20563.671993]  dump_backtrace+0x0/0x370
[20563.675648]  show_stack+0x1c/0x28
[20563.678958]  dump_stack+0x138/0x1b0
[20563.682455]  print_address_description.isra.9+0x60/0x378
[20563.687759]  __kasan_report+0x1a4/0x2a8
[20563.691587]  kasan_report+0xc/0x18
[20563.694985]  __asan_report_load8_noabort+0x18/0x20
[20563.699769]  iomap_finish_ioend+0x58c/0x5c0
[20563.703944]  iomap_finish_ioends+0x110/0x270
[20563.708396]  xfs_end_ioend+0x168/0x598 [xfs]
[20563.712823]  xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.716834]  process_one_work+0x7f0/0x1ac8
[20563.720922]  worker_thread+0x334/0xae0
[20563.724664]  kthread+0x2c4/0x348
[20563.727889]  ret_from_fork+0x10/0x18

[20563.732941] Allocated by task 83403:
[20563.736512]  save_stack+0x24/0xb0
[20563.739820]  __kasan_kmalloc.isra.9+0xc4/0xe0
[20563.744169]  kasan_slab_alloc+0x14/0x20
[20563.747998]  slab_post_alloc_hook+0x50/0xa8
[20563.752173]  kmem_cache_alloc+0x154/0x330
[20563.756185]  mempool_alloc_slab+0x20/0x28
[20563.760186]  mempool_alloc+0xf4/0x2a8
[20563.763845]  bio_alloc_bioset+0x2d0/0x448
[20563.767849]  iomap_writepage_map+0x4b8/0x1740
[20563.772198]  iomap_do_writepage+0x200/0x8d0
[20563.776380]  write_cache_pages+0x8a4/0xed8
[20563.780469]  iomap_writepages+0x4c/0xb0
[20563.784463]  xfs_vm_writepages+0xf8/0x148 [xfs]
[20563.788989]  do_writepages+0xc8/0x218
[20563.792658]  __writeback_single_inode+0x168/0x18f8
[20563.797441]  writeback_sb_inodes+0x370/0xd30
[20563.801703]  wb_writeback+0x2d4/0x1270
[20563.805446]  wb_workfn+0x344/0x1178
[20563.808928]  process_one_work+0x7f0/0x1ac8
[20563.813016]  worker_thread+0x334/0xae0
[20563.816757]  kthread+0x2c4/0x348
[20563.819979]  ret_from_fork+0x10/0x18

[20563.825028] Freed by task 22184:
[20563.828251]  save_stack+0x24/0xb0
[20563.831559]  __kasan_slab_free+0x10c/0x180
[20563.835648]  kasan_slab_free+0x10/0x18
[20563.839389]  slab_free_freelist_hook+0xb4/0x1c0
[20563.843912]  kmem_cache_free+0x8c/0x3e8
[20563.847745]  mempool_free_slab+0x20/0x28
[20563.851660]  mempool_free+0xd4/0x2f8
[20563.855231]  bio_free+0x33c/0x518
[20563.858537]  bio_put+0xb8/0x100
[20563.861672]  iomap_finish_ioend+0x168/0x5c0
[20563.865847]  iomap_finish_ioends+0x110/0x270
[20563.870328]  xfs_end_ioend+0x168/0x598 [xfs]
[20563.874751]  xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.878755]  process_one_work+0x7f0/0x1ac8
[20563.882844]  worker_thread+0x334/0xae0
[20563.886584]  kthread+0x2c4/0x348
[20563.889804]  ret_from_fork+0x10/0x18

[20563.894855] The buggy address belongs to the object at fffffc0c54a36900
                which belongs to the cache bio-1 of size 248
[20563.906844] The buggy address is located 40 bytes inside of
                248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
[20563.918485] The buggy address belongs to the page:
[20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
[20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
[20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
[20563.948300] page dumped because: kasan: bad access detected

[20563.955345] Memory state around the buggy address:
[20563.960129]  fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.967342]  fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[20563.981766]                                   ^
[20563.986288]  fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.993501]  fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20564.000713] ==================================================================

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
Signed-off-by: Zorro Lang <zlang@redhat.com>
Fixes: 9cd0ed63ca514 ("iomap: enhance writeback error message")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-05 07:41:16 -08:00
LimingWu
0b4295b5e2 io_uring: fix a typo in a comment
thatn -> than.

Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 07:59:37 -07:00
Pavel Begunkov
4493233edc io_uring: hook all linked requests via link_list
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.

Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:52 -07:00
Pavel Begunkov
2e6e1fde32 io_uring: fix error handling in io_queue_link_head
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.

Stop consuming sqes, and let the user handle errors.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:51 -07:00
Alexey Dobriyan
658c033565 fs/binfmt_elf.c: extract elf_read() function
ELF reads done by the kernel have very complicated error detection code
which better live in one place.

Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Alexey Dobriyan
81696d5d54 fs/binfmt_elf.c: delete unused "interp_map_addr" argument
Link: http://lkml.kernel.org/r/20191005165049.GA26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Heiher
339ddb53d3 fs/epoll: remove unnecessary wakeups of nested epoll
Take the case where we have:

        t0
         | (ew)
        e0
         | (et)
        e1
         | (lt)
        s0

t0: thread 0
e0: epoll fd 0
e1: epoll fd 1
s0: socket fd 0
ew: epoll_wait
et: edge-trigger
lt: level-trigger

We remove unnecessary wakeups to prevent the nested epoll that working in edge-
triggered mode to waking up continuously.

Test code:
 #include <unistd.h>
 #include <sys/epoll.h>
 #include <sys/socket.h>

 int main(int argc, char *argv[])
 {
 	int sfd[2];
 	int efd[2];
 	struct epoll_event e;

 	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
 		goto out;

 	efd[0] = epoll_create(1);
 	if (efd[0] < 0)
 		goto out;

 	efd[1] = epoll_create(1);
 	if (efd[1] < 0)
 		goto out;

 	e.events = EPOLLIN;
 	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
 		goto out;

 	e.events = EPOLLIN | EPOLLET;
 	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
 		goto out;

 	if (write(sfd[1], "w", 1) != 1)
 		goto out;

 	if (epoll_wait(efd[0], &e, 1, 0) != 1)
 		goto out;

 	if (epoll_wait(efd[0], &e, 1, 0) != 0)
 		goto out;

 	close(efd[0]);
 	close(efd[1]);
 	close(sfd[0]);
 	close(sfd[1]);

 	return 0;

 out:
 	return -1;
 }

More tests:
 https://github.com/heiher/epoll-wakeup

Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
Signed-off-by: hev <r@hev.cc>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Jason Baron
f6520c5208 epoll: simplify ep_poll_safewake() for CONFIG_DEBUG_LOCK_ALLOC
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
ep_call_nested() in order to pass the correct subclass argument to
spin_lock_irqsave_nested().  However, ep_call_nested() adds unnecessary
checks for epoll depth and loops that are already verified when doing
EPOLL_CTL_ADD.  This mirrors a conversion that was done for
!CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a44 ("epoll: remove
ep_call_nested() from ep_eventpoll_poll()")

Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Krzysztof Kozlowski
3d82191c22 fs/proc/Kconfig: fix indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
        $ sed -e 's/^        /	/' -i */Kconfig

[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
70a731c0e3 fs/proc/internal.h: shuffle "struct pde_opener"
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.

Space savings:

	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
	Function                                     old     new   delta
	close_pdeo                                   228     227      -1
	proc_reg_release                              86      82      -4
	proc_entry_rundown                           143     139      -4
	proc_reg_open                                298     289      -9

Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
5f6354eaa5 fs/proc/generic.c: delete useless "len" variable
Pointer to next '/' encodes length of path element and next start
position.  Subtraction and increment are redundant.

Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Alexey Dobriyan
e06689bf57 proc: change ->nlink under proc_subdir_lock
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not.  Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.

Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Jens Axboe
78076bb64a io_uring: use hash table for poll command lookups
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 20:12:58 -07:00
Jens Axboe
08bdcc35f0 io-wq: clear node->next on list deletion
If someone removes a node from a list, and then later adds it back to
a list, we can have invalid data in ->next. This can cause all sorts
of issues. One such use case is the IORING_OP_POLL_ADD command, which
will do just that if we race and get woken twice without any pending
events. This is a pretty rare case, but can happen under extreme loads.
Dan reports that he saw the following crash:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
Oops: 0002 [#1] SMP
CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
RIP: 0010:io_wqe_enqueue+0x3e/0xd0
Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <IRQ>
 io_poll_wake+0x12f/0x2a0
 __wake_up_common+0x86/0x120
 __wake_up_common_lock+0x7a/0xc0
 sock_def_readable+0x3c/0x70
 tcp_rcv_established+0x557/0x630
 tcp_v6_do_rcv+0x118/0x3c0
 tcp_v6_rcv+0x97e/0x9d0
 ip6_protocol_deliver_rcu+0xe3/0x440
 ip6_input+0x3d/0xc0
 ? ip6_protocol_deliver_rcu+0x440/0x440
 ipv6_rcv+0x56/0xd0
 ? ip6_rcv_finish_core.isra.18+0x80/0x80
 __netif_receive_skb_one_core+0x50/0x70
 netif_receive_skb_internal+0x2f/0xa0
 napi_gro_receive+0x125/0x150
 mlx5e_handle_rx_cqe+0x1d9/0x5a0
 ? mlx5e_poll_tx_cq+0x305/0x560
 mlx5e_poll_rx_cq+0x49f/0x9c5
 mlx5e_napi_poll+0xee/0x640
 ? smp_reschedule_interrupt+0x16/0xd0
 ? reschedule_interrupt+0xf/0x20
 net_rx_action+0x286/0x3d0
 __do_softirq+0xca/0x297
 irq_exit+0x96/0xa0
 do_IRQ+0x54/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0033:0x7fdc627a2e3a
Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10

when running a networked workload with about 5000 sockets being polled
for. Fix this by clearing node->next when the node is being removed from
the list.

Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 17:26:57 -07:00
Jens Axboe
2d28390aff io_uring: ensure deferred timeouts copy necessary data
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160c6
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc8702 instead of having the timeout deferral use its
own.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 11:12:08 -07:00
Aurelien Aptel
9a7d5a9e6d cifs: fix possible uninitialized access and race on iface_list
iface[0] was accessed regardless of the count value and without
locking.

* check count before accessing any ifaces
* make copy of iface list (it's a simple POD array) and use it without
  locking.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2019-12-04 11:51:18 -06:00
Paulo Alcantara (SUSE)
3345bb44ba cifs: Fix lookup of SMB connections on multichannel
With the addition of SMB session channels, we introduced new TCP
server pointers that have no sessions or tcons associated with them.

In this case, when we started looking for TCP connections, we might
end up picking session channel rather than the master connection,
hence failing to get either a session or a tcon.

In order to fix that, this patch introduces a new "is_channel" field
to TCP_Server_Info structure so we can skip session channels during
lookup of connections.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-04 11:50:32 -06:00
Jens Axboe
901e59bba9 io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 10:34:03 -07:00
Christoph Hellwig
1cea335d1d iomap: fix sub-page uptodate handling
bio completions can race when a page spans more than one file system
block.  Add a spinlock to synchronize marking the page uptodate.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-04 09:33:52 -08:00
Mike Marshall
f9bbb68233 orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.

The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.

When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.

The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.

We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-12-04 08:52:55 -05:00
Gao Xiang
926d165017 erofs: zero out when listxattr is called with no xattr
As David reported [1], ENODATA returns when attempting
to modify files by using EROFS as an overlayfs lower layer.

The root cause is that listxattr could return unexpected
-ENODATA by mistake for inodes without xattr. That breaks
listxattr return value convention and it can cause copy
up failure when used with overlayfs.

Resolve by zeroing out if no xattr is found for listxattr.

[1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com
Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com
Fixes: cadf1ccf1b00 ("staging: erofs: add error handling for xattr submodule")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2019-12-04 21:15:04 +08:00
Brian Foster
798a9cada4 xfs: fix mount failure crash on invalid iclog memory access
syzbot (via KASAN) reports a use-after-free in the error path of
xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
handle the case of a fully initialized ->l_iclog linked list.
Instead, it assumes that the list is partially constructed and NULL
terminated.

This bug manifested because there was no possible error scenario
after iclog list setup when the original code was added.  Subsequent
code and associated error conditions were added some time later,
while the original error handling code was never updated. Fix up the
error loop to terminate either on a NULL iclog or reaching the end
of the list.

Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-03 14:53:07 -08:00
Steve French
43f8a6a74e smb3: query attributes on file close
Since timestamps on files on most servers can be updated at
close, and since timestamps on our dentries default to one
second we can have stale timestamps in some common cases
(e.g. open, write, close, stat, wait one second, stat - will
show different mtime for the first and second stat).

The SMB2/SMB3 protocol allows querying timestamps at close
so add the code to request timestamp and attr information
(which is cheap for the server to provide) to be returned
when a file is closed (it is not needed for the many
paths that call SMB2_close that are from compounded
query infos and close nor is it needed for some of
the cases where a directory close immediately follows a
directory open.

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-03 15:48:02 -06:00
Linus Torvalds
2a31aca500 New code for 5.5:
- Make iomap_dio_rw callers explicitly tell us if they want us to wait
 - Port the xfs writeback code to iomap to complete the buffered io
   library functions
 - Refactor the unshare code to share common pieces
 - Add support for performing copy on write with buffered writes
 - Other minor fixes
 - Fix unchecked return in iomap_bmap
 - Fix a type casting bug in a ternary statement in iomap_dio_bio_actor
 - Improve tracepoints for easier diagnostic ability
 - Fix pipe page leakage in directio reads
 - Clean up iter usage in directio paths
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3kAhYACgkQ+H93GTRK
 tOviwQ/9FyTPgZKDrJMOxiEIyx1ye+wWasaUSmt1/jaXBhX8B12WkjfV5Nk970pw
 WOdiu5rnHy+Aje35luZmwPpX1H8CPvnVUTCFaYdZe9r695+XBmpg0coPr9T3Lsg/
 dMRPB2LaM1gZxaQukBTWoj2mwDanepz6Zu5dGcwEuZE26jlnFf6+yQibFlOIjFmk
 ioCkZLwkNdQlF+dutOykIilkP6imFyTU1lYueAZTmtJoM7JfXcqVmsn7A4R2pi8U
 2CF/ySTCuKu8Mwee8BTzWMcNqvYc/bshF2bQeiBNnKB+K3GGM9s5oNL5DGQTc709
 Cjhw6aq8ZTedT6AEw07zY5X1NhkCe8dQImz1qQPjCuoW2a5bVww8hvw8otUAyxOM
 /ScwHqOH/TxzuTmYHoO+2DxZYjSEFTci26dpVenxWrFt82LcMfG/DvEThzTcT/qq
 EdH7oLHbgXpvpwMkOYU9usCCkosraWKNaN4aoX9R5VhhWA4t8CcQNpXBkMDEI3ck
 xywGGiwau3aUy2ELtjgfjxRUY28rZUkoABkyR9fAc4BYbS4RDqoHgF9pY/z5H/2B
 xwDLzCA9ww6wLjeXEWt9ZBwmkNkdsFp/XeKrx2WX81LrCRoqFc3viV13FWrfuEe6
 de9jHgV1HZqkmm5pChRMJ3n5Qyah9anXXnTRvar8UedHEbxoQuE=
 =QMWG
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap cleanups from Darrick Wong:
 "Aome more new iomap code for 5.5.

  There's not much this time -- just removing some local variables that
  don't need to exist in the iomap directio code"

* tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: remove unneeded variable in iomap_dio_rw()
  iomap: Do not create fake iter in iomap_dio_bio_actor()
2019-12-03 13:18:34 -08:00
Linus Torvalds
043cf46825 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main changes in the timer code in this cycle were:

   - Clockevent updates:

      - timer-of framework cleanups. (Geert Uytterhoeven)

      - Use timer-of for the renesas-ostm and the device name to prevent
        name collision in case of multiple timers. (Geert Uytterhoeven)

      - Check if there is an error after calling of_clk_get in asm9260
        (Chuhong Yuan)

   - ABI fix: Zero out high order bits of nanoseconds on compat
     syscalls. This got broken a year ago, with apparently no side
     effects so far.

     Since the kernel would use random data otherwise I don't think we'd
     have other options but to fix the bug, even if there was a side
     effect to applications (Dmitry Safonov)

   - Optimize ns_to_timespec64() on 32-bit systems: move away from
     div_s64_rem() which can be slow, to div_u64_rem() which is faster
     (Arnd Bergmann)

   - Annotate KCSAN-reported false positive data races in
     hrtimer_is_queued() users by moving timer->state handling over to
     the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
     (Eric Dumazet)

   - Misc cleanups and small fixes"

[ I undid the "ABI fix" and updated the comments instead. The reason
  there were apparently no side effects is that the fix was a no-op.

  The updated comment is to say _why_ it was a no-op.    - Linus ]

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Zero the upper 32-bits in __kernel_timespec on 32-bit
  time: Rename tsk->real_start_time to ->start_boottime
  hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
  time: Fix spelling mistake in comment
  time: Optimize ns_to_timespec64()
  hrtimer: Annotate lockless access to timer->state
  clocksource/drivers/asm9260: Add a check for of_clk_get
  clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
  clocksource/drivers/renesas-ostm: Convert to timer_of
  clocksource/drivers/timer-of: Use unique device name instead of timer
  clocksource/drivers/timer-of: Convert last full_name to %pOF
2019-12-03 12:20:25 -08:00
Jens Axboe
87f80d623c io_uring: handle connect -EINPROGRESS like -EAGAIN
Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 11:23:54 -07:00
Jackie Liu
8cdda87a44 io_uring: remove io_wq_current_is_worker
Since commit b18fdf71e01f ("io_uring: simplify io_req_link_next()"),
the io_wq_current_is_worker function is no longer needed, clean it
up.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jackie Liu
22efde5998 io_uring: remove parameter ctx of io_submit_state_start
Parameter ctx we have never used, clean it up.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
da8c969069 io_uring: mark us with IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
f499a021ea io_uring: ensure async punted connect requests copy data
Just like commit f67676d160c6 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:30 -07:00
Jens Axboe
03b1230ca1 io_uring: ensure async punted sendmsg/recvmsg requests copy data
Just like commit f67676d160c6 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:03:35 -07:00
Jens Axboe
f67676d160 io_uring: ensure async punted read/write requests copy iovec
Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 21:33:25 -07:00
Omar Sandoval
69ffe5960d xfs: don't check for AG deadlock for realtime files in bunmapi
Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
in the wrong order. However, this check isn't applicable for realtime
files. In most cases, it just makes us do unnecessary commits. However,
without the fix from the previous commit ("xfs: fix realtime file data
space leak"), if the last and second-to-last extents also happen to have
different "AG numbers", then the break actually causes __xfs_bunmapi()
to return without making any progress, which sends
xfs_itruncate_extents_flags() into an infinite loop.

Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-02 17:58:51 -08:00
Omar Sandoval
0c4da70c83 xfs: fix realtime file data space leak
Realtime files in XFS allocate extents in rextsize units. However, the
written/unwritten state of those extents is still tracked in blocksize
units. Therefore, a realtime file can be split up into written and
unwritten extents that are not necessarily aligned to the realtime
extent size. __xfs_bunmapi() has some logic to handle these various
corner cases. Consider how it handles the following case:

1. The last extent is unwritten.
2. The last extent is smaller than the realtime extent size.
3. startblock of the last extent is not aligned to the realtime extent
   size, but startblock + blockcount is.

In this case, __xfs_bunmapi() calls xfs_bmap_add_extent_unwritten_real()
to set the second-to-last extent to unwritten. This should merge the
last and second-to-last extents, so __xfs_bunmapi() moves on to the
second-to-last extent.

However, if the size of the last and second-to-last extents combined is
greater than MAXEXTLEN, xfs_bmap_add_extent_unwritten_real() does not
merge the two extents. When that happens, __xfs_bunmapi() skips past the
last extent without unmapping it, thus leaking the space.

Fix it by only unwriting the minimum amount needed to align the last
extent to the realtime extent size, which is guaranteed to merge with
the last extent.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-02 17:58:50 -08:00
Jens Axboe
1a6b74fc87 io_uring: add general async offload context
Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:33 -07:00
Eric Biggers
490547ca2d block: don't send uevent for empty disk when not invalidating
Commit 6917d0689993 ("block: merge invalidate_partitions into
rescan_partitions") caused a regression where systemd-udevd spins
forever using max CPU starting at boot time.

It's caused by a behavior change where a KOBJ_CHANGE uevent is now sent
in a case where previously it wasn't.

Restore the old behavior.

Fixes: 6917d0689993 ("block: merge invalidate_partitions into rescan_partitions")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:30 -07:00
Jens Axboe
441cdbd544 io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR
We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:10 -07:00
Linus Torvalds
e3a251e366 This pull request contains mostly fixes for UBI, UBIFS and JFFS2:
UBI:
 - Fix a regression around producing a anchor PEB for fastmap.
   Due to a change in our locking fastmap was unable to produce
   fresh anchors an re-used the existing one a way to often.
 
 UBIFS:
 - Fixes for endianness. A few places blindly assumed little endian.
 - Fix for a memory leak in the orphan code.
 - Fix for a possible crash during a commit.
 - Revert a wrong bugfix.
 
 JFFS2:
 - Revert a bad bugfix in (false positive from a code checking
   tool).
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl3kG34WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wU70D/9K3QdG/fciafNRkVFFKnXVNXLv
 rxbfrRVCZeEo2Phm8daVQAvERWKi2Mw5jW1fNbOtDKvnPkC9+3vYmmHCienJ6VEt
 LhQ4Ey0MX8dQID1Caji49i5pPD3PBMX0m+YrxGt0uuzbr/nNWeiIUf/nyIvU95o7
 ZmmX7wWmKHUeLKC8TGpeolBVAf4xKqzp+U7NJxKpF5j6dn6hTi1J51GwsYw6aSXU
 lo5KvPWHq11ENWxJhtyWfO4MMyzTAtvbL1xEf0NRXX+aK4dUUyu0dcxzAqNzR6HU
 ER7gsIyEBKLZpgTqLpy6B9tWcnEjHHHN4tjxhD8gRGlOVBdXSlN+X45I8gjYOact
 Jkx1eF+T7E865YP+8vwoHo8eMoThxz54SnON8Dk6WaioAOGvf48w4gNnAilOb8Fa
 FZNk20pQdR2KyGc6PQ2I82vDYX1c2GmlGXOFLsEM8KxGWHjSCpQC1P946h17QSPe
 OxCptqu36czclm6zk76Ycw5/BWkbzViOcToVuQS61HLH7P/BzF1BnEM9rgwf+tD0
 4midOomxrZXmUrbC/HO4mARRdGtgonBHl1yJ8JjWhWEL9KPVLAhWJ4zHR8Zo84LJ
 BodZTzJPJ96igf+hLMAG/xFCnpGylXaz/G+W+5wUHnNJyKVsbhYIEmBscJYE60iM
 9/00Ce8r9lEpetUnKA==
 =LbMI
 -----END PGP SIGNATURE-----

Merge tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI/UBIFS/JFFS2 updates from Richard Weinberger:
 "This pull request contains mostly fixes for UBI, UBIFS and JFFS2:

  UBI:

   - Fix a regression around producing a anchor PEB for fastmap.

     Due to a change in our locking fastmap was unable to produce fresh
     anchors an re-used the existing one a way to often.

  UBIFS:

   - Fixes for endianness. A few places blindly assumed little endian.

   - Fix for a memory leak in the orphan code.

   - Fix for a possible crash during a commit.

   - Revert a wrong bugfix.

  JFFS2:

   - Revert a bad bugfix (false positive from a code checking tool)"

* tag 'upstream-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()"
  ubi: Fix producing anchor PEBs
  ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
  ubifs: do_kill_orphans: Fix a memory leak bug
  Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"
  ubifs: Fix type of sup->hash_algo
  ubifs: Fixed missed le64_to_cpu() in journal
  ubifs: Force prandom result to __le32
  ubifs: Remove obsolete TODO from dfs_file_write()
  ubi: Fix warning static is not at beginning of declaration
  ubi: Print skip_check in ubi_dump_vol_info()
2019-12-02 17:06:34 -08:00
Steve French
9e8fae2597 smb3: remove unused flag passed into close functions
close was relayered to allow passing in an async flag which
is no longer needed in this path.  Remove the unneeded parameter
"flags" passed in on close.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-12-02 18:07:17 -06:00
Colin Ian King
a9f76cf827 cifs: remove redundant assignment to pointer pneg_ctxt
The pointer pneg_ctxt is being initialized with a value that is never
read and it is being updated later with a new value.  The assignment
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-02 16:55:08 -06:00
Linus Torvalds
97eeb4d9d7 New code for 5.5:
- Fill out the build string
 - Prevent inode fork extent count overflows
 - Refactor the allocator to reduce long tail latency
 - Rework incore log locking a little to reduce spinning
 - Break up the xfs_iomap_begin functions into smaller more cohesive
 parts
 - Fix allocation alignment being dropped too early when the allocation
 request is for more blocks than an AG is large
 - Other small cleanups
 - Clean up file buftarg retrieval helpers
 - Hoist the resvsp and unresvsp ioctls to the vfs
 - Remove the undocumented biosize mount option, since it has never been
   mentioned as existing or supported on linux
 - Clean up some of the mount option printing and parsing
 - Enhance attr leaf verifier to check block structure
 - Check dirent and attr names for invalid characters before passing them
 to the vfs
 - Refactor open-coded bmbt walking
 - Fix a few places where we return EIO instead of EFSCORRUPTED after
 failing metadata sanity checks
 - Fix a synchronization problem between fallocate and aio dio corrupting
 the file length
 - Clean up various loose ends in the iomap and bmap code
 - Convert to the new mount api
 - Make sure we always log something when returning EFSCORRUPTED
 - Fix some problems where long running scrub loops could trigger soft
 lockup warnings and/or fail to exit due to fatal signals pending
 - Fix various Coverity complaints
 - Remove most of the function pointers from the directory code to reduce
 indirection penalties
 - Ensure that dquots are attached to the inode when performing unwritten
 extent conversion after io
 - Deuglify incore projid and crtime types
 - Fix another AGI/AGF locking order deadlock when renaming
 - Clean up some quota typedefs
 - Remove the FSSETDM ioctls which haven't done anything in 20 years
 - Fix some memory leaks when mounting the log fails
 - Fix an underflow when updating an xattr leaf freemap
 - Remove some trivial wrappers
 - Report metadata corruption as an error, not a (potentially) fatal
 assertion
 - Clean up the dir/attr buffer mapping code
 - Allow fatal signals to kill scrub during parent pointer checks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3fNjcACgkQ+H93GTRK
 tOv/8w//Y0Oa9Paiy8+iPTChs3/PqeKp307Fj5KONG52haMCakEJFT5+/wpkIAJw
 uUmKiPolwN1ivviIUmIS14ThTJ7NV1jq0G0h/0tC25i/3hoJrGWdzqYJMlvhlqgE
 taHrjCwPTDkhRJ0D5QCrkkHPU7lSdquO5TWxltaqYLhyLIt8SkklD6dN1dHWEPnk
 k0j3TL+VqVJDYyEj1bLwJ0QUb2C3J8ygWnlviF/WxsSeJtJpGoeLEaYXhhsUK0Dt
 aHg70OM6zzFzrJJAtJeBXpgaFsG/Pqbcw4wUWSxEMWjVSJwCSKLuZ5F+p6NcqoEj
 HeLQkaGePoO61YCInk2JKLHIyx7ohqMOt7+Dm0mdbe1pvcKwV9ZcdkqKa8L/Fm6v
 bUP6a2hEpsGy7vLnkYxwYACTLPbGX3uLw8MUr6ZpJ+SpfVLktU4ycpr8dCkJkp6a
 0qOpEeHsBDy74NkMOUa7Qrju7lJ2GiL70qqBwaPe+ubcUa3U/3WAsSekSzXgUwn8
 Fap4r8wn7cUbxymAvO06RlU8YymuulAlyjwdo9gOL/Su/5POldss6dy1YuUtyq19
 CD6NtkHqEUMsTc2cI+H65H44aEeckB1j0D2Grm2uMchAh0GcTSFVNF6jony++B8k
 s2sL2dEw9/9vr0uc1TSVF5ezxaONuyaCXdYXUkkdyq3iNvfpRCg=
 =aACq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "For this release, we changed quite a few things.

  Highlights:

   - Fixed some long tail latency problems in the block allocator

   - Removed some long deprecated (and for the past several years no-op)
     mount options and ioctls

   - Strengthened the extended attribute and directory verifiers

   - Audited and fixed all the places where we could return EFSCORRUPTED
     without logging anything

   - Refactored the old SGI space allocation ioctls to make the
     equivalent fallocate calls

   - Fixed a race between fallocate and directio

   - Fixed an integer overflow when files have more than a few
     billion(!) extents

   - Fixed a longstanding bug where quota accounting could be incorrect
     when performing unwritten extent conversion on a freshly mounted fs

   - Fixed various complaints in scrub about soft lockups and
     unresponsiveness to signals

   - De-vtable'd the directory handling code, which should make it
     faster

   - Converted to the new mount api, for better or for worse

   - Cleaned up some memory leaks

  and quite a lot of other smaller fixes and cleanups.

  A more detailed summary:

   - Fill out the build string

   - Prevent inode fork extent count overflows

   - Refactor the allocator to reduce long tail latency

   - Rework incore log locking a little to reduce spinning

   - Break up the xfs_iomap_begin functions into smaller more cohesive
     parts

   - Fix allocation alignment being dropped too early when the
     allocation request is for more blocks than an AG is large

   - Other small cleanups

   - Clean up file buftarg retrieval helpers

   - Hoist the resvsp and unresvsp ioctls to the vfs

   - Remove the undocumented biosize mount option, since it has never
     been mentioned as existing or supported on linux

   - Clean up some of the mount option printing and parsing

   - Enhance attr leaf verifier to check block structure

   - Check dirent and attr names for invalid characters before passing
     them to the vfs

   - Refactor open-coded bmbt walking

   - Fix a few places where we return EIO instead of EFSCORRUPTED after
     failing metadata sanity checks

   - Fix a synchronization problem between fallocate and aio dio
     corrupting the file length

   - Clean up various loose ends in the iomap and bmap code

   - Convert to the new mount api

   - Make sure we always log something when returning EFSCORRUPTED

   - Fix some problems where long running scrub loops could trigger soft
     lockup warnings and/or fail to exit due to fatal signals pending

   - Fix various Coverity complaints

   - Remove most of the function pointers from the directory code to
     reduce indirection penalties

   - Ensure that dquots are attached to the inode when performing
     unwritten extent conversion after io

   - Deuglify incore projid and crtime types

   - Fix another AGI/AGF locking order deadlock when renaming

   - Clean up some quota typedefs

   - Remove the FSSETDM ioctls which haven't done anything in 20 years

   - Fix some memory leaks when mounting the log fails

   - Fix an underflow when updating an xattr leaf freemap

   - Remove some trivial wrappers

   - Report metadata corruption as an error, not a (potentially) fatal
     assertion

   - Clean up the dir/attr buffer mapping code

   - Allow fatal signals to kill scrub during parent pointer checks"

* tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (198 commits)
  xfs: allow parent directory scans to be interrupted with fatal signals
  xfs: remove the mappedbno argument to xfs_da_get_buf
  xfs: remove the mappedbno argument to xfs_da_read_buf
  xfs: split xfs_da3_node_read
  xfs: remove the mappedbno argument to xfs_dir3_leafn_read
  xfs: remove the mappedbno argument to xfs_dir3_leaf_read
  xfs: remove the mappedbno argument to xfs_attr3_leaf_read
  xfs: remove the mappedbno argument to xfs_da_reada_buf
  xfs: improve the xfs_dabuf_map calling conventions
  xfs: refactor xfs_dabuf_map
  xfs: simplify mappedbno handling in xfs_da_{get,read}_buf
  xfs: report corruption only as a regular error
  xfs: Remove kmem_zone_free() wrapper
  xfs: Remove kmem_zone_destroy() wrapper
  xfs: Remove slab init wrappers
  xfs: fix attr leaf header freemap.size underflow
  xfs: fix some memory leaks in log recovery
  xfs: fix another missing include
  xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
  xfs: remove duplicated include from xfs_dir2_data.c
  ...
2019-12-02 14:46:22 -08:00
Deepa Dinamani
69738cfdfa fs: cifs: Fix atime update check vs mtime
According to the comment in the code and commit log, some apps
expect atime >= mtime; but the introduced code results in
atime==mtime.  Fix the comparison to guard against atime<mtime.

Fixes: 9b9c5bea0b96 ("cifs: do not return atime less than mtime")
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: stfrench@microsoft.com
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-02 15:15:35 -06:00