74781 Commits

Author SHA1 Message Date
Linus Torvalds
11fc88c2e4 New code for 5.17:
- Fix log recovery with da btree buffers when metauuid is in use.
  - Fix type coercion problems in xattr buffer size validation.
  - Fix a bug in online scrub dir leaf bestcount checking.
  - Only run COW recovery when recovering the log.
  - Fix symlink target buffer UAF problems and symlink locking problems
    by not exposing xfs innards to the VFS.
  - Fix incorrect quotaoff lock usage.
  - Don't let transactions cancel cleanly if they have deferred work
    items attached.
  - Fix a UAF when we're deciding if we need to relog an intent item.
  - Reduce kvmalloc overhead for log shadow buffers.
  - Clean up sysfs attr group usage.
  - Fix a bug where scrub's bmap/rmap checking could race with a quota
    file block allocation due to insufficient locking.
  - Teach scrub to complain about invalid project ids.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHXOL0ACgkQ+H93GTRK
 tOulDQ/+PwvFyROQePQo+5Z7fDb92VAKX6UxFYzWZGXj3zhRmTX7fvNSniPxKqHO
 K0GMnZ9rJxvuo/MRuSEPILl+2rKh59Efo9IPcs2TR+xO+vI/EKnch2/gPLBN1y7Y
 A4peQuK6CvUjYTyMazR2PcK8y76ZW8iXyuQMxnpmYldZBCZ+v7myVoVlPCRrZ68W
 EoEQ/cDf4fmsycSG4Eh9WTddbqd9bfl1+jXEz/pzlbhJSvQOaqynR+21ahxhEQ/f
 CzjSM4kXQ9AWTfHOPnEXs+Hf02NfcKRMuguzO69gJZhooEXJlgeYAZXUKqXpdmbC
 2Tc+x/rJnQv9PtzEw/WoWMem4rhwjBX7cVi2E/SJHB46MLI5rjhQilIdWzecWd8q
 nOrjIZlReI0vxYSqe1ifejxiCsV6gv3XdyLNacjxf7ISkU4XF0caO8f+3/ocIScV
 E7k+XN13pArUiRZXOt8UL571Hoa6nhtHjyAUg6i62AFI2R9XYUEjEXKMwX7ovRVO
 8eGdeQqioHkmOkF8HGyyurB9+itccuW8dtl7YOqEQJMGXeyk7rsvpcPahKKKoiDr
 8w3bpZlNiOfZEnRcNjO9B/Egu6LLMudti2ptwlUvfvWsfa8VTO8v0h5pqVhwuwrk
 lwQThWJ5THe/MptWybdZ3kSjhMvzqi4GO75T30Hj6XhecC3dFfo=
 =RYEb
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "The big new feature here is that the mount code now only bothers to
  try to free stale COW staging extents if the fs unmounted uncleanly.
  This should reduce mount times, particularly on filesystems supporting
  reflink and containing a large number of allocation groups.

  Everything else this cycle are bugfixes, as the iomap folios
  conversion should be plenty enough excitement for anyone. That and I
  ran out of brain bandwidth after Thanksgiving last year.

  Summary:

   - Fix log recovery with da btree buffers when metauuid is in use.

   - Fix type coercion problems in xattr buffer size validation.

   - Fix a bug in online scrub dir leaf bestcount checking.

   - Only run COW recovery when recovering the log.

   - Fix symlink target buffer UAF problems and symlink locking problems
     by not exposing xfs innards to the VFS.

   - Fix incorrect quotaoff lock usage.

   - Don't let transactions cancel cleanly if they have deferred work
     items attached.

   - Fix a UAF when we're deciding if we need to relog an intent item.

   - Reduce kvmalloc overhead for log shadow buffers.

   - Clean up sysfs attr group usage.

   - Fix a bug where scrub's bmap/rmap checking could race with a quota
     file block allocation due to insufficient locking.

   - Teach scrub to complain about invalid project ids"

* tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: warn about inodes with project id of -1
  xfs: hold quota inode ILOCK_EXCL until the end of dqalloc
  xfs: Remove redundant assignment of mp
  xfs: reduce kvmalloc overhead for CIL shadow buffers
  xfs: sysfs: use default_groups in kobj_type
  xfs: prevent UAF in xfs_log_item_in_current_chkpt
  xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
  xfs: Fix comments mentioning xfs_ialloc
  xfs: check sb_meta_uuid for dabuf buffer recovery
  xfs: fix a bug in the online fsck directory leaf1 bestcount check
  xfs: only run COW extent recovery when there are no live extents
  xfs: don't expose internal symlink metadata buffers to the vfs
  xfs: fix quotaoff mutex usage now that we don't support disabling it
  xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
2022-01-11 15:01:50 -08:00
Linus Torvalds
d601e58c5f for-5.17-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmHcX64ACgkQxWXV+ddt
 WDsIyA//UGMG0waxz40RksQL+4AnTIJ+apmmxq5VGEE8y6aUPAGKm++5Cs9cC+Os
 5xuZhNnD8hjB+eWH+tRCx8ll+T4g4skGKDCyD0fWNtZAlW8pnsIbztdO0nNYx9C0
 V++vu/hQR6M8E7ORlayEKBWy2/UnBG5p/XVLPG4RJ4vMETJPl2RLWVDpu3dj09kf
 YnD3AY0vmKEyCu/b9NtSzfZMO2/lXT2U41ezLJJfmPAXcMJ0EeSAazACVDQyq24p
 wnnr6xmdo7ZR0oGFLUmBmfxbKwd3l0JIUsi/XRysXe+8y7raIE0gKCg7NpCS276T
 BZrKmefxHhdMCA1HLlH6AkrKmQUgIaceLkXTanTYv4cnVzb6XoeV6R4IO9/JQ0yv
 YsdCL7eZ4Vl3ToPlQkYWdwUNP5UjVMg5qMwxchigbwq7jViLJtiu3WKy0O0TitzB
 n4o0cCdlv3lHRp8FS6cFmbCrsavFT8/q3vz/aRdkrojOKE0jEWVJiz0bf39j5BVO
 IiJ3RF3kbpG7TZz0+eNUbgebME8zwaWfyd6t2L7Ztvx9F4elzSW95iwGc9//TZvl
 ciNFI9LQvnZRymUItxg0HNXfbJMXJAE9ImTxA8HLkfQYpaBlwbt4avkQK56LhpN+
 nFCd5cxJy5HLFIvkRDfqpF+C247p5FICzd/hs59/cXlCynulglg=
 =9lg+
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This end of the year branch is intentionally not that exciting. Most
  of the changes are under the hood, but there are some minor user
  visible improvements and several performance improvements too.

  Features:

   - make send work with concurrent block group relocation.

     We're not allowed to prevent send failing or silently producing
     some bad stream but with more fine grained locking and checks it's
     possible. The send vs deduplication exclusion could reuse the same
     logic in the future.

   - new exclusive operation 'balance paused' to allow adding a device
     to filesystem with paused balance

   - new sysfs file for fsid stored in the per-device directory to help
     distinguish devices when seeding is enabled, the fsid may differ
     from the one reported by the filesystem

  Performance improvements:

   - less metadata needed for directory logging, directory deletion is
     20-40% faster

   - in zoned mode, cache zone information during mount to speed up
     repeated queries (about 50% speedup)

   - free space tree entries get indexed and searched by size (latency
     -30%, search run time -30%)

   - less contention in tree node locking when inserting a key and no
     splits are needed (files/sec in fsmark improves by 1-20%)

  Fixes:

   - fix ENOSPC failure when attempting direct IO write into NOCOW range

   - fix deadlock between quota enable and other quota operations

   - global reserve minimum calculations fixed to account for free space
     tree

   - in zoned mode, fix condition for chunk allocation that may not find
     the right zone for reuse and could lead to early ENOSPC

  Core:

   - global reserve stealing got simplified and cleaned up in evict

   - remove async transaction commit based on manual transaction refs,
     reuse existing kthread and mechanisms to let it commit transaction
     before timeout

   - preparatory work for extent tree v2, add wrappers for global tree
     roots, truncation path cleanups

   - remove readahead framework, it's a bit overengineered and used only
     for scrub, and yet it does not cover all its needs, there is
     another readahead built in the b-tree search that is now used,
     performance drop on HDD is about 5% which is acceptable and scrub
     is often throttled anyway, on SSDs there's no reported drop but
     slight improvement

   - self tests report extent tree state when error occurs

   - replace assert with debugging information when an uncommitted
     transaction is found at unmount time

  Other:

   - error handling improvements

   - other cleanups and refactoring"

* tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: output more debug messages for uncommitted transaction
  btrfs: respect the max size in the header when activating swap file
  btrfs: fix argument list that the kdoc format and script verified
  btrfs: remove unnecessary parameter type from compression_decompress_bio
  btrfs: selftests: dump extent io tree if extent-io-tree test failed
  btrfs: scrub: cleanup the argument list of scrub_stripe()
  btrfs: scrub: cleanup the argument list of scrub_chunk()
  btrfs: remove reada infrastructure
  btrfs: scrub: use btrfs_path::reada for extent tree readahead
  btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity()
  btrfs: refactor unlock_up
  btrfs: skip transaction commit after failure to create subvolume
  btrfs: zoned: fix chunk allocation condition for zoned allocator
  btrfs: add extent allocator hook to decide to allocate chunk or not
  btrfs: zoned: unset dedicated block group on allocation failure
  btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned
  btrfs: zoned: sink zone check into btrfs_repair_one_zone
  btrfs: zoned: simplify btrfs_check_meta_write_pointer
  btrfs: zoned: encapsulate inode locking for zoned relocation
  btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device
  ...
2022-01-11 14:53:40 -08:00
Linus Torvalds
9149fe8ba7 Changes since last update:
- add sysfs interface and a sysfs node to control sync decompression;
 
  - add tail-packing inline support for compressed files;
 
  - get rid of erofs_get_meta_page().
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYduH/BEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBNPdAP9hKomD1hRiFeCWlLA1nDXYkqGbbt6+D3HT
 cm4G7DgVBAD+O+RWv6JVYg1zAAFlKmxqEKEfoDLKI65wAjH1V/h/dQE=
 =cEam
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "In this cycle, tail-packing data inline for compressed files is now
  supported so that tail pcluster can be stored and read together with
  inode metadata in order to save data I/O and storage space.

  In addition to that, to prepare for the upcoming subpage, folio and
  fscache features, we also introduce meta buffers to get rid of
  erofs_get_meta_page() since it was too close to the page itself.

  In addition, in order to show supported kernel features and control
  sync decompression strategy, new sysfs nodes are introduced in this
  cycle as well.

  Summary:

   - add sysfs interface and a sysfs node to control sync decompression

   - add tail-packing inline support for compressed files

   - get rid of erofs_get_meta_page()"

* tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: use meta buffers for zmap operations
  erofs: use meta buffers for xattr operations
  erofs: use meta buffers for super operations
  erofs: use meta buffers for inode operations
  erofs: introduce meta buffer operations
  erofs: add on-disk compressed tail-packing inline support
  erofs: support inline data decompression
  erofs: support unaligned data decompression
  erofs: introduce z_erofs_fixup_insize
  erofs: tidy up z_erofs_lz4_decompress
  erofs: clean up erofs_map_blocks tracepoints
  erofs: Replace zero-length array with flexible-array member
  erofs: add sysfs node to control sync decompression strategy
  erofs: add sysfs interface
  erofs: rename lz4_0pading to zero_padding
2022-01-11 14:51:10 -08:00
David Howells
d7bdba1c81 9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
In 9p, afs ceph, and nfs, gfpflags_allow_blocking() (which wraps a
test for __GFP_DIRECT_RECLAIM being set) is used to determine if
->releasepage() should wait for the completion of a DIO write to fscache
with something like:

	if (folio_test_fscache(folio)) {
		if (!gfpflags_allow_blocking(gfp) || !(gfp & __GFP_FS))
			return false;
		folio_wait_fscache(folio);
	}

Instead, current_is_kswapd() should be used instead.

Note that this is based on a patch originally by Zhaoyang Huang[1].  In
addition to extending it to the other network filesystems and putting it on
top of my fscache rewrite, it also needs to include linux/swap.h in a bunch
of places.  Can current_is_kswapd() be moved to linux/mm.h?

Changes
=======
ver :
 - Dropping the changes for cifs.

Originally-signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <smfrench@gmail.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: linux-cachefs@redhat.com
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/1638952658-20285-1-git-send-email-huangzhaoyang@gmail.com/ [1]
Link: https://lore.kernel.org/r/164021590773.640689.16777975200823659231.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:27:42 +00:00
Linus Torvalds
5dfbfe71e3 fs.idmapped.v5.17
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYdRCkgAKCRCRxhvAZXjc
 olrvAQCdp8LWkT8TauJSl8wmUm3mZhNy+5+fXuCUSwe3PyUtTQEAq4fxm41JpG8u
 WCZTrrxVhaXwgUY3aWzzeQnLCZjtEQw=
 =woqV
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This contains the work to enable the idmapping infrastructure to
  support idmapped mounts of filesystems mounted with an idmapping.

  In addition this contains various cleanups that avoid repeated
  open-coding of the same functionality and simplify the code in quite a
  few places.

  We also finish the renaming of the mapping helpers we started a few
  kernel releases back and move them to a dedicated header to not
  continue polluting the fs header needlessly with low-level idmapping
  helpers. With this series the fs header only contains idmapping
  helpers that interact with fs objects.

  Currently we only support idmapped mounts for filesystems mounted
  without an idmapping themselves. This was a conscious decision
  mentioned in multiple places (cf. [1]).

  As explained at length in [3] it is perfectly fine to extend support
  for idmapped mounts to filesystem's mounted with an idmapping should
  the need arise. The need has been there for some time now (cf. [2]).

  Before we can port any filesystem that is mountable with an idmapping
  to support idmapped mounts in the coming cycles, we need to first
  extend the mapping helpers to account for the filesystem's idmapping.
  This again, is explained at length in our documentation at [3] and
  also in the individual commit messages so here's an overview.

  Currently, the low-level mapping helpers implement the remapping
  algorithms described in [3] in a simplified manner as we could rely on
  the fact that all filesystems supporting idmapped mounts are mounted
  without an idmapping.

  In contrast, filesystems mounted with an idmapping are very likely to
  not use an identity mapping and will instead use a non-identity
  mapping. So the translation step from or into the filesystem's
  idmapping in the remapping algorithm cannot be skipped for such
  filesystems.

  Non-idmapped filesystems and filesystems not supporting idmapped
  mounts are unaffected by this change as the remapping algorithms can
  take the same shortcut as before. If the low-level helpers detect that
  they are dealing with an idmapped mount but the underlying filesystem
  is mounted without an idmapping we can rely on the previous shortcut
  and can continue to skip the translation step from or into the
  filesystem's idmapping. And of course, if the low-level helpers detect
  that they are not dealing with an idmapped mount they can simply
  return the relevant id unchanged; no remapping needs to be performed
  at all.

  These checks guarantee that only the minimal amount of work is
  performed. As before, if idmapped mounts aren't used the low-level
  helpers are idempotent and no work is performed at all"

Link: 2ca4dcc4909d ("fs/mount_setattr: tighten permission checks") [1]
Link: https://github.com/containers/podman/issues/10374 [2]
Link: Documentations/filesystems/idmappings.rst [3]
Link: a65e58e791a1 ("fs: document and rename fsid helpers") [4]

* tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: support mapped mounts of mapped filesystems
  fs: add i_user_ns() helper
  fs: port higher-level mapping helpers
  fs: remove unused low-level mapping helpers
  fs: use low-level mapping helpers
  docs: update mapping documentation
  fs: account for filesystem mappings
  fs: tweak fsuidgid_has_mapping()
  fs: move mapping helpers
  fs: add is_idmapped_mnt() helper
2022-01-11 14:26:55 -08:00
David Howells
e6435f1e02 fscache: Add a tracepoint for cookie use/unuse
Add a tracepoint to track fscache_use/unuse_cookie().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164021588628.640689.12942919367404043608.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:13:01 +00:00
Jeff Layton
1702e79734 ceph: add fscache writeback support
When updating the backing store from the pagecache (a'la writepage or
writepages), write to the cache first. This allows us to keep caching
files even when they are being written, as long as we have appropriate
caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-3-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-3-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906985808.143852.1383891557313186623.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967190257.1823006.16713609520911954804.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021585020.640689.6765214932458435472.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:13:01 +00:00
Jeff Layton
400e1286c0 ceph: conversion to new fscache API
Now that the fscache API has been reworked and simplified, change ceph
over to use it.

With the old API, we would only instantiate a cookie when the file was
open for reads. Change it to instantiate the cookie when the inode is
instantiated and call use/unuse when the file is opened/closed.

Also, ensure we resize the cached data on truncates, and invalidate the
cache in response to the appropriate events. This will allow us to
plumb in write support later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:13:01 +00:00
Jan Kara
68514dacf2 select: Fix indefinitely sleeping task in poll_schedule_timeout()
A task can end up indefinitely sleeping in do_select() ->
poll_schedule_timeout() when the following race happens:

  TASK1 (thread1)             TASK2                   TASK1 (thread2)
  do_select()
    setup poll_wqueues table
    with 'fd'
                              write data to 'fd'
                                pollwake()
                                  table->triggered = 1
                                                      closes 'fd' thread1 is
                                                        waiting for
    poll_schedule_timeout()
      - sees table->triggered
      table->triggered = 0
      return -EINTR
    loop back in do_select()

But at this point when TASK1 loops back, the fdget() in the setup of
poll_wqueues fails.  So now so we never find 'fd' is ready for reading
and sleep in poll_schedule_timeout() indefinitely.

Treat an fd that got closed as a fd on which some event happened.  This
makes sure cannot block indefinitely in do_select().

Another option would be to return -EBADF in this case but that has a
potential of subtly breaking applications that excercise this behavior
and it happens to work for them.  So returning fd as active seems like a
safer choice.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-11 09:03:05 -08:00
Bob Peterson
74382e277a gfs2: dump inode object for iopen glocks
Before this patch, glock dumps would not dump the gl_object for iopen
glocks. This information can help us debug problems related to eviction:
when AN iopen glock is blocked we can see the status of its underlying
inode and its flags, etc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-01-11 16:52:44 +01:00
Dominique Martinet
19d1c32652 9p: fix enodata when reading growing file
Reading from a file that was just extended by a write, but the write had
not yet reached the server would return ENODATA as illustrated by this
command:
$ xfs_io -c 'open -ft test' -c 'w 4096 1000' -c 'r 0 1000'
wrote 1000/1000 bytes at offset 4096
1000.000000 bytes, 1 ops; 0.0001 sec (5.610 MiB/sec and 5882.3529 ops/sec)
pread: No data available

Fix this case by having netfs assume zeroes when reads from server come
short like AFS and CEPH do

Link: https://lkml.kernel.org/r/20220110111444.926753-1-asmadeus@codewreck.org
Cc: stable@vger.kernel.org
Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
Co-authored-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-11 15:21:53 +09:00
Linus Torvalds
8efd0d9c31 Networking changes for 5.17.
Core
 ----
 
  - Defer freeing TCP skbs to the BH handler, whenever possible,
    or at least perform the freeing outside of the socket lock section
    to decrease cross-CPU allocator work and improve latency.
 
  - Add netdevice refcount tracking to locate sources of netdevice
    and net namespace refcount leaks.
 
  - Make Tx watchdog less intrusive - avoid pausing Tx and restarting
    all queues from a single CPU removing latency spikes.
 
  - Various small optimizations throughout the stack from Eric Dumazet.
 
  - Make netdev->dev_addr[] constant, force modifications to go via
    appropriate helpers to allow us to keep addresses in ordered data
    structures.
 
  - Replace unix_table_lock with per-hash locks, improving performance
    of bind() calls.
 
  - Extend skb drop tracepoint with a drop reason.
 
  - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
 
 BPF
 ---
 
  - New helpers:
    - bpf_find_vma(), find and inspect VMAs for profiling use cases
    - bpf_loop(), runtime-bounded loop helper trading some execution
      time for much faster (if at all converging) verification
    - bpf_strncmp(), improve performance, avoid compiler flakiness
    - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
      for tracing programs, all inlined by the verifier
 
  - Support BPF relocations (CO-RE) in the kernel loader.
 
  - Further the support for BTF_TYPE_TAG annotations.
 
  - Allow access to local storage in sleepable helpers.
 
  - Convert verifier argument types to a composable form with different
    attributes which can be shared across types (ro, maybe-null).
 
  - Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
    creating new, extensible ones where missing and deprecating those
    to be removed.
 
 Protocols
 ---------
 
  - WiFi (mac80211/cfg80211):
    - notify user space about long "come back in N" AP responses,
      allow it to react to such temporary rejections
    - allow non-standard VHT MCS 10/11 rates
    - use coarse time in airtime fairness code to save CPU cycles
 
  - Bluetooth:
    - rework of HCI command execution serialization to use a common
      queue and work struct, and improve handling errors reported
      in the middle of a batch of commands
    - rework HCI event handling to use skb_pull_data, avoiding packet
      parsing pitfalls
    - support AOSP Bluetooth Quality Report
 
  - SMC:
    - support net namespaces, following the RDMA model
    - improve connection establishment latency by pre-clearing buffers
    - introduce TCP ULP for automatic redirection to SMC
 
  - Multi-Path TCP:
    - support ioctls: SIOCINQ, OUTQ, and OUTQNSD
    - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
      IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
    - support cmsgs: TCP_INQ
    - improvements in the data scheduler (assigning data to subflows)
    - support fastclose option (quick shutdown of the full MPTCP
      connection, similar to TCP RST in regular TCP)
 
  - MCTP (Management Component Transport) over serial, as defined by
    DMTF spec DSP0253 - "MCTP Serial Transport Binding".
 
 Driver API
 ----------
 
  - Support timestamping on bond interfaces in active/passive mode.
 
  - Introduce generic phylink link mode validation for drivers which
    don't have any quirks and where MAC capability bits fully express
    what's supported. Allow PCS layer to participate in the validation.
    Convert a number of drivers.
 
  - Add support to set/get size of buffers on the Rx rings and size of
    the tx copybreak buffer via ethtool.
 
  - Support offloading TC actions as first-class citizens rather than
    only as attributes of filters, improve sharing and device resource
    utilization.
 
  - WiFi (mac80211/cfg80211):
    - support forwarding offload (ndo_fill_forward_path)
    - support for background radar detection hardware
    - SA Query Procedures offload on the AP side
 
 New hardware / drivers
 ----------------------
 
  - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
    real-time requirements for isochronous communication with protocols
    like OPC UA Pub/Sub.
 
  - Qualcomm BAM-DMUX WWAN - driver for data channels of modems
    integrated into many older Qualcomm SoCs, e.g. MSM8916 or
    MSM8974 (qcom_bam_dmux).
 
  - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch
    driver with support for bridging, VLANs and multicast forwarding
    (lan966x).
 
  - iwlmei driver for co-operating between Intel's WiFi driver and
    Intel's Active Management Technology (AMT) devices.
 
  - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
 
  - Bluetooth:
    - MediaTek MT7921 SDIO devices
    - Foxconn MT7922A
    - Realtek RTL8852AE
 
 Drivers
 -------
 
  - Significantly improve performance in the datapaths of:
    lan78xx, ax88179_178a, lantiq_xrx200, bnxt.
 
  - Intel Ethernet NICs:
    - igb: support PTP/time PEROUT and EXTTS SDP functions on
      82580/i354/i350 adapters
    - ixgbevf: new PF -> VF mailbox API which avoids the risk of
      mailbox corruption with ESXi
    - iavf: support configuration of VLAN features of finer granularity,
      stacked tags and filtering
    - ice: PTP support for new E822 devices with sub-ns precision
    - ice: support firmware activation without reboot
 
  - Mellanox Ethernet NICs (mlx5):
    - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
    - support TC forwarding when tunnel encap and decap happen between
      two ports of the same NIC
    - dynamically size and allow disabling various features to save
      resources for running in embedded / SmartNIC scenarios
 
  - Broadcom Ethernet NICs (bnxt):
    - use page frag allocator to improve Rx performance
    - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
 
  - Other Ethernet NICs:
    - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
 
  - Microsoft cloud/virtual NIC (mana):
    - add XDP support (PASS, DROP, TX)
 
  - Mellanox Ethernet switches (mlxsw):
    - initial support for Spectrum-4 ASICs
    - VxLAN with IPv6 underlay
 
  - Marvell Ethernet switches (prestera):
    - support flower flow templates
    - add basic IP forwarding support
 
  - NXP embedded Ethernet switches (ocelot & felix):
    - support Per-Stream Filtering and Policing (PSFP)
    - enable cut-through forwarding between ports by default
    - support FDMA to improve packet Rx/Tx to CPU
 
  - Other embedded switches:
    - hellcreek: improve trapping management (STP and PTP) packets
    - qca8k: support link aggregation and port mirroring
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - qca6390, wcn6855: enable 802.11 power save mode in station mode
    - BSS color change support
    - WCN6855 hw2.1 support
    - 11d scan offload support
    - scan MAC address randomization support
    - full monitor mode, only supported on QCN9074
    - qca6390/wcn6855: report signal and tx bitrate
    - qca6390: rfkill support
    - qca6390/wcn6855: regdb.bin support
 
  - Intel WiFi (iwlwifi):
    - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
      in cooperation with the BIOS
    - support for Optimized Connectivity Experience (OCE) scan
    - support firmware API version 68
    - lots of preparatory work for the upcoming Bz device family
 
  - MediaTek WiFi (mt76):
    - Specific Absorption Rate (SAR) support
    - mt7921: 160 MHz channel support
 
  - RealTek WiFi (rtw88):
    - Specific Absorption Rate (SAR) support
    - scan offload
 
  - Other WiFi NICs
    - ath10k: support fetching (pre-)calibration data from nvmem
    - brcmfmac: configure keep-alive packet on suspend
    - wcn36xx: beacon filter support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S
 IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k
 h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O
 i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY
 AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz
 Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP
 NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF
 yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ
 0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N
 cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO
 z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J
 r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo=
 =KLm3
 -----END PGP SIGNATURE-----

Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core
  ----

   - Defer freeing TCP skbs to the BH handler, whenever possible, or at
     least perform the freeing outside of the socket lock section to
     decrease cross-CPU allocator work and improve latency.

   - Add netdevice refcount tracking to locate sources of netdevice and
     net namespace refcount leaks.

   - Make Tx watchdog less intrusive - avoid pausing Tx and restarting
     all queues from a single CPU removing latency spikes.

   - Various small optimizations throughout the stack from Eric Dumazet.

   - Make netdev->dev_addr[] constant, force modifications to go via
     appropriate helpers to allow us to keep addresses in ordered data
     structures.

   - Replace unix_table_lock with per-hash locks, improving performance
     of bind() calls.

   - Extend skb drop tracepoint with a drop reason.

   - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.

  BPF
  ---

   - New helpers:
      - bpf_find_vma(), find and inspect VMAs for profiling use cases
      - bpf_loop(), runtime-bounded loop helper trading some execution
        time for much faster (if at all converging) verification
      - bpf_strncmp(), improve performance, avoid compiler flakiness
      - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
        for tracing programs, all inlined by the verifier

   - Support BPF relocations (CO-RE) in the kernel loader.

   - Further the support for BTF_TYPE_TAG annotations.

   - Allow access to local storage in sleepable helpers.

   - Convert verifier argument types to a composable form with different
     attributes which can be shared across types (ro, maybe-null).

   - Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
     creating new, extensible ones where missing and deprecating those
     to be removed.

  Protocols
  ---------

   - WiFi (mac80211/cfg80211):
      - notify user space about long "come back in N" AP responses,
        allow it to react to such temporary rejections
      - allow non-standard VHT MCS 10/11 rates
      - use coarse time in airtime fairness code to save CPU cycles

   - Bluetooth:
      - rework of HCI command execution serialization to use a common
        queue and work struct, and improve handling errors reported in
        the middle of a batch of commands
      - rework HCI event handling to use skb_pull_data, avoiding packet
        parsing pitfalls
      - support AOSP Bluetooth Quality Report

   - SMC:
      - support net namespaces, following the RDMA model
      - improve connection establishment latency by pre-clearing buffers
      - introduce TCP ULP for automatic redirection to SMC

   - Multi-Path TCP:
      - support ioctls: SIOCINQ, OUTQ, and OUTQNSD
      - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
        IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
      - support cmsgs: TCP_INQ
      - improvements in the data scheduler (assigning data to subflows)
      - support fastclose option (quick shutdown of the full MPTCP
        connection, similar to TCP RST in regular TCP)

   - MCTP (Management Component Transport) over serial, as defined by
     DMTF spec DSP0253 - "MCTP Serial Transport Binding".

  Driver API
  ----------

   - Support timestamping on bond interfaces in active/passive mode.

   - Introduce generic phylink link mode validation for drivers which
     don't have any quirks and where MAC capability bits fully express
     what's supported. Allow PCS layer to participate in the validation.
     Convert a number of drivers.

   - Add support to set/get size of buffers on the Rx rings and size of
     the tx copybreak buffer via ethtool.

   - Support offloading TC actions as first-class citizens rather than
     only as attributes of filters, improve sharing and device resource
     utilization.

   - WiFi (mac80211/cfg80211):
      - support forwarding offload (ndo_fill_forward_path)
      - support for background radar detection hardware
      - SA Query Procedures offload on the AP side

  New hardware / drivers
  ----------------------

   - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
     real-time requirements for isochronous communication with protocols
     like OPC UA Pub/Sub.

   - Qualcomm BAM-DMUX WWAN - driver for data channels of modems
     integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974
     (qcom_bam_dmux).

   - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver
     with support for bridging, VLANs and multicast forwarding
     (lan966x).

   - iwlmei driver for co-operating between Intel's WiFi driver and
     Intel's Active Management Technology (AMT) devices.

   - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips

   - Bluetooth:
      - MediaTek MT7921 SDIO devices
      - Foxconn MT7922A
      - Realtek RTL8852AE

  Drivers
  -------

   - Significantly improve performance in the datapaths of: lan78xx,
     ax88179_178a, lantiq_xrx200, bnxt.

   - Intel Ethernet NICs:
      - igb: support PTP/time PEROUT and EXTTS SDP functions on
        82580/i354/i350 adapters
      - ixgbevf: new PF -> VF mailbox API which avoids the risk of
        mailbox corruption with ESXi
      - iavf: support configuration of VLAN features of finer
        granularity, stacked tags and filtering
      - ice: PTP support for new E822 devices with sub-ns precision
      - ice: support firmware activation without reboot

   - Mellanox Ethernet NICs (mlx5):
      - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
      - support TC forwarding when tunnel encap and decap happen between
        two ports of the same NIC
      - dynamically size and allow disabling various features to save
        resources for running in embedded / SmartNIC scenarios

   - Broadcom Ethernet NICs (bnxt):
      - use page frag allocator to improve Rx performance
      - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool

   - Other Ethernet NICs:
      - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support

   - Microsoft cloud/virtual NIC (mana):
      - add XDP support (PASS, DROP, TX)

   - Mellanox Ethernet switches (mlxsw):
      - initial support for Spectrum-4 ASICs
      - VxLAN with IPv6 underlay

   - Marvell Ethernet switches (prestera):
      - support flower flow templates
      - add basic IP forwarding support

   - NXP embedded Ethernet switches (ocelot & felix):
      - support Per-Stream Filtering and Policing (PSFP)
      - enable cut-through forwarding between ports by default
      - support FDMA to improve packet Rx/Tx to CPU

   - Other embedded switches:
      - hellcreek: improve trapping management (STP and PTP) packets
      - qca8k: support link aggregation and port mirroring

   - Qualcomm 802.11ax WiFi (ath11k):
      - qca6390, wcn6855: enable 802.11 power save mode in station mode
      - BSS color change support
      - WCN6855 hw2.1 support
      - 11d scan offload support
      - scan MAC address randomization support
      - full monitor mode, only supported on QCN9074
      - qca6390/wcn6855: report signal and tx bitrate
      - qca6390: rfkill support
      - qca6390/wcn6855: regdb.bin support

   - Intel WiFi (iwlwifi):
      - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
        in cooperation with the BIOS
      - support for Optimized Connectivity Experience (OCE) scan
      - support firmware API version 68
      - lots of preparatory work for the upcoming Bz device family

   - MediaTek WiFi (mt76):
      - Specific Absorption Rate (SAR) support
      - mt7921: 160 MHz channel support

   - RealTek WiFi (rtw88):
      - Specific Absorption Rate (SAR) support
      - scan offload

   - Other WiFi NICs
      - ath10k: support fetching (pre-)calibration data from nvmem
      - brcmfmac: configure keep-alive packet on suspend
      - wcn36xx: beacon filter support"

* tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits)
  tcp: tcp_send_challenge_ack delete useless param `skb`
  net/qla3xxx: Remove useless DMA-32 fallback configuration
  rocker: Remove useless DMA-32 fallback configuration
  hinic: Remove useless DMA-32 fallback configuration
  lan743x: Remove useless DMA-32 fallback configuration
  net: enetc: Remove useless DMA-32 fallback configuration
  cxgb4vf: Remove useless DMA-32 fallback configuration
  cxgb4: Remove useless DMA-32 fallback configuration
  cxgb3: Remove useless DMA-32 fallback configuration
  bnx2x: Remove useless DMA-32 fallback configuration
  et131x: Remove useless DMA-32 fallback configuration
  be2net: Remove useless DMA-32 fallback configuration
  vmxnet3: Remove useless DMA-32 fallback configuration
  bna: Simplify DMA setting
  net: alteon: Simplify DMA setting
  myri10ge: Simplify DMA setting
  qlcnic: Simplify DMA setting
  net: allwinner: Fix print format
  page_pool: remove spinlock in page_pool_refill_alloc_cache()
  amt: fix wrong return type of amt_send_membership_update()
  ...
2022-01-10 19:06:09 -08:00
Linus Torvalds
404dbad382 pstore update for v5.17-rc1
- Add boot param for early ftrace recording in pstore (Uwe Kleine-König)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmHV0TIWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJveID/0a/qRLfwNyjjLZ2MNtZLfL8ejs
 Qc/3wXJy7IoMe6OagbwS+boJ7PTOa3EMYh0h7GUGc8EjxFai+N2mGqEDMX/JZFRy
 zk1eWfNDpmCBrzo3VlOXrMM0Zb5cj3y9bfsLK31FcLN7gu+ivSXrA6BlPA513E+s
 SISPfK04tXJsLMBg12kKDkp8OcQ9ybKYDxUnScCFg+ef50pFdgzU6K79fUMGac8P
 M0n/ZDZRadUK8fCI2CChlOghWe09gWjWZIxstfn4qY+3C6R97ueB7IVCVgn/fbzX
 JIhsVmxjubQhd6AP/t5Ib5yzRFQ1O+p60LBGuwG8uA25DvnH3MfYhNRqpOB9R0a8
 3IygXuMcCGDl5LCBDteKrktdIKKCA8gLt0IydWD1xcVt8z7ux52/KTPyQZYYCPn4
 LvzACEL6O1qoPEod5k//BbF8p4A8JRBUPUCy55xtdQbOw/kg5H1fWg6xjGYW20Vs
 GpmuKKtDrAKX4OxIdLnaA5p3eBaF98OEo/d/H9PEUkYVc6xoPIGqwvepbt+N5Zpc
 vQ8FRqLTG+KvS3uUyXFlIdR5Rt/qwF0OoFxrMEEp0X/5aKwJHjW9v5qSkKehfQDI
 Ly4QsAaIZJx7w8FckGCBQzvTxfteMp5yivHW0ccOmnqxDLK04keoCxNSFBfP4FLQ
 4puMlfhXcL2ZO640bA==
 =Ux14
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore update from Kees Cook:

 - Add boot param for early ftrace recording in pstore (Uwe
   Kleine-König)

* tag 'pstore-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ftrace: Allow immediate recording
2022-01-10 11:48:37 -08:00
Yufan Chen
136dff3a6b ksmbd: add smb-direct shutdown
When killing ksmbd server after connecting rdma, ksmbd threads does not
terminate properly because the rdma connection is still alive.
This patch add shutdown operation to disconnect rdma connection while
ksmbd threads terminate.

Signed-off-by: Yufan Chen <wiz.chen@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Hyunchul Lee
4d02c4fdc0 ksmbd: smbd: change the default maximum read/write, receive size
Due to restriction that cannot handle multiple
buffer descriptor structures, decrease the maximum
read/write size for Windows clients.

And set the maximum fragmented receive size
in consideration of the receive queue size.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Hyunchul Lee
c9f189271c ksmbd: smbd: create MR pool
Create a memory region pool because rdma_rw_ctx_init()
uses memory registration if memory registration yields
better performance than using multiple SGE entries.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Namjae Jeon
41dbda16a0 ksmbd: add reserved room in ipc request/response
Whenever new parameter is added to smb configuration, It is possible
to break the execution of the IPC daemon by mismatch size of
request/response. This patch tries to reserve space in ipc request/response
in advance to prevent that.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Hyunchul Lee
99b7650ac5 ksmbd: smbd: call rdma_accept() under CM handler
if CONFIG_LOCKDEP is enabled, the following
kernel warning message is generated because
rdma_accept() checks whehter the handler_mutex
is held by lockdep_assert_held. CM(Connection
Manager) holds the mutex before CM handler
callback is called.

[   63.211405 ] WARNING: CPU: 1 PID: 345 at drivers/infiniband/core/cma.c:4405 rdma_accept+0x17a/0x350
[   63.212080 ] RIP: 0010:rdma_accept+0x17a/0x350
...
[   63.214036 ] Call Trace:
[   63.214098 ]  <TASK>
[   63.214185 ]  smb_direct_accept_client+0xb4/0x170 [ksmbd]
[   63.214412 ]  smb_direct_prepare+0x322/0x8c0 [ksmbd]
[   63.214555 ]  ? rcu_read_lock_sched_held+0x3a/0x70
[   63.214700 ]  ksmbd_conn_handler_loop+0x63/0x270 [ksmbd]
[   63.214826 ]  ? ksmbd_conn_alive+0x80/0x80 [ksmbd]
[   63.214952 ]  kthread+0x171/0x1a0
[   63.215039 ]  ? set_kthread_struct+0x40/0x40
[   63.215128 ]  ret_from_fork+0x22/0x30

To avoid this, move creating a queue pair and accepting
a client from transport_ops->prepare() to
smb_direct_handle_connect_request().

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Namjae Jeon
b589f5db6d ksmbd: limits exceeding the maximum allowable outstanding requests
If the client ignores the CreditResponse received from the server and
continues to send the request, ksmbd limits the requests if it exceeds
smb2 max credits.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Namjae Jeon
914d7e5709 ksmbd: move credit charge deduction under processing request
Moves the credit charge deduction from total_credits under the processing
a request. When repeating smb2 lock request and other command request,
there will be a problem that ->total_credits does not decrease.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Namjae Jeon
004443b3f6 ksmbd: add support for smb2 max credit parameter
Add smb2 max credits parameter to adjust maximum credits value to limit
number of outstanding requests.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Namjae Jeon
cb097b3dd5 ksmbd: set 445 port to smbdirect port by default
When SMB Direct is used with iWARP, Windows use 5445 port for smb direct
port, 445 port for SMB. This patch check ib_device using ib_client to
know if NICs type is iWARP or Infiniband.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Hyunchul Lee
31928a001b ksmbd: register ksmbd ib client with ib_register_client()
Register ksmbd ib client with ib_register_client() to find the rdma capable
network adapter. If ops.get_netdev(Chelsio NICs) is NULL, ksmbd will find
it using ib_device_get_by_netdev in old way.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10 12:44:19 -06:00
Theodore Ts'o
6eeaf88fd5 ext4: don't use the orphan list when migrating an inode
We probably want to remove the indirect block to extents migration
feature after a deprecation window, but until then, let's fix a
potential data loss problem caused by the fact that we put the
tmp_inode on the orphan list.  In the unlikely case where we crash and
do a journal recovery, the data blocks belonging to the inode being
migrated are also represented in the tmp_inode on the orphan list ---
and so its data blocks will get marked unallocated, and available for
reuse.

Instead, stop putting the tmp_inode on the oprhan list.  So in the
case where we crash while migrating the inode, we'll leak an inode,
which is not a disaster.  It will be easily fixed the next time we run
fsck, and it's better than potentially having blocks getting claimed
by two different files, and losing data as a result.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
2022-01-10 13:25:56 -05:00
xu xin
a2e3965df4 ext4: use BUG_ON instead of if condition followed by BUG
BUG_ON would be better.

This issue was detected with the help of Coccinelle.

Reported-by: Zeal robot <zealci@zte.com.cn>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Link: https://lore.kernel.org/r/20211228073252.580296-1-xu.xin16@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
Dan Carpenter
da9e480212 ext4: fix a copy and paste typo
This was obviously supposed to be an ext4 struct, not xfs.  GCC
doesn't care either way so it doesn't affect the build or runtime.

Fixes: cebe85d570cf ("ext4: switch to the new mount api")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211215114309.GB14552@kili
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
Luís Henriques
e81c9302a6 ext4: set csum seed in tmp inode while migrating to extents
When migrating to extents, the temporary inode will have it's own checksum
seed.  This means that, when swapping the inodes data, the inode checksums
will be incorrect.

This can be fixed by recalculating the extents checksums again.  Or simply
by copying the seed into the temporary inode.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357
Reported-by: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20211214175058.19511-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:56 -05:00
luo penghao
ae6ec194b5 ext4: remove unnecessary 'offset' assignment
Although it is in the loop, offset is reassigned at the beginning of the
while loop.  And after the loop, the value will not be used

The clang_analyzer complains as follows:

fs/ext4/dir.c:306:3 warning:

Value stored to 'offset' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Link: https://lore.kernel.org/r/20211208075307.404703-1-luo.penghao@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
luo penghao
a6dbc76c4d ext4: remove redundant o_start statement
The if will goto out of the loop, and until the end of the
function execution, o_start will not be used again.

The clang_analyzer complains as follows:

fs/ext4/move_extent.c:635:5 warning:

Value stored to 'o_start' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Link: https://lore.kernel.org/r/20211208075157.404535-1-luo.penghao@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
Adam Borowski
037e7c525d ext4: drop an always true check
EXT_FIRST_INDEX(ptr) is ptr+12, which can't possibly be null; gcc-12
warns about this.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20211115172020.57853-1-kilobyte@angband.pl
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
luo penghao
fac888b2be ext4: remove unused assignments
The eh assignment in these two places is meaningless, because the
function will goto to merge, which will not use eh.

The clang_analyzer complains as follows:

fs/ext4/extents.c:1988:4 warning:
fs/ext4/extents.c:2016:4 warning:

Value stored to 'eh' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Link: https://lore.kernel.org/r/20211104064007.2919-1-luo.penghao@zte.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
luo penghao
a660be97eb ext4: remove redundant statement
The local variable assignment at the end of the function is meaningless.

The clang_analyzer complains as follows:

fs/ext4/fast_commit.c:779:2 warning:

Value stored to 'dst' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Link: https://lore.kernel.org/r/20211104063406.2747-1-luo.penghao@zte.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:56 -05:00
Nghia Le
effc5b3b0d ext4: remove useless resetting io_end_size in mpage_process_page()
The command "make clang-analyzer" detects dead stores in
mpage_process_page() function.

Do not reset io_end_size to 0 in the current paths, as the function
exits on those paths without further using io_end_size.

Signed-off-by: Nghia Le <nghialm78@gmail.com>
Link: https://lore.kernel.org/r/20211025221803.3326-1-nghialm78@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
4a69aecbfb ext4: allow to change s_last_trim_minblks via sysfs
Ext4 has an optimization mechanism for batched disacrd (FITRIM) that
should help speed up subsequent calls of FITRIM ioctl by skipping the
groups that were previously trimmed. However because the FITRIM allows
to set the minimum size of an extent to trim, ext4 stores the last
minimum extent size and only avoids trimming the group if it was
previously trimmed with minimum extent size equal to, or smaller than
the current call.

There is currently no way to bypass the optimization without
umount/mount cycle. This becomes a problem when the file system is
live migrated to a different storage, because the optimization will
prevent possibly useful discard calls to the storage.

Fix it by exporting the s_last_trim_minblks via sysfs interface which
will allow us to set the minimum size to the number of blocks larger
than subsequent FITRIM call, effectively bypassing the optimization.

By setting the s_last_trim_minblks to ULONG_MAX the optimization will be
effectively cleared regardless of the previous state, or file system
configuration.

For example:
getconf ULONG_MAX > /sys/fs/ext4/dm-1/last_trim_minblks

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Laurent GUERBY <laurent@guerby.net>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20211103145122.17338-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
2327fb2e23 ext4: change s_last_trim_minblks type to unsigned long
There is no good reason for the s_last_trim_minblks to be atomic. There is
no data integrity needed and there is no real danger in setting and
reading it in a racy manner. Change it to be unsigned long, the same type
as s_clusters_per_group which is the maximum that's allowed.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
bbc605cdb1 ext4: implement support for get/set fs label
Implement support for FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls for
online reading and setting of file system label.

ext4_ioctl_getlabel() is simple, just get the label from the primary
superblock. This might not be the first sb on the file system if
'sb=' mount option is used.

In ext4_ioctl_setlabel() we update what ext4 currently views as a
primary superblock and then proceed to update backup superblocks. There
are two caveats:
 - the primary superblock might not be the first superblock and so it
   might not be the one used by userspace tools if read directly
   off the disk.
 - because the primary superblock might not be the first superblock we
   potentialy have to update it as part of backup superblock update.
   However the first sb location is a bit more complicated than the rest
   so we have to account for that.

The superblock modification is created generic enough so the
infrastructure can be used for other potential superblock modification
operations, such as chaning UUID.

Tested with generic/492 with various configurations. I also checked the
behavior with 'sb=' mount options, including very large file systems
with and without sparse_super/sparse_super2.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211213135618.43303-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
4c1bd5a90c ext4: only set EXT4_MOUNT_QUOTA when journalled quota file is specified
Only set EXT4_MOUNT_QUOTA when journalled quota file is specified,
otherwise simply disabling specific quota type (usrjquota=) will also
set the EXT4_MOUNT_QUOTA super block option.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Fixes: e6e268cb6822 ("ext4: move quota configuration out of handle_mount_opt()")
Link: https://lore.kernel.org/r/20220104143518.134465-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Lukas Czerner
13b215a9e6 ext4: don't use kfree() on rcu protected pointer sbi->s_qf_names
During ext4 mount api rework the commit e6e268cb6822 ("ext4: move quota
configuration out of handle_mount_opt()") introduced a bug where we
would kfree(sbi->s_qf_names[i]) before assigning the new quota name in
ext4_apply_quota_options().

This is wrong because we're using kfree() on rcu prointer that could be
simultaneously accessed from ext4_show_quota_options() during remount.
Fix it by using rcu_replace_pointer() to replace the old qname with the
new one and then kfree_rcu() the old quota name.

Also use get_qf_name() instead of sbi->s_qf_names in strcmp() to silence
the sparse warning.

Fixes: e6e268cb6822 ("ext4: move quota configuration out of handle_mount_opt()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220104143518.134465-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Jan Kara
173b6e383d ext4: avoid trim error on fs with small groups
A user reported FITRIM ioctl failing for him on ext4 on some devices
without apparent reason.  After some debugging we've found out that
these devices (being LVM volumes) report rather large discard
granularity of 42MB and the filesystem had 1k blocksize and thus group
size of 8MB. Because ext4 FITRIM implementation puts discard
granularity into minlen, ext4_trim_fs() declared the trim request as
invalid. However just silently doing nothing seems to be a more
appropriate reaction to such combination of parameters since user did
not specify anything wrong.

CC: Lukas Czerner <lczerner@redhat.com>
Fixes: 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211112152202.26614-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Zhang Yi
5c48a7df91 ext4: fix an use-after-free issue about data=journal writeback mode
Our syzkaller report an use-after-free issue that accessing the freed
buffer_head on the writeback page in __ext4_journalled_writepage(). The
problem is that if there was a truncate racing with the data=journalled
writeback procedure, the writeback length could become zero and
bget_one() refuse to get buffer_head's refcount, then the truncate
procedure release buffer once we drop page lock, finally, the last
ext4_walk_page_buffers() trigger the use-after-free problem.

sync                               truncate
ext4_sync_file()
 file_write_and_wait_range()
                                   ext4_setattr(0)
                                    inode->i_size = 0
  ext4_writepage()
   len = 0
   __ext4_journalled_writepage()
    page_bufs = page_buffers(page)
    ext4_walk_page_buffers(bget_one) <- does not get refcount
                                    do_invalidatepage()
                                      free_buffer_head()
    ext4_walk_page_buffers(page_bufs) <- trigger use-after-free

After commit bdf96838aea6 ("ext4: fix race between truncate and
__ext4_journalled_writepage()"), we have already handled the racing
case, so the bget_one() and bput_one() are not needed. So this patch
simply remove these hunk, and recheck the i_size to make it safe.

Fixes: bdf96838aea6 ("ext4: fix race between truncate and __ext4_journalled_writepage()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211225090937.712867-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Ye Bin
298b5c5217 ext4: fix null-ptr-deref in '__ext4_journal_ensure_credits'
We got issue as follows when run syzkaller test:
[ 1901.130043] EXT4-fs error (device vda): ext4_remount:5624: comm syz-executor.5: Abort forced by user
[ 1901.130901] Aborting journal on device vda-8.
[ 1901.131437] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.16: Detected aborted journal
[ 1901.131566] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.11: Detected aborted journal
[ 1901.132586] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.18: Detected aborted journal
[ 1901.132751] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.9: Detected aborted journal
[ 1901.136149] EXT4-fs error (device vda) in ext4_reserve_inode_write:6035: Journal has aborted
[ 1901.136837] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-fuzzer: Detected aborted journal
[ 1901.136915] ==================================================================
[ 1901.138175] BUG: KASAN: null-ptr-deref in __ext4_journal_ensure_credits+0x74/0x140 [ext4]
[ 1901.138343] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.13: Detected aborted journal
[ 1901.138398] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.1: Detected aborted journal
[ 1901.138808] Read of size 8 at addr 0000000000000000 by task syz-executor.17/968
[ 1901.138817]
[ 1901.138852] EXT4-fs error (device vda): ext4_journal_check_start:61: comm syz-executor.30: Detected aborted journal
[ 1901.144779] CPU: 1 PID: 968 Comm: syz-executor.17 Not tainted 4.19.90-vhulk2111.1.0.h893.eulerosv2r10.aarch64+ 
[ 1901.146479] Hardware name: linux,dummy-virt (DT)
[ 1901.147317] Call trace:
[ 1901.147552]  dump_backtrace+0x0/0x2d8
[ 1901.147898]  show_stack+0x28/0x38
[ 1901.148215]  dump_stack+0xec/0x15c
[ 1901.148746]  kasan_report+0x108/0x338
[ 1901.149207]  __asan_load8+0x58/0xb0
[ 1901.149753]  __ext4_journal_ensure_credits+0x74/0x140 [ext4]
[ 1901.150579]  ext4_xattr_delete_inode+0xe4/0x700 [ext4]
[ 1901.151316]  ext4_evict_inode+0x524/0xba8 [ext4]
[ 1901.151985]  evict+0x1a4/0x378
[ 1901.152353]  iput+0x310/0x428
[ 1901.152733]  do_unlinkat+0x260/0x428
[ 1901.153056]  __arm64_sys_unlinkat+0x6c/0xc0
[ 1901.153455]  el0_svc_common+0xc8/0x320
[ 1901.153799]  el0_svc_handler+0xf8/0x160
[ 1901.154265]  el0_svc+0x10/0x218
[ 1901.154682] ==================================================================

This issue may happens like this:
	Process1                               Process2
ext4_evict_inode
  ext4_journal_start
   ext4_truncate
     ext4_ind_truncate
       ext4_free_branches
         ext4_ind_truncate_ensure_credits
	   ext4_journal_ensure_credits_fn
	     ext4_journal_restart
	       handle->h_transaction = NULL;
                                           mount -o remount,abort  /mnt
					   -> trigger JBD abort
               start_this_handle -> will return failed
  ext4_xattr_delete_inode
    ext4_journal_ensure_credits
      ext4_journal_ensure_credits_fn
        __ext4_journal_ensure_credits
	  jbd2_handle_buffer_credits
	    journal = handle->h_transaction->t_journal; ->null-ptr-deref

Now, indirect truncate process didn't handle error. To solve this issue
maybe simply add check handle is abort in '__ext4_journal_ensure_credits'
is enough, and i also think this is necessary.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211224100341.3299128-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Harshad Shirwadkar
c27c29c6af ext4: initialize err_blk before calling __ext4_get_inode_loc
It is not guaranteed that __ext4_get_inode_loc will definitely set
err_blk pointer when it returns EIO. To avoid using uninitialized
variables, let's first set err_blk to 0.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211201163421.2631661-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:55 -05:00
Chunguang Xu
8c80fb312d ext4: fix a possible ABBA deadlock due to busy PA
We found on older kernel (3.10) that in the scenario of insufficient
disk space, system may trigger an ABBA deadlock problem, it seems that
this problem still exists in latest kernel, try to fix it here. The
main process triggered by this problem is that task A occupies the PA
and waits for the jbd2 transaction finish, the jbd2 transaction waits
for the completion of task B's IO (plug_list), but task B waits for
the release of PA by task A to finish discard, which indirectly forms
an ABBA deadlock. The related calltrace is as follows:

    Task A
    vfs_write
    ext4_mb_new_blocks()
    ext4_mb_mark_diskspace_used()       JBD2
    jbd2_journal_get_write_access()  -> jbd2_journal_commit_transaction()
  ->schedule()                          filemap_fdatawait()
 |                                              |
 | Task B                                       |
 | do_unlinkat()                                |
 | ext4_evict_inode()                           |
 | jbd2_journal_begin_ordered_truncate()        |
 | filemap_fdatawrite_range()                   |
 | ext4_mb_new_blocks()                         |
  -ext4_mb_discard_group_preallocations() <-----

Here, try to cancel ext4_mb_discard_group_preallocations() internal
retry due to PA busy, and do a limited number of retries inside
ext4_mb_discard_preallocations(), which can circumvent the above
problems, but also has some advantages:

1. Since the PA is in a busy state, if other groups have free PAs,
   keeping the current PA may help to reduce fragmentation.
2. Continue to traverse forward instead of waiting for the current
   group PA to be released. In most scenarios, the PA discard time
   can be reduced.

However, in the case of smaller free space, if only a few groups have
space, then due to multiple traversals of the group, it may increase
CPU overhead. But in contrast, I feel that the overall benefit is
better than the cost.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1637630277-23496-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:55 -05:00
Qing Wang
dfac1a1670 ext4: replace snprintf in show functions with sysfs_emit
coccicheck complains about the use of snprintf() in sysfs show functions.

Fix the coccicheck warning:
WARNING: use scnprintf or sprintf.

Use sysfs_emit instead of scnprintf or sprintf makes more sense.

Signed-off-by: Qing Wang <wangqing@vivo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1634095731-4528-1-git-send-email-wangqing@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Jan Kara
4013d47a53 ext4: make sure to reset inode lockdep class when quota enabling fails
When we succeed in enabling some quota type but fail to enable another
one with quota feature, we correctly disable all enabled quota types.
However we forget to reset i_data_sem lockdep class. When the inode gets
freed and reused, it will inherit this lockdep class (i_data_sem is
initialized only when a slab is created) and thus eventually lockdep
barfs about possible deadlocks.

Reported-and-tested-by: syzbot+3b6f9218b1301ddda3e2@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20211007155336.12493-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:55 -05:00
Jan Kara
15fc69bbbb ext4: make sure quota gets properly shutdown on error
When we hit an error when enabling quotas and setting inode flags, we do
not properly shutdown quota subsystem despite returning error from
Q_QUOTAON quotactl. This can lead to some odd situations like kernel
using quota file while it is still writeable for userspace. Make sure we
properly cleanup the quota subsystem in case of error.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20211007155336.12493-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:54 -05:00
Ye Bin
380a0091ca ext4: Fix BUG_ON in ext4_bread when write quota data
We got issue as follows when run syzkaller:
[  167.936972] EXT4-fs error (device loop0): __ext4_remount:6314: comm rep: Abort forced by user
[  167.938306] EXT4-fs (loop0): Remounting filesystem read-only
[  167.981637] Assertion failure in ext4_getblk() at fs/ext4/inode.c:847: '(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) || handle != NULL || create == 0'
[  167.983601] ------------[ cut here ]------------
[  167.984245] kernel BUG at fs/ext4/inode.c:847!
[  167.984882] invalid opcode: 0000 [] PREEMPT SMP KASAN PTI
[  167.985624] CPU: 7 PID: 2290 Comm: rep Tainted: G    B             5.16.0-rc5-next-20211217+ 
[  167.986823] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  167.988590] RIP: 0010:ext4_getblk+0x17e/0x504
[  167.989189] Code: c6 01 74 28 49 c7 c0 a0 a3 5c 9b b9 4f 03 00 00 48 c7 c2 80 9c 5c 9b 48 c7 c6 40 b6 5c 9b 48 c7 c7 20 a4 5c 9b e8 77 e3 fd ff <0f> 0b 8b 04 244
[  167.991679] RSP: 0018:ffff8881736f7398 EFLAGS: 00010282
[  167.992385] RAX: 0000000000000094 RBX: 1ffff1102e6dee75 RCX: 0000000000000000
[  167.993337] RDX: 0000000000000001 RSI: ffffffff9b6e29e0 RDI: ffffed102e6dee66
[  167.994292] RBP: ffff88816a076210 R08: 0000000000000094 R09: ffffed107363fa09
[  167.995252] R10: ffff88839b1fd047 R11: ffffed107363fa08 R12: ffff88816a0761e8
[  167.996205] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000001
[  167.997158] FS:  00007f6a1428c740(0000) GS:ffff88839b000000(0000) knlGS:0000000000000000
[  167.998238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  167.999025] CR2: 00007f6a140716c8 CR3: 0000000133216000 CR4: 00000000000006e0
[  167.999987] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.000944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.001899] Call Trace:
[  168.002235]  <TASK>
[  168.007167]  ext4_bread+0xd/0x53
[  168.007612]  ext4_quota_write+0x20c/0x5c0
[  168.010457]  write_blk+0x100/0x220
[  168.010944]  remove_free_dqentry+0x1c6/0x440
[  168.011525]  free_dqentry.isra.0+0x565/0x830
[  168.012133]  remove_tree+0x318/0x6d0
[  168.014744]  remove_tree+0x1eb/0x6d0
[  168.017346]  remove_tree+0x1eb/0x6d0
[  168.019969]  remove_tree+0x1eb/0x6d0
[  168.022128]  qtree_release_dquot+0x291/0x340
[  168.023297]  v2_release_dquot+0xce/0x120
[  168.023847]  dquot_release+0x197/0x3e0
[  168.024358]  ext4_release_dquot+0x22a/0x2d0
[  168.024932]  dqput.part.0+0x1c9/0x900
[  168.025430]  __dquot_drop+0x120/0x190
[  168.025942]  ext4_clear_inode+0x86/0x220
[  168.026472]  ext4_evict_inode+0x9e8/0xa22
[  168.028200]  evict+0x29e/0x4f0
[  168.028625]  dispose_list+0x102/0x1f0
[  168.029148]  evict_inodes+0x2c1/0x3e0
[  168.030188]  generic_shutdown_super+0xa4/0x3b0
[  168.030817]  kill_block_super+0x95/0xd0
[  168.031360]  deactivate_locked_super+0x85/0xd0
[  168.031977]  cleanup_mnt+0x2bc/0x480
[  168.033062]  task_work_run+0xd1/0x170
[  168.033565]  do_exit+0xa4f/0x2b50
[  168.037155]  do_group_exit+0xef/0x2d0
[  168.037666]  __x64_sys_exit_group+0x3a/0x50
[  168.038237]  do_syscall_64+0x3b/0x90
[  168.038751]  entry_SYSCALL_64_after_hwframe+0x44/0xae

In order to reproduce this problem, the following conditions need to be met:
1. Ext4 filesystem with no journal;
2. Filesystem image with incorrect quota data;
3. Abort filesystem forced by user;
4. umount filesystem;

As in ext4_quota_write:
...
         if (EXT4_SB(sb)->s_journal && !handle) {
                 ext4_msg(sb, KERN_WARNING, "Quota write (off=%llu, len=%llu)"
                         " cancelled because transaction is not started",
                         (unsigned long long)off, (unsigned long long)len);
                 return -EIO;
         }
...
We only check handle if NULL when filesystem has journal. There is need
check handle if NULL even when filesystem has no journal.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211223015506.297766-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Sebastian Andrzej Siewior
ab047d516d ext4: destroy ext4_fc_dentry_cachep kmemcache on module removal
The kmemcache for ext4_fc_dentry_cachep remains registered after module
removal.

Destroy ext4_fc_dentry_cachep kmemcache on module removal.

Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211110134640.lyku5vklvdndw6uk@linutronix.de
Link: https://lore.kernel.org/r/YbiK3JetFFl08bd7@linutronix.de
Link: https://lore.kernel.org/r/20211223164436.2628390-1-bigeasy@linutronix.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Xin Yin
9725958bb7 ext4: fast commit may miss tracking unwritten range during ftruncate
If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the
inode->i_size will not include the unwritten range. When call
ftruncate with fast commit enabled, it will miss to track the
unwritten range.

Change to trace the full range during ftruncate.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Xin Yin
0b5b5a62b9 ext4: use ext4_ext_remove_space() for fast commit replay delete range
For now ,we use ext4_punch_hole() during fast commit replay delete range
procedure. But it will be affected by inode->i_size, which may not
correct during fast commit replay procedure. The following test will
failed.

-create & write foo (len 1000K)
-falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
-create & fsync bar
-falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
-fsync foo
-crash before a full commit

After the fast_commit reply procedure, the range 400K-500K will not be
removed. Because in this case, when calling ext4_punch_hole() the
inode->i_size is 0, and it just retruns with doing nothing.

Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
to remove blocks of inode directly.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00