7257 Commits

Author SHA1 Message Date
Linus Torvalds
7001052160 Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a
coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
 where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
 
 Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
 limited to 2 instructions (and typically fewer) on branch targets not starting
 with ENDBR. CET-IBT also limits speculation of the next sequential instruction
 after the indirect CALL/JMP [1].
 
 CET-IBT is fundamentally incompatible with retpolines, but provides, as
 described above, speculation limits itself.
 
 [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
 1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
 urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
 BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
 y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
 go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
 9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
 ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
 x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
 dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
 EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
 beB/7u2KIUbKEkSN
 =jZfK
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
 "Add support for Intel CET-IBT, available since Tigerlake (11th gen),
  which is a coarse grained, hardware based, forward edge
  Control-Flow-Integrity mechanism where any indirect CALL/JMP must
  target an ENDBR instruction or suffer #CP.

  Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
  is limited to 2 instructions (and typically fewer) on branch targets
  not starting with ENDBR. CET-IBT also limits speculation of the next
  sequential instruction after the indirect CALL/JMP [1].

  CET-IBT is fundamentally incompatible with retpolines, but provides,
  as described above, speculation limits itself"

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html

* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  kvm/emulate: Fix SETcc emulation for ENDBR
  x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
  x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
  kbuild: Fixup the IBT kbuild changes
  x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
  x86: Remove toolchain check for X32 ABI capability
  x86/alternative: Use .ibt_endbr_seal to seal indirect calls
  objtool: Find unused ENDBR instructions
  objtool: Validate IBT assumptions
  objtool: Add IBT/ENDBR decoding
  objtool: Read the NOENDBR annotation
  x86: Annotate idtentry_df()
  x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
  x86: Annotate call_on_stack()
  objtool: Rework ASM_REACHABLE
  x86: Mark __invalid_creds() __noreturn
  exit: Mark do_group_exit() __noreturn
  x86: Mark stop_this_cpu() __noreturn
  objtool: Ignore extra-symbol code
  objtool: Rename --duplicate to --lto
  ...
2022-03-27 10:17:23 -07:00
Linus Torvalds
b1b07ba356 New code for 5.18:
- Fix some incorrect mapping state being passed to iomap during COW
  - Don't create bogus selinux audit messages when deciding to degrade
    gracefully due to lack of privilege
  - Fix setattr implementation to use VFS helpers so that we drop setgid
    consistently with the other filesystems
  - Fix link/unlink/rename to check quota limits
  - Constify xfs_name_dotdot to prevent abuse of in-kernel symbols
  - Fix log livelock between the AIL and inodegc threads during recovery
  - Fix a log stall when the AIL races with pushers
  - Fix stalls in CIL flushes due to pinned inode cluster buffers during
    recovery
  - Fix log corruption due to incorrect usage of xfs_is_shutdown vs
    xlog_is_shutdown because during an induced fs shutdown, AIL writeback
    must continue until the log is shut down, even if the filesystem has
    already shut down
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmI3UQ8ACgkQ+H93GTRK
 tOttpBAAjz05QkClIYKlHREiWt4a2a0FnRBvkz2fvZb0Nk3kST/drJ14VAb1Q3g1
 HvvOGwssyJShkbd6DBGTqrp9AzZ0mfhSYPpARtCNwOvUNV4tXwLdiBZ86H4/l3qy
 seS1tXW4pUKm6xhN9fnMw1sk4oJmPXq67JW+1h35oDaE6tpdd9lehFMpMxaTMjED
 4WsU0UwKCWr4l1lKP1EXvh+ENJeo0QZpccNHV3ChLnWx0mPMRpf4tQP2A01oBPkh
 oXHSms0TV7FHDIv+Zil3kBgkfh266WhcPdZ4pfIqyQc51LzbgrxEqcrZHGI5XJjB
 zcQd8RkzOI68XIDchijOorOkQZmDEDIGlCgSY6q/JV4N2iFyF84hHedk2le5j97l
 Jmout2Xng3bgstl4963IjXk8SvPTebnI76a62XoEcHnf4KBROLKIU7wNCh5LXNvk
 CI6AWJGy6EAGEc3BHyPFZxZF72D9rUmRbPIJBc7rBhJgMoIy4L4sFlvVtyppbERb
 gMH9PNTLjY5/abQkV044iLOsEDh5FEPBIehoaENNo252vj2R/WsAAQL13l0cMsrN
 vAryGdAEelZEyg62k+HT4W87zjH0Kgtgli6zMdx7akYdjKbMOIZALEu61iBH7KT+
 heAz8pU/krTy+Q583XU13eCWnY1wPrHRMwdGF+i4WUGiKuJSoYg=
 =uHT0
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "The biggest change this cycle is bringing XFS' inode attribute setting
  code back towards alignment with what the VFS does. IOWs, setgid bit
  handling should be a closer match with ext4 and btrfs behavior.

  The rest of the branch is bug fixes around the filesystem -- patching
  gaps in quota enforcement, removing bogus selinux audit messages, and
  fixing log corruption and problems with log recovery. There will be a
  second pull request later on in the merge window with more bug fixes.

  Dave Chinner will be taking over as XFS maintainer for one release
  cycle, starting from the day 5.18-rc1 drops until 5.19-rc1 is tagged
  so that I can focus on starting a massive design review for the
  (feature complete after five years) online repair feature.

  Summary:

   - Fix some incorrect mapping state being passed to iomap during COW

   - Don't create bogus selinux audit messages when deciding to degrade
     gracefully due to lack of privilege

   - Fix setattr implementation to use VFS helpers so that we drop
     setgid consistently with the other filesystems

   - Fix link/unlink/rename to check quota limits

   - Constify xfs_name_dotdot to prevent abuse of in-kernel symbols

   - Fix log livelock between the AIL and inodegc threads during
     recovery

   - Fix a log stall when the AIL races with pushers

   - Fix stalls in CIL flushes due to pinned inode cluster buffers
     during recovery

   - Fix log corruption due to incorrect usage of xfs_is_shutdown vs
     xlog_is_shutdown because during an induced fs shutdown, AIL
     writeback must continue until the log is shut down, even if the
     filesystem has already shut down"

* tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
  xfs: AIL should be log centric
  xfs: log items should have a xlog pointer, not a mount
  xfs: async CIL flushes need pending pushes to be made stable
  xfs: xfs_ail_push_all_sync() stalls when racing with updates
  xfs: check buffer pin state after locking in delwri_submit
  xfs: log worker needs to start before intent/unlink recovery
  xfs: constify xfs_name_dotdot
  xfs: constify the name argument to various directory functions
  xfs: reserve quota for target dir expansion when renaming files
  xfs: reserve quota for dir expansion when linking/unlinking files
  xfs: refactor user/group quota chown in xfs_setattr_nonsize
  xfs: use setattr_copy to set vfs inode attributes
  xfs: don't generate selinux audit messages for capability testing
  xfs: add missing cmap->br_state = XFS_EXT_NORM update
2022-03-24 18:28:01 -07:00
Linus Torvalds
3ce62cf4dc flexible-array transformations for 5.18-rc1
Hi Linus,
 
 Please, pull the following treewide patch that replaces zero-length arrays with
 flexible-array members. This patch has been baking in linux-next for a
 whole development cycle.
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmI6GIUACgkQRwW0y0cG
 2zFLWw/+OB1gZeQD3boKpUMntWnn6wjhUxdrO8CYkpzG+B+8TFECXNjy8HV1CSiw
 GKKRndYELOyYaD5o/F2vtPe10iPHbrdIlMFRPBRoht0/cvSZgzHlfT8EjWQwerYY
 dieztUFKjeSj0MXivdNDnKOTm8o9cz8KmCrWFP+My37Fasn/9+nBX8iNVIvAX4xy
 T+IVmjtDifQUsTs298UGnBvDeuZOiGHhXXU5rq6lIX0Rl554OsWZW94d6jUPj/h7
 t1v6jdojNuyaMKn45/xnPj9VvmDiSu3K67m3fjRdzLPDOhISjr2fw4KEUOKdsebh
 yJ9t5u8IufyPbm9kyI+rZt+T8ZlV2/qt2+mt6QgtDMnWrs+4nU15JY0SHImMSBZQ
 rBEZcQlrIcGJ+CsNB8Y7jIGYO0SSkhodAvfl0LRA0AbTqLGqq0OkAQS5D52r3H2r
 uz6xdYb7kG43XaRyaAIPqhZsp/jk2NrXvEvin2tSaXZFR1cxp+oxcV2UajmnOU6i
 EIBS4PzJnYx2RZRa+h8YbBa/+D4N6+fj/tjmwBawiUBPjjaLAsGFNwUHqvBoD05S
 bk6oXi654NBwVjsknZ0grVz0TtSvdZ3uJL5FZApTOHITqH8vlxlNefmHri4vZRZO
 NN7NIQ0yaUCnorzMg+vP8ZtflhQwrMJbjwIS9YD0RHd7MBhYX8k=
 =xZD2
 -----END PGP SIGNATURE-----

Merge tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull flexible-array transformations from Gustavo Silva:
 "Treewide patch that replaces zero-length arrays with flexible-array
  members.

  This has been baking in linux-next for a whole development cycle"

* tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  treewide: Replace zero-length arrays with flexible-array members
2022-03-24 11:39:32 -07:00
Linus Torvalds
6b1f86f8e9 Filesystem folio changes for 5.18
Primarily this series converts some of the address_space operations
 to take a folio instead of a page.
 
 ->is_partially_uptodate() takes a folio instead of a page and changes the
 type of the 'from' and 'count' arguments to make it obvious they're bytes.
 ->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
 ->launder_page() becomes ->launder_folio()
 ->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
 an argument.
 
 There are a couple of other misc changes up front that weren't worth
 separating into their own pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
 gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
 BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
 SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
 EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
 omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
 Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
 =cOiq
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache

Pull filesystem folio updates from Matthew Wilcox:
 "Primarily this series converts some of the address_space operations to
  take a folio instead of a page.

  Notably:

   - a_ops->is_partially_uptodate() takes a folio instead of a page and
     changes the type of the 'from' and 'count' arguments to make it
     obvious they're bytes.

   - a_ops->invalidatepage() becomes ->invalidate_folio() and has a
     similar type change.

   - a_ops->launder_page() becomes ->launder_folio()

   - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
     address_space as an argument.

  There are a couple of other misc changes up front that weren't worth
  separating into their own pull request"

* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
  fs: Remove aops ->set_page_dirty
  fb_defio: Use noop_dirty_folio()
  fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
  fs: Convert __set_page_dirty_buffers to block_dirty_folio
  nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
  mm: Convert swap_set_page_dirty() to swap_dirty_folio()
  ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
  f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
  f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
  f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
  afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
  btrfs: Convert extent_range_redirty_for_io() to use folios
  fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
  btrfs: Convert from set_page_dirty to dirty_folio
  fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
  fs: Add aops->dirty_folio
  fs: Remove aops->launder_page
  orangefs: Convert launder_page to launder_folio
  nfs: Convert from launder_page to launder_folio
  fuse: Convert from launder_page to launder_folio
  ...
2022-03-22 18:26:56 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Hugh Dickins
b698f0a177 mm/fs: delete PF_SWAPWRITE
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE.  But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:08 -07:00
Muchun Song
fd60b28842 fs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>		[ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
NeilBrown
b9b1335e64 remove bdi_congested() and wb_congested() and related functions
These functions are no longer useful as no BDIs report congestions any
more.

Removing the test on bdi_write_contested() in current_may_throttle()
could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
is set.

So replace the calls by 'false' and simplify the code - and remove the
functions.

[akpm@linux-foundation.org: fix build]

Link: https://lkml.kernel.org/r/164549983742.9187.2570198746005819592.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>	[nilfs]
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:01 -07:00
Linus Torvalds
616355cc81 for-5.18/block-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0+GcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprUpD/9aTJEnj7VCw7UouSsg098sdjtoy9ilslU3
 ew47K8CIXHbCB4CDqLnFyvCwAdG1XGgS+fUmFAxvTr29R9SZeS5d+bXL6sZzEo0C
 bwxsJy9MM2QRtMvB+giAt1myXbwB8cG+ketMBWXqwXXRHRzPbbQfMZia7FqWMnfY
 KQanH9IwYHp1oa5U/W6Qcjm4oCnLgBMRwqByzUCtiF3y9qgaLkK+3IgkNwjJQjLA
 DTeUJ/9CgxGQQbzA+LPktbw2xfTqiUfcKq0mWx6Zt4wwNXn1ClqUDUXX6QSM8/5u
 3OimbscSkEPPTIYZbVBPkhFnAlQb4JaJEgOrbXvYKVV2Dh+eZY81XwNeE/E8gdBY
 TnHOTOCjkN/4sR3hIrWazlJzPLdpPA0eOYrhguCraQsX9mcsYNxlJ9otRv/Ve99g
 uqL0RZg3+NoK84fm79FCGy/ZmPQJvJttlBT9CKVwylv/Lky42xWe7AdM3OipKluY
 2nh+zN5Ai7WxZdTKXQFRhCSWfWQ+1qW51tB3dcGW+BooZr/oox47qKQVcHsEWbq1
 RNR45F5a4AuPwYUHF/P36WviLnEuq9AvX7OTTyYOplyVQohKIoDXp9chVzLNzBiZ
 KBR00W6MLKKKN+8foalQWgNyb2i2PH7Ib4xRXvXj/22Vwxg5UmUoBmSDSas9SZUS
 +dMo7CtNgA==
 =DpgP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)

 - blk-rq-qos completion fix (Tejun)

 - blk-cgroup merge fix (Tejun)

 - Add offline error return value to distinguish it from an IO error on
   the device (Song)

 - IO stats fixes (Zhang, Christoph)

 - blkcg refcount fixes (Ming, Yu)

 - Fix for indefinite dispatch loop softlockup (Shin'ichiro)

 - blk-mq hardware queue management improvements (Ming)

 - sbitmap dead code removal (Ming, John)

 - Plugging merge improvements (me)

 - Show blk-crypto capabilities in sysfs (Eric)

 - Multiple delayed queue run improvement (David)

 - Block throttling fixes (Ming)

 - Start deprecating auto module loading based on dev_t (Christoph)

 - bio allocation improvements (Christoph, Chaitanya)

 - Get rid of bio_devname (Christoph)

 - bio clone improvements (Christoph)

 - Block plugging improvements (Christoph)

 - Get rid of genhd.h header (Christoph)

 - Ensure drivers use appropriate flush helpers (Christoph)

 - Refcounting improvements (Christoph)

 - Queue initialization and teardown improvements (Ming, Christoph)

 - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng,
   Lukas, Nian, Yang, Eric, Chengming)

* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
  block: cancel all throttled bios in del_gendisk()
  block: let blkcg_gq grab request queue's refcnt
  block: avoid use-after-free on throttle data
  block: limit request dispatch loop duration
  block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
  sr: simplify the local variable initialization in sr_block_open()
  block: don't merge across cgroup boundaries if blkcg is enabled
  block: fix rq-qos breakage from skipping rq_qos_done_bio()
  block: flush plug based on hardware and software queue order
  block: ensure plug merging checks the correct queue at least once
  block: move rq_qos_exit() into disk_release()
  block: do more work in elevator_exit
  block: move blk_exit_queue into disk_release
  block: move q_usage_counter release into blk_queue_release
  block: don't remove hctx debugfs dir from blk_mq_exit_queue
  block: move blkcg initialization/destroy into disk allocation/release handler
  sr: implement ->free_disk to simplify refcounting
  sd: implement ->free_disk to simplify refcounting
  sd: delay calling free_opal_dev
  sd: call sd_zbc_release_disk before releasing the scsi_device reference
  ...
2022-03-21 16:48:55 -07:00
Dave Chinner
01728b44ef xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
I've been chasing a recent resurgence in generic/388 recovery
failure and/or corruption events. The events have largely been
uninitialised inode chunks being tripped over in log recovery
such as:

 XFS (pmem1): User initiated shutdown received.
 pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
 XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
 XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
 XFS (pmem1): Unmounting Filesystem
 XFS (pmem1): Mounting V5 Filesystem
 XFS (pmem1): Starting recovery (logdev: internal)
 XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
 XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
 XFS (pmem1): Unmount and run xfs_repair
 XFS (pmem1): First 128 bytes of corrupted metadata buffer:
 00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
 XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
 XFS (pmem1): log mount/recovery failed: error -117
 XFS (pmem1): log mount failed

There have been isolated random other issues, too - xfs_repair fails
because it finds some corruption in symlink blocks, rmap
inconsistencies, etc - but they are nowhere near as common as the
uninitialised inode chunk failure.

The problem has clearly happened at runtime before recovery has run;
I can see the ICREATE log item in the log shortly before the
actively recovered range of the log. This means the ICREATE was
definitely created and written to the log, but for some reason the
tail of the log has been moved past the ordered buffer log item that
tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
log moving past the ICREATE log item before the inode chunk buffer
is written to disk.

Tracing the fsstress processes that are running when the filesystem
shut down immediately pin-pointed the problem:

user shutdown marks xfs_mount as shutdown

         godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
.....

aild tries to push ordered inode cluster buffer

  xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
  xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae

xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
calls xfs_iflush_abort() to kill writeback of the inode.
Inode is removed from AIL, drops cluster buffer reference.

  xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
  xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7

.....

All inodes on cluster buffer are aborted, then the cluster buffer
itself is aborted and removed from the AIL *without writeback*:

xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
   xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL

The inode buffer was at 7/20344 when it was removed from the AIL.

   xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
   xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39

.....

Userspace is still running, doing stuff. an fsstress process runs
syncfs() or sync() and we end up in sync_fs_one_sb() which issues
a log force. This pushes on the CIL:

        fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
        fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
        fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
           <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f

And the CIL fills up iclogs with pending changes. This picks up
the current tail from the AIL:

           <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
           <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
           <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
           <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448

And it moves the tail of the log to 7/21000 from 7/20344. This
*moves the tail of the log beyond the ICREATE transaction* that was
at 7/20344 and pinned by the inode cluster buffer that was cancelled
above.

....

         godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
          godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
          godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/

And finally the log itself is now shutdown, stopping all further
writes to the log. But this is too late to prevent the corruption
that moving the tail of the log forwards after we start cancelling
writeback causes.

The fundamental problem here is that we are using the wrong shutdown
checks for log items. We've long conflated mount shutdown with log
shutdown state, and I started separating that recently with the
atomic shutdown state changes in commit b36d4651e165 ("xfs: make
forced shutdown processing atomic"). The changes in that commit
series are directly responsible for being able to diagnose this
issue because it clearly separated mount shutdown from log shutdown.

Essentially, once we start cancelling writeback of log items and
removing them from the AIL because the filesystem is shut down, we
*cannot* update the journal because we may have cancelled the items
that pin the tail of the log. That moves the tail of the log
forwards without having written the metadata back, hence we have
corrupt in memory state and writing to the journal propagates that
to the on-disk state.

What commit b36d4651e165 makes clear is that log item state needs to
change relative to log shutdown, not mount shutdown. IOWs, anything
that aborts metadata writeback needs to check log shutdown state
because log items directly affect log consistency. Having them check
mount shutdown state introduces the above race condition where we
cancel metadata writeback before the log shuts down.

To fix this, this patch works through all log items and converts
shutdown checks to use xlog_is_shutdown() rather than
xfs_is_shutdown(), so that we don't start aborting metadata
writeback before we shut off journal writes.

AFAICT, this race condition is a zero day IO error handling bug in
XFS that dates back to the introduction of XLOG_IO_ERROR,
XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:50 -07:00
Dave Chinner
8eda872110 xfs: AIL should be log centric
The AIL operates purely on log items, so it is a log centric
subsystem. Divorce it from the xfs_mount and instead have it pass
around xlog pointers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Dave Chinner
d86142dd7c xfs: log items should have a xlog pointer, not a mount
Log items belong to the log, not the xfs_mount. Convert the mount
pointer in the log item to a xlog pointer in preparation for
upcoming log centric changes to the log items.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Dave Chinner
70447e0ad9 xfs: async CIL flushes need pending pushes to be made stable
When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.

generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications. As a
result, the first inode unlinked in a buffer can pin the tail of the
log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
because the cil->xc_push_commit_stable was not set. This results in
the log/AIL effectively deadlocking until something triggers the
commit record iclog to be pushed to stable storage (i.e. the
periodic log worker calling xfs_log_force()).

The fix is two-fold - first we should always set the
cil->xc_push_commit_stable when xlog_cil_flush() is called,
regardless of whether there is already a pending push or not.

Second, if the CIL is empty, we should trigger an iclog flush to
ensure that the iclogs of the last checkpoint have actually been
submitted to disk as that checkpoint may not have been run under
stable completion constraints.

Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Dave Chinner
941fbdfd6d xfs: xfs_ail_push_all_sync() stalls when racing with updates
xfs_ail_push_all_sync() has a loop like this:

while max_ail_lsn {
	prepare_to_wait(ail_empty)
	target = max_ail_lsn
	wake_up(ail_task);
	schedule()
}

Which is designed to sleep until the AIL is emptied. When
xfs_ail_update_finish() moves the tail of the log, it does:

	if (list_empty(&ailp->ail_head))
		wake_up_all(&ailp->ail_empty);

So it will only wake up the sync push waiter when the AIL goes
empty. If, by the time the push waiter has woken, the AIL has more
in it, it will reset the target, wake the push task and go back to
sleep.

The problem here is that if the AIL is having items added to it
when xfs_ail_push_all_sync() is called, then they may get inserted
into the AIL at a LSN higher than the target LSN. At this point,
xfsaild_push() will see that the target is X, the item LSNs are
(X+N) and skip over them, hence never pushing the out.

The result of this the AIL will not get emptied by the AIL push
thread, hence xfs_ail_finish_update() will never see the AIL being
empty even if it moves the tail. Hence xfs_ail_push_all_sync() never
gets woken and hence cannot update the push target to capture the
items beyond the current target on the LSN.

This is a TOCTOU type of issue so the way to avoid it is to not
use the push target at all for sync pushes. We know that a sync push
is being requested by the fact the ail_empty wait queue is active,
hence the xfsaild can just set the target to max_ail_lsn on every
push that we see the wait queue active. Hence we no longer will
leave items on the AIL that are beyond the LSN sampled at the start
of a sync push.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Dave Chinner
dbd0f52993 xfs: check buffer pin state after locking in delwri_submit
AIL flushing can get stuck here:

[316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
[316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
[316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
[316649.014112] Call Trace:
[316649.014841]  <TASK>
[316649.015492]  __schedule+0x30d/0x9e0
[316649.017745]  schedule+0x55/0xd0
[316649.018681]  io_schedule+0x4b/0x80
[316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
[316649.021850]  __xfs_buf_submit+0x14a/0x230
[316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
[316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
[316649.025931]  xfsaild+0x27e/0x9d0
[316649.028283]  kthread+0xf6/0x120
[316649.030602]  ret_from_fork+0x1f/0x30

in the situation where flushing gets preempted between the unpin
check and the buffer trylock under nowait conditions:

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>
			if (!xfs_buf_trylock(bp))
				continue;

This means submission is stuck until something else triggers a log
force to unpin the buffer.

To get onto the delwri list to begin with, the buffer pin state has
already been checked, and hence it's relatively rare we get a race
between flushing and encountering a pinned buffer in delwri
submission to begin with. Further, to increase the pin count the
buffer has to be locked, so the only way we can hit this race
without failing the trylock is to be preempted between the pincount
check seeing zero and the trylock being run.

Hence to avoid this problem, just invert the order of trylock vs
pin check. We shouldn't hit that many pinned buffers here, so
optimising away the trylock for pinned buffers should not matter for
performance at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Dave Chinner
a9a4bc8c76 xfs: log worker needs to start before intent/unlink recovery
After 963 iterations of generic/530, it deadlocked during recovery
on a pinned inode cluster buffer like so:

XFS (pmem1): Starting recovery (logdev: internal)
INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
      Not tainted 5.17.0-rc6-dgc+ #975
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iunlink_map_ino+0x66/0xd0
 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
 xfs_iunlink_remove_inode+0xf2/0x1d0
 xfs_inactive_ifree+0x1a3/0x900
 xfs_inode_unlink+0xcc/0x210
 xfs_inodegc_worker+0x1ac/0x2f0
 process_one_work+0x1ac/0x390
 worker_thread+0x56/0x3c0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30
 </TASK>
task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iget+0x300/0xb40
 xlog_recover_process_one_iunlink+0x4c/0x170
 xlog_recover_process_iunlinks.isra.0+0xee/0x130
 xlog_recover_finish+0x57/0x110
 xfs_log_mount_finish+0xfc/0x1e0
 xfs_mountfs+0x540/0x910
 xfs_fs_fill_super+0x495/0x850
 get_tree_bdev+0x171/0x270
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x304/0xba0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>
task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 io_schedule+0x4b/0x80
 xfs_buf_wait_unpin+0x9e/0xf0
 __xfs_buf_submit+0x14a/0x230
 xfs_buf_delwri_submit_buffers+0x107/0x280
 xfs_buf_delwri_submit_nowait+0x10/0x20
 xfsaild+0x27e/0x9d0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30

We have the mount process waiting on an inode cluster buffer read,
inodegc doing unlink waiting on the same inode cluster buffer, and
the AIL push thread blocked in writeback waiting for the inode
cluster buffer to become unpinned.

What has happened here is that the AIL push thread has raced with
the inodegc process modifying, committing and pinning the inode
cluster buffer here in xfs_buf_delwri_submit_buffers() here:

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>
			if (!xfs_buf_trylock(bp))
				continue;

Basically, the AIL has found the buffer wasn't pinned and got the
lock without blocking, but then the buffer was pinned. This implies
the processing here was pre-empted between the pin check and the
lock, because the pin count can only be increased while holding the
buffer locked. Hence when it has gone to submit the IO, it has
blocked waiting for the buffer to be unpinned.

With all executing threads now waiting on the buffer to be unpinned,
we normally get out of situations like this via the background log
worker issuing a log force which will unpinned stuck buffers like
this. But at this point in recovery, we haven't started the log
worker. In fact, the first thing we do after processing intents and
unlinked inodes is *start the log worker*. IOWs, we start it too
late to have it break deadlocks like this.

Avoid this and any other similar deadlock vectors in intent and
unlinked inode recovery by starting the log worker before we recover
intents and unlinked inodes. This part of recovery runs as though
the filesystem is fully active, so we really should have the same
infrastructure running as we normally do at runtime.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
Matthew Wilcox (Oracle)
46de8b9794 fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
This is a mechanical change.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-16 13:37:05 -04:00
Matthew Wilcox (Oracle)
187c82cb03 fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
5660a8630d fs: Remove noop_invalidatepage()
We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)
d82354f6b0 iomap: Remove iomap_invalidatepage()
Use iomap_invalidate_folio() in all the iomap-based filesystems
and rename the iomap_invalidatepage tracepoint.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:29 -04:00
Masahiro Yamada
83a44a4f47 x86: Remove toolchain check for X32 ABI capability
Commit 0bf6276392e9 ("x32: Warn and disable rather than error if
binutils too old") added a small test in arch/x86/Makefile because
binutils 2.22 or newer is needed to properly support elf32-x86-64. This
check is no longer necessary, as the minimum supported version of
binutils is 2.23, which is enforced at configuration time with
scripts/min-tool-version.sh.

Remove this check and replace all uses of CONFIG_X86_X32 with
CONFIG_X86_X32_ABI, as two symbols are no longer necessary.

[nathan: Rebase, fix up a few places where CONFIG_X86_X32 was still
         used, and simplify commit message to satisfy -tip requirements]

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220314194842.3452-2-nathan@kernel.org
2022-03-15 10:32:48 +01:00
Darrick J. Wong
744e6c8ada xfs: constify xfs_name_dotdot
The symbol xfs_name_dotdot is a global variable that the xfs codebase
uses here and there to look up directory dotdot entries.  Currently it's
a non-const variable, which means that it's a mutable global variable.
So far nobody's abused this to cause problems, but let's use the
compiler to enforce that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
996b2329b2 xfs: constify the name argument to various directory functions
Various directory functions do not modify their @name parameter,
so mark it const to make that clear.  This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
41667260bc xfs: reserve quota for target dir expansion when renaming files
XFS does not reserve quota for directory expansion when renaming
children into a directory.  This means that we don't reject the
expansion with EDQUOT when we're at or near a hard limit, which means
that unprivileged userspace can use rename() to exceed quota.

Rename operations don't always expand the target directory, and we allow
a rename to proceed with no space reservation if we don't need to add a
block to the target directory to handle the addition.  Moreover, the
unlink operation on the source directory generally does not expand the
directory (you'd have to free a block and then cause a btree split) and
it's probably of little consequence to leave the corner case that
renaming a file out of a directory can increase its size.

As with link and unlink, there is a further bug in that we do not
trigger the blockgc workers to try to clear space when we're out of
quota.

Because rename is its own special tricky animal, we'll patch xfs_rename
directly to reserve quota to the rename transaction.  We'll leave
cleaning up the rest of xfs_rename for the metadata directory tree
patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
871b9316e7 xfs: reserve quota for dir expansion when linking/unlinking files
XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory.  This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.

The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.

Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.

To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory.  If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode.  This should prevent
quota usage overruns with the least restriction in functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
dd3b015dd8 xfs: refactor user/group quota chown in xfs_setattr_nonsize
Combine if tests to reduce the indentation levels of the quota chown
calls in xfs_setattr_nonsize.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
e014f37db1 xfs: use setattr_copy to set vfs inode attributes
Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
revocation isn't consistent with btrfs[1] or ext4.  Those two
filesystems use the VFS function setattr_copy to convey certain
attributes from struct iattr into the VFS inode structure.

Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
decide if it should clear setgid and setuid on a file attribute update.
This is a second symptom of the problem that Filipe noticed.

XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
xfs_setattr_nonsize, and xfs_setattr_time.  Regrettably, setattr_copy is
/not/ a simple copy function; it contains additional logic to clear the
setgid bit when setting the mode, and XFS' version no longer matches.

The VFS implements its own setuid/setgid stripping logic, which
establishes consistent behavior.  It's a tad unfortunate that it's
scattered across notify_change, should_remove_suid, and setattr_copy but
XFS should really follow the Linux VFS.  Adapt XFS to use the VFS
functions and get rid of the old functions.

[1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/

Fixes: 7fa294c8991c ("userns: Allow chown and setgid preservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-03-14 10:23:16 -07:00
Darrick J. Wong
eba0549bc7 xfs: don't generate selinux audit messages for capability testing
There are a few places where we test the current process' capability set
to decide if we're going to be more or less generous with resource
acquisition for a system call.  If the process doesn't have the
capability, we can continue the call, albeit in a degraded mode.

These are /not/ the actual security decisions, so it's not proper to use
capable(), which (in certain selinux setups) causes audit messages to
get logged.  Switch them to has_capability_noaudit.

Fixes: 7317a03df703f ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom")
Fixes: ea9a46e1c4925 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-03-09 10:32:06 -08:00
Gao Xiang
1a39ae415c xfs: add missing cmap->br_state = XFS_EXT_NORM update
COW extents are already converted into written real extents after
xfs_reflink_convert_cow_locked(), therefore cmap->br_state should
reflect it.

Otherwise, there is another necessary unwritten convertion
triggered in xfs_dio_write_end_io() for direct I/O cases.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-09 10:32:06 -08:00
Linus Torvalds
3bd9dd8138 Bug fixes for 5.17-rc4:
- Only call sync_filesystem when we're remounting the filesystem
    readonly readonly, and actually check its return value.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmIEnhUACgkQ+H93GTRK
 tOsn7g//e3R/lqpkx6jJtg6SqiC1KiI9euwD0wBdIvrCWSJZ6IdjOSvfRRG13vN0
 S1spU0joiLLlVhzLIQdysgZkRub57P0mRmq3zVpHYxMOWKBvPH1OmZtdu83HOiAv
 /zjy3tNFc/1ZaqHudAZv3+4780qMZtQTmL7DbgLnvFspCBf4PdBlT0d7Wbf982w8
 dMWF7Pla8DhLVFbMsGdyXlnGROz+pw3jofVwY9P6f+PaY37mo+lZu65GrTlNecnc
 QfTyX45VAWFO/XZtXm7pXCr8211eK2SnrOFZXZH9u3qxSD5vo1NWf9KPKVkYxc8/
 7icz+Yp5t61HQg3o0z7cNAQZp7CQl0BWz6gp2YXMWHS3ZJMnd6H4zTDBdV2MSA5/
 alT4kcwncRVcmHtFET7JAsnQkWNeREBqhqCRoAf8hW8uxpjkXw6sPop7+hbZtoJw
 VAp1TxbEMbPGTZb76Kw4nZt1eZ3SyJOl6ByzsJMxekEFiMYVh4yxO+a3Q6KNOkMM
 O62JpzdE1EeFgV7qmoZ8QzCZuD7z7KC99iv5QtyacFITCqv5y0h/RLGCsOwJ0EMc
 fJGKN7uQOZrBIJYInx53S7fCYGGMm0+HUUXMUatBe4RK3dADyqapLzQb0tCGamAf
 NQra6NotwfNq8SN+Sn17PJ1KifSRKfw6l7Q+6pt9LA2eVbr2jV4=
 =6ODm
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Nothing exciting, just more fixes for not returning sync_filesystem
  error values (and eliding it when it's not necessary).

  Summary:

   - Only call sync_filesystem when we're remounting the filesystem
     readonly readonly, and actually check its return value"

* tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: only bother with sync_filesystem during readonly remount
2022-02-26 09:53:19 -08:00
Gustavo A. R. Silva
5224f79096 treewide: Replace zero-length arrays with flexible-array members
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

This code was transformed with the help of Coccinelle:
(next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch)

@@
identifier S, member, array;
type T1, T2;
@@

struct S {
  ...
  T1 member;
  T2 array[
- 0
  ];
};

UAPI and wireless changes were intentionally excluded from this patch
and will be sent out separately.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2022-02-17 07:00:39 -06:00
Darrick J. Wong
b97cca3ba9 xfs: only bother with sync_filesystem during readonly remount
In commit 02b9984d6408, we pushed a sync_filesystem() call from the VFS
into xfs_fs_remount.  The only time that we ever need to push dirty file
data or metadata to disk for a remount is if we're remounting the
filesystem read only, so this really could be moved to xfs_remount_ro.

Once we've moved the call site, actually check the return value from
sync_filesystem.

Fixes: 02b9984d6408 ("fs: push sync_filesystem() down to the file system's remount_fs()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-02-09 21:07:24 -08:00
Linus Torvalds
fbc04bf01a Fixes for 5.17-rc3:
- Fix fallocate so that it drops all file privileges when files are
    modified instead of open-coding that incompletely.
  - Fix fallocate to flush the log if the caller wanted synchronous file
    updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmH7/7wACgkQ+H93GTRK
 tOut/BAAl2aLNDdBjjp9rBbNdnZtXSQ3JKBo/6FrGX+7M2rr5/Ppskzfm2FE6HQj
 TS4CNAg1Gid/gbK956iOU+E3LvGR1lUA3yhtm0OZ7yffQ+1zGA2RUCTy39SZEUTI
 PTpzF6gH/oZAKrfv2r9yLEFJdPyBoo8BBBHBRshw5TiDqMG1pJ3toTySehLXHsRf
 Bf4pXGyOy0+q9zkIprUQg5l3cEEdoOW5mLwv/1soiLx+mz6Go/MHFaAO/U3m2KBq
 mwNUZcXoRSihyd7xKEqLQfYuy7abrtICsupoMGvCzFI1ox1ZzPN5/aNAI1LBXz8+
 tK+gRaJg7stGVuI28zIEw4mmcLWG+h46fyuunhY6EIcoXZzpcXL9gxQzle6ypE8e
 ScRtRnjUd4CsDyRNzDez082K1T6M6pD3QXMWPn29/WnjKElbEKwqDRjri4HOqwyq
 A3buwhvuzK2ZS8f2k1/YLrNqZLCZwd9LRsz085GI7HUp7cardiTFERNzt4Nt935k
 T6lDFNKXVW9MZvSXGWNGmwHAQRvmkW+i7pyhE1LCsTwF0Q6YNwbhZAl9SJK5S5Zk
 TJEam4d6xY86Ppu7kXcYWtv8gJxMJyiGbywI8xZDz1VrkwNbfB0ZK/8A9KEMGc1r
 fuUxTZF12BY5Sm5YaxlmrDnkkvn3K9CHQ3DI7BiP+IldwDdhuOU=
 =dqTg
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "I was auditing operations in XFS that clear file privileges, and
  realized that XFS' fallocate implementation drops suid/sgid but
  doesn't clear file capabilities the same way that file writes and
  reflink do.

  There are VFS helpers that do it correctly, so refactor XFS to use
  them. I also noticed that we weren't flushing the log at the correct
  point in the fallocate operation, so that's fixed too.

  Summary:

   - Fix fallocate so that it drops all file privileges when files are
     modified instead of open-coding that incompletely.

   - Fix fallocate to flush the log if the caller wanted synchronous
     file updates"

* tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: ensure log flush at the end of a synchronous fallocate call
  xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
  xfs: set prealloc flag in xfs_alloc_file_space()
  xfs: fallocate() should call file_modified()
  xfs: remove XFS_PREALLOC_SYNC
  xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
2022-02-05 09:21:55 -08:00
Linus Torvalds
ea7b3e6d42 Fixes for 5.17-rc3:
- Fix a bug where callers of ->sync_fs (e.g. sync_filesystem and
    syncfs(2)) ignore the return value.
  - Fix a bug where callers of sync_filesystem (e.g. fs freeze) ignore
    the return value.
  - Fix a bug in XFS where xfs_fs_sync_fs never passed back error
    returns.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmH2xBcACgkQ+H93GTRK
 tOvcFQ//Wo3mMeyte2IkA4Km6tKxB5TMk3oEAYRcp5H2w4tmAnqrgukFRHR3kee8
 b7/qBqAeI66FsneT74Rvb5gR+X2gPgiwKg88ZKmCnufJruL2YNL5D0u7e8sTaqO6
 DLIf8EXw2l9KZOsMjPy2wBbflvufhM1LciXR25YwslYXUySaIDqNJ6R58Xyx7xni
 WNNxkM4rHry2AoLBGnyLti88xguLCSGdS5vEyiNj/dKHZn47/J8xXPppFyDyzGA2
 2oVfGCIgUZGL/Hy99iM8kTIN50GS0Eq1tiM6qUnad2U8KUWj1g2l9yqL6pCC3iuT
 PLn4iB7IDV6zs+FOLdgEYLp7iw/2BxUzocBTaZQlZm0VMJ8UYQEvOeecnOqnly+9
 tt12ZLk3iFVSTUejw3PofqrHc2nTUJr8ojrzKrwIc0Pmur6YlIlO99RHUxy+Li4o
 IVh+D5cqOiwCul5cdGk9120dBACEICI+Q70vjYpnwrgjozHg1tML8d1VtPI9+fs3
 fSzjkLlIc+vgLYgsxvmZg8kSEd2hDb3lq2aSirx42aaEvX94QDly/SPNz78FdehZ
 JWrSewp8rBAW8+yvfBKKcpyhCyULLCGIQjKOziuwLpjjEHMOd6CpJ5w/2c/fKxzy
 B775+6lSb+P/FG3ikW7PKYNa5wWfgp0/Maz79XiuAcUJI5FK9R8=
 =BdrI
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull vfs fixes from Darrick Wong:
 "I was auditing the sync_fs code paths recently and noticed that most
  callers of ->sync_fs ignore its return value (and many implementations
  never return nonzero even if the fs is broken!), which means that
  internal fs errors and corruption are not passed up to userspace
  callers of syncfs(2) or FIFREEZE. Hence fixing the common code and
  XFS, and I'll start working on the ext4/btrfs folks if this is merged.

  Summary:

   - Fix a bug where callers of ->sync_fs (e.g. sync_filesystem and
     syncfs(2)) ignore the return value.

   - Fix a bug where callers of sync_filesystem (e.g. fs freeze) ignore
     the return value.

   - Fix a bug in XFS where xfs_fs_sync_fs never passed back error
     returns"

* tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: return errors in xfs_fs_sync_fs
  quota: make dquot_quota_sync return errors from ->sync_fs
  vfs: make sync_filesystem return errors from ->sync_fs
  vfs: make freeze_super abort when sync_filesystem returns error
2022-02-05 09:13:51 -08:00
Christoph Hellwig
49add4966d block: pass a block_device and opf to bio_init
Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment.  A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Christoph Hellwig
07888c665b block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Dave Chinner
cea267c235 xfs: ensure log flush at the end of a synchronous fallocate call
Since we've started treating fallocate more like a file write, we
should flush the log to disk if the user has asked for synchronous
writes either by setting it via fcntl flags, or inode flags, or with
the sync mount option.  We've already got a helper for this, so use
it.

[The original patch by Darrick was massaged by Dave to fit this patchset]

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
b39a04636f xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
The operations that xfs_update_prealloc_flags() perform are now
unique to xfs_fs_map_blocks(), so move xfs_update_prealloc_flags()
to be a static function in xfs_pnfs.c and cut out all the
other functionality that is doesn't use anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
0b02c8c0d7 xfs: set prealloc flag in xfs_alloc_file_space()
Now that we only call xfs_update_prealloc_flags() from
xfs_file_fallocate() in the case where we need to set the
preallocation flag, do this in xfs_alloc_file_space() where we
already have the inode joined into a transaction and get
rid of the call to xfs_update_prealloc_flags() from the fallocate
code.

This also means that we now correctly avoid setting the
XFS_DIFLAG_PREALLOC flag when xfs_is_always_cow_inode() is true, as
these inodes will never have preallocated extents.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
fbe7e52003 xfs: fallocate() should call file_modified()
In XFS, we always update the inode change and modification time when
any fallocate() operation succeeds.  Furthermore, as various
fallocate modes can change the file contents (extending EOF,
punching holes, zeroing things, shifting extents), we should drop
file privileges like suid just like we do for a regular write().
There's already a VFS helper that figures all this out for us, so
use that.

The net effect of this is that we no longer drop suid/sgid if the
caller is root, but we also now drop file capabilities.

We also move the xfs_update_prealloc_flags() function so that it now
is only called by the scope that needs to set the the prealloc flag.

Based on a patch from Darrick Wong.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
472c6e46f5 xfs: remove XFS_PREALLOC_SYNC
Callers can acheive the same thing by calling xfs_log_force_inode()
after making their modifications. There is no need for
xfs_update_prealloc_flags() to do this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Darrick J. Wong
29d650f7e3 xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
Syzbot tripped over the following complaint from the kernel:

WARNING: CPU: 2 PID: 15402 at mm/util.c:597 kvmalloc_node+0x11e/0x125 mm/util.c:597

While trying to run XFS_IOC_GETBMAP against the following structure:

struct getbmap fubar = {
	.bmv_count	= 0x22dae649,
};

Obviously, this is a crazy huge value since the next thing that the
ioctl would do is allocate 37GB of memory.  This is enough to make
kvmalloc mad, but isn't large enough to trip the validation functions.
In other words, I'm fussing with checks that were **already sufficient**
because that's easier than dealing with 644 internal bug reports.  Yes,
that's right, six hundred and forty-four.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
2022-01-31 13:20:58 -08:00
Darrick J. Wong
2d86293c70 xfs: return errors in xfs_fs_sync_fs
Now that the VFS will do something with the return values from
->sync_fs, make ours pass on error codes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
2022-01-30 08:59:47 -08:00
Dave Chinner
ebb7fb1557 xfs, iomap: limit individual ioend chain lengths in writeback
Trond Myklebust reported soft lockups in XFS IO completion such as
this:

 watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
 CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
 Workqueue: xfs-conv/md127 xfs_end_io [xfs]
 RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
 Call Trace:
  wake_up_page_bit+0x8a/0x110
  iomap_finish_ioend+0xd7/0x1c0
  iomap_finish_ioends+0x7f/0xb0
  xfs_end_ioend+0x6b/0x100 [xfs]
  xfs_end_io+0xb9/0xe0 [xfs]
  process_one_work+0x1a7/0x360
  worker_thread+0x1fa/0x390
  kthread+0x116/0x130
  ret_from_fork+0x35/0x40

Ioends are processed as an atomic completion unit when all the
chained bios in the ioend have completed their IO. Logically
contiguous ioends can also be merged and completed as a single,
larger unit.  Both of these things can be problematic as both the
bio chains per ioend and the size of the merged ioends processed as
a single completion are both unbound.

If we have a large sequential dirty region in the page cache,
write_cache_pages() will keep feeding us sequential pages and we
will keep mapping them into ioends and bios until we get a dirty
page at a non-sequential file offset. These large sequential runs
can will result in bio and ioend chaining to optimise the io
patterns. The pages iunder writeback are pinned within these chains
until the submission chaining is broken, allowing the entire chain
to be completed. This can result in huge chains being processed
in IO completion context.

We get deep bio chaining if we have large contiguous physical
extents. We will keep adding pages to the current bio until it is
full, then we'll chain a new bio to keep adding pages for writeback.
Hence we can build bio chains that map millions of pages and tens of
gigabytes of RAM if the page cache contains big enough contiguous
dirty file regions. This long bio chain pins those pages until the
final bio in the chain completes and the ioend can iterate all the
chained bios and complete them.

OTOH, if we have a physically fragmented file, we end up submitting
one ioend per physical fragment that each have a small bio or bio
chain attached to them. We do not chain these at IO submission time,
but instead we chain them at completion time based on file
offset via iomap_ioend_try_merge(). Hence we can end up with unbound
ioend chains being built via completion merging.

XFS can then do COW remapping or unwritten extent conversion on that
merged chain, which involves walking an extent fragment at a time
and running a transaction to modify the physical extent information.
IOWs, we merge all the discontiguous ioends together into a
contiguous file range, only to then process them individually as
discontiguous extents.

This extent manipulation is computationally expensive and can run in
a tight loop, so merging logically contiguous but physically
discontigous ioends gains us nothing except for hiding the fact the
fact we broke the ioends up into individual physical extents at
submission and then need to loop over those individual physical
extents at completion.

Hence we need to have mechanisms to limit ioend sizes and
to break up completion processing of large merged ioend chains:

1. bio chains per ioend need to be bound in length. Pure overwrites
go straight to iomap_finish_ioend() in softirq context with the
exact bio chain attached to the ioend by submission. Hence the only
way to prevent long holdoffs here is to bound ioend submission
sizes because we can't reschedule in softirq context.

2. iomap_finish_ioends() has to handle unbound merged ioend chains
correctly. This relies on any one call to iomap_finish_ioend() being
bound in runtime so that cond_resched() can be issued regularly as
the long ioend chain is processed. i.e. this relies on mechanism #1
to limit individual ioend sizes to work correctly.

3. filesystems have to loop over the merged ioends to process
physical extent manipulations. This means they can loop internally,
and so we break merging at physical extent boundaries so the
filesystem can easily insert reschedule points between individual
extent manipulations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-26 09:19:20 -08:00
Linus Torvalds
1cb69c8044 New code for 5.17:
- Minor cleanup of ioctl32 cruft
 - Clean up open coded inodegc workqueue function calls
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHpnxUACgkQ+H93GTRK
 tOtpig/9GzzlPSpbjGMU3BW+dfqvYMLV9YcB3bkZLXPJIb/D2UPXzOQd2SvM9ScJ
 XwM857EjrmV7XifEezbBvla33N9ToNv8/IVdmctcc2as4c6VzDXmkiyze7wD+2D/
 i9UJEOjEtaTvqioOMQJpfaATI7y/9N45UFfcjkbmc4lTP0xOvd1Owx9+ujrCYZ9e
 UccsC7ovUzHxxuK1uFhRMI0up1f5urdFA0wFKvEGsfhIzBthU6BDtYUjuzhC/vLV
 DxqPlQgkRqQfzB9loFQOiNNSC2kVF+SKPXZS7k/HiQIALDOKtFNnSanXG5oPB5+d
 DnuSCvJ+RfejmEoDtMydQu0KXN5fq9g7FodH3tAMPPNn8N1l+qJuC//+6VHZjvIB
 GhfHyWkpg/mzSAAqU0sqeGSvzG1byawYNUzueIIcEWLP1qg4O3QkXW7VJ+yuGb70
 gsO7CQjnW20CmJqeftNsBaBor114UjUJLBQsXR8uowKHq/L1MhNh7Ir6lTfQdti+
 S099EVjrrFV0ZVRou1ZyQ5ZWsdGNtz0p0hEEVEPG2ERHOKfbVWLwbL38s8JY7C56
 hBVwQRwZScdIb/SYTWxPXbBJNR/BXUyxrCikk66isQT0kwFCHXE67YaXbFklxybs
 uzKzVYEBPtYhvDeFTa53GEotGw/LDtZD07E58yowlJUqmJtMLM0=
 =EIqW
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "One of the patches removes some dead code from xfs_ioctl32.h and the
  other fixes broken workqueue flushing in the inode garbage collector.

   - Minor cleanup of ioctl32 cruft

   - Clean up open coded inodegc workqueue function calls"

* tag 'xfs-5.17-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: flush inodegc workqueue tasks before cancel
  xfs: remove unused xfs_ioctl32.h declarations
2022-01-22 11:04:27 +02:00
Linus Torvalds
31d949782e Withdraw the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl definitions.
Remove the header definitions for these ioctls.  The just-removed
 implementation has allowed callers to read stale disk contents for more
 than **21 years** and nobody noticed or complained, which implies a lack
 of users aside from exploit programs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlp28ACgkQ+H93GTRK
 tOt77Q/+KP7XtwaloGbkBfMzk9JhVh4wISagQ8BxZ5Jb8cHg9ekWJz585he9siNY
 mfcuaLh5onWiBRsfqo4RXg03X4E3c/U+Q3oqe405O3TZkK2LOZMjkPE1ijDdTej6
 V9WhCTCRGuJA01sKgFuuwFJxJRjIROE29FPgRQP9EvLtEITIxLbHMFZRYKpYh143
 EhNwzQQwwPg/L4/m1qmWfC3L+Z7qTm6EQhOnpyzxlKxyX7qXIjBHi6WvErcDeJ3F
 pS7v1fwcZxctrh7PrKwhXSrkbMmd5J3p5qI/MrCtGEKWNXk+rv6AC0n4gcXvpJ/v
 wL0OTyik9pwA8V0XPQcuWvQXmrm8vR2XvMok6gXkHB1jCfzYAwJsrHQbop4pyCFe
 U3HU46x0g7UFXY7jUjztD8YNIT1+B+ducetCCGAhI97HiQrSsqSvgvPFZNle7Cef
 Oheab4iIs1zUblNrVzyGCQmK42ankypxPbfrrtvLi7SFrLRAGXeWeqDf1RXJnt5b
 xrOqCe1hgXR4RJrkTPWiiQindLlhDuywfa+Q1Y5fYZatsTtgceE/HIOg80x4pPgR
 4Ip7hW9lsjoDckpu0bC0bvYiqhrYM1eztpUToYdy7FeOkQKkPHO9xm/m1tHbqzmi
 bF3hkBo6bLByXiY/ZXzrQGrErJ6OTdNVpsR1vYjoaycrQt6wznI=
 =hq3i
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs irix ioctl housecleaning from Darrick Wong:
 "Withdraw the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl definitions.

  This is the third and final of a series of small pull requests that
  perform some long overdue housecleaning of XFS ioctls. This time,
  we're withdrawing all variants of the ALLOCSP and FREESP ioctls from
  XFS' userspace API. This might be a little premature since we've only
  just removed the functionality, but as I pointed out in the last pull
  request, nobody (including fstests) noticed that it was broken for 20
  years.

  In response to the patch, we received a single comment from someone
  who stated that they 'augment' the ioctl for their own purposes, but
  otherwise acquiesced to the withdrawal. I still want to try to clobber
  these old ioctl definitions in 5.17.

  So remove the header definitions for these ioctls. The just-removed
  implementation has allowed callers to read stale disk contents for
  more than **21 years** and nobody noticed or complained, which implies
  a lack of users aside from exploit programs"

* tag 'xfs-5.17-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove the XFS_IOC_{ALLOC,FREE}SP* definitions
2022-01-21 08:51:48 +02:00
Linus Torvalds
d701a8ccac Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families.
Linux has always used fallocate as the space management system call,
 whereas these Irix legacy ioctls only ever worked on XFS, and have been
 the cause of recent stale data disclosure vulnerabilities.  As
 equivalent functionality is available elsewhere, remove the code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlplQACgkQ+H93GTRK
 tOtAcRAAg11WggF9ycNLwnczUs4NmTV1cwhz8+eTuwr2yul3gl/mrO3MyjMmkrnm
 1rXjwg28GKtps04Ugh+8TTL+QkDn6Uteco27OZbmUf00a0MoC7JG4VkEQVtXjcaK
 zvevfutTH7Vnl49m+YBrLtonrTqmND46quKoPKsv0a5nlbXHSNMouUkayWXDSyOl
 8tRcNWLy76L+XCxEU21cD1NBw3Vr0mCiId4xTcbNFw3TUVAGoZgghzC2d/gHFiwN
 1PM7G51TKUNm3dybH0mt/jLF/fLsVxFnznnlW4bb/XzMuU4geqd0r1AQuIdbwZa9
 uB+PkFWwN5frTEFELYTamAa4LlAe2oQ0hmSGLfC/zEtPcOv4h6qHNgRsN9wfG+H9
 oYUeRY+2zHcD7jYJsaZZt5WCIDVncOlJMclRdpbpujkJzJX9ZjAi++PTgDxdMjFa
 egwDAvOdgijgtz8erN0gglJrqJzQQp6ByNtht5rZjHz7LkrWYtt57TOoS986pW7X
 /MwBLjT/4Xig/XaFVrmMohF3VPrG/eH/DpTnHotzQzZRYQWbKZwCgin6+kKC8cV8
 Y+eE1jKZunL4Ms/GmrxencNzsDSJtkKyR5LkHCqgH8YUPJM3vYDcleZY+UgEKq0a
 z0fw3MZvxM2jsUIk7+J8uQ8esSqUb5hNXkUJsUraUtG3Z6ZeaOg=
 =2QZ3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs irix ioctl housecleaning from Darrick Wong:
 "Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families.

  This is the second of a series of small pull requests that perform
  some long overdue housecleaning of XFS ioctls. This time, we're
  vacating the implementation of all variants of the ALLOCSP and FREESP
  ioctls, which are holdovers from EFS in Irix, circa 1993. Roughly
  equivalent functionality have been available for both ioctls since
  2.6.25 (April 2008):

   - XFS_IOC_FREESP ftruncates a file.

   - XFS_IOC_ALLOCSP is the equivalent of fallocate.

  As noted in the fix patch for CVE 2021-4155, the ALLOCSP ioctl has
  been serving up stale disk blocks since 2000, and in 21 years
  **nobody** noticed. On those grounds I think it's safe to vacate the
  implementation.

  Note that we lose the ability to preallocate and truncate relative to
  the current file position, but as nobody's ever implemented that for
  the VFS, I conclude that it's not in high demand.

  Linux has always used fallocate as the space management system call,
  whereas these Irix legacy ioctls only ever worked on XFS, and have
  been the cause of recent stale data disclosure vulnerabilities. As
  equivalent functionality is available elsewhere, remove the code"

* tag 'xfs-5.17-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls
2022-01-21 08:47:25 +02:00
Linus Torvalds
12a8fb20f1 Withdraw the ioctl definition for the FSSETDM ioctl.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlpbEACgkQ+H93GTRK
 tOs8MBAAjOyUswNqVMYgMB6eww7oT8j54jGdTmo+d+N2hIoX4OAV6GdCI6te/7DE
 yz6TtzbQ9fs9lDYsdRsMjfwr18R8N1m0D23cpFSJ6qq6e2NJi0mqVwmC8mPRNAJH
 68yL0p6O4Tk5jj7kae+zsv50nsSANXclfoHjsZQ0DGOuGagmPkNVqT82KIC4n3dn
 aw66xhuaiLLJE+4boLe4NexRVBbyOuHQ2uB7xUnKwc9tHvjAf8EFCIhUV1wp0qZf
 CfA2wg8+Jzwrqz/gVRKUZOjz7LeIY6E2qCBrA+DATv2dcv7QhmvGHaQ9OkrvIE72
 CbvI92IhvOcKFzpfMrRYGhOh7KE6SkxLGqsAXgnjPoFQkCDudgCaExBO96RMMd6u
 cX43mXWZbUl++Sh2GhPD/xkiskLRZFjiHJbKBX/5nwjU2BzTHQY/7Jy07fIkR4c4
 IrkKgiXfSJT4j/KeAMkBpZ7THMjRMSUgwliSWHL0QWUz5Bou8WRnHUl8CMsu9vDJ
 fYeekXDQYuAX+UrcsDlbA0UukigOLSIZiQTAEgSbIkd/+Zb6U6e0IF7pTTZJ9uFs
 bndLFYqZtEAySDrMCBM+W8VYmR48EDxfN8xsdS1kbZIqEdNhmkEMj9tMf0rs+FRi
 lo1vMi08O7VcuyiyNrKs0e1d1Gkd2jwmwIskSQweslP5BbfPzRE=
 =d+fT
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs ioctl housecleaning from Darrick Wong:
 "This is the first of a series of small pull requests that perform some
  long overdue housecleaning of XFS ioctls. This first pull request
  removes the FSSETDM ioctl, which was used to set DMAPI event
  attributes on XFS files. The DMAPI support has never been merged
  upstream and the implementation of FSSETDM itself was removed two
  years ago, so let's withdraw it completely.

   - Withdraw the ioctl definition for the FSSETDM ioctl"

* tag 'xfs-5.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove the XFS_IOC_FSSETDM definitions
2022-01-21 08:44:07 +02:00
Brian Foster
6191cf3ad5 xfs: flush inodegc workqueue tasks before cancel
The xfs_inodegc_stop() helper performs a high level flush of pending
work on the percpu queues and then runs a cancel_work_sync() on each
of the percpu work tasks to ensure all work has completed before
returning.  While cancel_work_sync() waits for wq tasks to complete,
it does not guarantee work tasks have started. This means that the
_stop() helper can queue and instantly cancel a wq task without
having completed the associated work. This can be observed by
tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
test:

	xfs_destroy_inode: ... ino 0x83 ...
	xfs_inode_set_need_inactive: ... ino 0x83 ...
	xfs_inodegc_stop: ...
	...
	xfs_inodegc_start: ...
	xfs_inodegc_worker: ...
	xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive
state set, but the inactivation work has not completed before the
inodegc mechanism stops. The inactivation doesn't actually occur
until the fs is unfrozen and the gc mechanism starts back up. Note
that this test requires fsfreeze to reproduce because xfs_freeze
indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first
tries to steal the pending bit, which does not succeed because the
bit has been set by queue_work_on(). Subsequently, it checks for
association of a pool workqueue from the work item under the pool
lock. This association is set at the point a work item is queued and
cleared when dequeued for processing. If the association exists, the
work item is removed from the queue and cancel_work_sync() returns
true. If the pwq association is cleared, the remove attempt assumes
the task is busy and retries (eventually returning false to the
caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before
cancel. However, since the _queue_all() already schedules each
underlying work item, the workqueue level helpers are sufficient to
achieve the same ordering effect. E.g., the inodegc enabled flag
prevents scheduling any further work in the _stop() case. Use the
drain_workqueue() helper in this particular case to make the intent
a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-19 14:58:26 -08:00
Darrick J. Wong
a8e422af69 xfs: remove unused xfs_ioctl32.h declarations
Remove these unused ia32 compat declarations; all the bits involved have
either been withdrawn or hoisted to the VFS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-01-18 10:18:36 -08:00