87633 Commits

Author SHA1 Message Date
Kent Overstreet
103ffe9aaf bcachefs: x-macro-ify inode flags enum
This lets us use bch2_prt_bitflags to print them out.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 13:12:18 -05:00
Kent Overstreet
d4c8bb69d0 bcachefs: Convert bch2_fs_open() to darray
Open coded dynamic arrays are deprecated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 13:12:17 -05:00
Kent Overstreet
0f0fc31238 bcachefs: Move __bch2_members_v2_get_mut to sb-members.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 13:12:17 -05:00
Kent Overstreet
59154f2c66 bcachefs: bch2_prt_datetime()
Improved, better named version of pr_time().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 13:12:08 -05:00
Kent Overstreet
bf61dcdfc1 bcachefs: CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y
BCACHEFS_DEBUG_TRANSACTIONS is useful, but it's too expensive to have on
by default - and it hasn't been coming up in bug reports.

Turn it off by default until we figure out a way to make it cheaper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
9fcdd23b6e bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usage
BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization
related to inode updates and fsync - document it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
d3c7727bb9 bcachefs: rebalance_work btree is not a snapshots btree
rebalance_work entries may refer to entries in the extents btree, which
is a snapshots btree, or they may also refer to entries in the reflink
btree, which is not.

Hence rebalance_work keys may use the snapshot field but it's not
required to be nonzero - add a new btree flag to reflect this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
01ccee225a bcachefs: Add missing printk newlines
This was causing error messages in -tools to not get printed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
5a53f851e6 bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entry
When we didn't find anything in the journal that we'd like to use, and
we're forced to use whatever we can find - that entry will have been a
JSET_NO_FLUSH entry with a garbage last_seq value, since it's not
normally used.

Initialize it to something sane, for bch2_fs_journal_start().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
ce3e9a8a10 bcachefs: .get_parent() should return an error pointer
Delete the useless check for inum == 0; we'll return -ENOENT without it,
which is what we want.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
4bd156c4b4 bcachefs: Fix bch2_delete_dead_inodes()
- the fsck_err() check for the filesystem being clean was incorrect,
   causing us to always fail to delete unlinked inodes
 - if a snapshot had been taken, the unlinked inode needs to be
   propagated to snapshot leaves so the unlink can happen there - fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
7cb2a7895d bcachefs: use swab40 for bch_backpointer.bucket_offset bitfield
The bucket_offset field of bch_backpointer is a 40-bit bitfield, but the
bch2_backpointer_swab() helper uses swab32. This leads to inconsistency
when an on-disk fs is accessed from an opposite endian machine.

As it turns out, we already have an internal swab40() helper that is
used from the bch_alloc_v4 swab callback. Lift it into the backpointers
header file and use it consistently in both places.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
0996c72a0f bcachefs: byte order swap bch_alloc_v4.fragmentation_lru field
A simple test to populate a filesystem on one CPU architecture and
fsck on an arch of the opposite byte order produces errors related
to the fragmentation LRU. This occurs because the 64-bit
fragmentation_lru field is not byte-order swapped when reads detect
that the on-disk/bset key values were written in opposite byte-order
of the current CPU.

Update the bch2_alloc_v4 swab callback to handle fragmentation_lru
as is done for other multi-byte fields. This doesn't affect existing
filesystems when accessed by CPUs of the same endianness because the
->swab() callback is only called when the bset flags indicate an
endianness mismatch between the CPU and on-disk data.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
2a4e749760 bcachefs: allow writeback to fill bio completely
The bcachefs folio writeback code includes a bio full check as well
as a fixed size check to determine when to split off and submit
writeback I/O. The inclusive check of the latter against the limit
means that writeback can submit slightly prematurely. This is not a
functional problem, but results in unnecessarily split I/Os and
extent merging.

This can be observed with a buffered write sized exactly to the
current maximum value (1MB) and with key_merging_disabled=1. The
latter prevents the merge from the second write such that a
subsequent check of the extent list shows a 1020k extent followed by
a contiguous 4k extent.

The purpose for the fixed size check is also undocumented and
somewhat obscure. Lift this check into a new helper that wraps the
bio check, fix the comparison logic, and add a comment to document
the purpose and how we might improve on this in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
0e91d3a6d5 bcachefs: fix odebug warn and lockdep splat due to on-stack rhashtable
Guenter Roeck reports a lockdep splat and DEBUG_OBJECTS_WORK related
warning when bch2_copygc_thread() initializes its rhashtable. The
lockdep splat relates to a warning print caused by the fact that the
rhashtable exists on the stack but is not annotated as so. This is
something that could be addressed by INIT_WORK_ONSTACK(), but
rhashtable doesn't expose that control and probably isnt worth the
churn for just one user. Instead, dynamically allocate the
buckets_in_flight structure and avoid the splat that way.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
e0fb0dccfd bcachefs: update alloc cursor in early bucket allocator
A recent bug report uncovered a scenario where a filesystem never
runs with freespace_initialized, and therefore the user observes
significantly degraded write performance by virtue of running the
early bucket allocator. The associated bug aside, the primary cause
of the performance drop in this particular instance is that the
early bucket allocator does not update the allocation cursor. This
means that every allocation walks the alloc btree from the first
bucket of the associated device looking for a bucket marked as free
space.

Update the early allocator code to set the alloc cursor to the last
processed position in the tree, similar to how the freelist
allocator behaves. With the alloc_cursor being updated, the retry
logic also needs to be updated to restart from the beginning of the
device when a free bucket is not available between the cursor and
the end of the device. Track the restart position in a first_bucket
variable to make the code a bit more easily readable and consistent
with the freelist allocator.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Brian Foster
385a82f62a bcachefs: serialize on cached key in early bucket allocator
bcachefs had a transient bug where freespace_initialized was not
properly being set, which lead to unexpected use of the early bucket
allocator at runtime. This issue has been fixed, but the existence
of it uncovered a coherency issue in the early bucket allocation
code that is somewhat related to how uncached iterators deal with
the key cache.

The problem itself manifests as occasional failure of generic/113
due to corruption, often seen as a duplicate backpointer or multiple
data types per-bucket error. The immediate cause of the error is a
racing bucket allocation along the lines of the following sequence:

- Task 1 selects key A in bch2_bucket_alloc_early() and schedules.
- Task 2 selects the same key A, but proceeds to complete the
  allocation and associated I/O, after which it releases the
  open_bucket.
- Task 1 resumes with key A, but does not recognize the bucket is
  now allocated because the open_bucket has been removed
  from the hash when it was released in the previous step.

This generally shouldn't happen because the allocating task updates
the alloc btree key before releasing the bucket. This is not
sufficient in this particular instance, however, because an uncached
iterator for a cached btree doesn't actually lock the key cache slot
when no key exists for a given slot in the cache. Thus the fact that
the allocation side updates the cached key means that multiple
uncached iters can stumble across the same alloc key and duplicate
the bucket allocation as described above.

This is something that probably needs a longer term fix in the
iterator code. As a short term fix, close the race through explicit
use of a cached iterator for likely allocation candidates. We don't
want to scan the btree with a cached iterator because that would
unnecessarily pollute the cache. This mitigates cache pollution by
primarily scanning the tree with an uncached iterator, but closes
the race by creating a key cache entry for any prospective slot
prior to the bucket allocation attempt (also similar to how
_alloc_freelist() works via try_alloc_bucket()). This survives many
iterations of generic/113 on a kernel hacked to always use the early
bucket allocator.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
f82755e4e8 bcachefs: Data move path now uses bch2_trans_unlock_long()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:11 -04:00
Linus Torvalds
aea6bf908d f2fs update for 6.7-rc1
In this cycle, we introduce a bigger page size support by changing the internal
 f2fs's block size aligned to the page size. We also continue to improve zoned
 block device support regarding the power off recovery. As usual, there are some
 bug fixes regarding the error handling routines in compression and ioctl.
 
 Enhancement:
  - Support Block Size == Page Size
  - let f2fs_precache_extents() traverses in file range
  - stop iterating f2fs_map_block if hole exists
  - preload extent_cache for POSIX_FADV_WILLNEED
  - compress: fix to avoid fragment w/ OPU during f2fs_ioc_compress_file()
 
 Bug fix:
  - do not return EFSCORRUPTED, but try to run online repair
  - finish previous checkpoints before returning from remount
  - fix error handling of __get_node_page and __f2fs_build_free_nids
  - clean up zones when not successfully unmounted
  - fix to initialize map.m_pblk in f2fs_precache_extents()
  - fix to drop meta_inode's page cache in f2fs_put_super()
  - set the default compress_level on ioctl
  - fix to avoid use-after-free on dic
  - fix to avoid redundant compress extension
  - do sanity check on cluster when CONFIG_F2FS_CHECK_FS is on
  - fix deadloop in f2fs_write_cache_pages()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmVGlP4ACgkQQBSofoJI
 UNKahg/7BuiBi2+NZ93WeUoRURBNa4H1cK93GtyE4rPMdEjVPujxonnqzC/N2nx5
 1ZjBLEjV0mCXfqlm+D+ORkZDIeCO+PcPI+vLjjHG+qhphWsf9wUJnWIUf+LhmXDP
 p0ertHlZrSONG+HsbXl6BOEq/LWYTe6WDOwY0MMO4U7IxHVyKUuFJSL6DQRRL2r2
 Op4/NOes/cvj8dXvUZtyF3tQHyhkTbdPl8pfFtmagLlujC2WOuySKlfHUF8pxdv4
 RT3TER66Rs8IStrFtyuv6yHHtvpfl8jTuJ1DNaNphBCi2RlERvx7+zY3iyLIZgIZ
 zKxDkGUIb/UnKmoipiu/ZkUvJ6wi2IfP4C5hffAHBsxwyVlrSldCXioE4j5Ysbcm
 pOWjgB7ASyGVU6yxyUpoQUlW5oztPwqkjnhfeXu6cgHOLRWdw224hJ6bA4XU3E61
 R2rdaNrpGICwl3juFPxH7uHxNjnPTJB1G38LJApAypkoHoa/et5foZtK0pwSxMD+
 j8ZkczJvdAgkk6fbFzPA1Urg+6Qhx8SpDeNxCXIS/F7izPZhkvyZ45SBuWboJXld
 8rzel9JzswdB2dpeXm0VVofwBg9l8CG5b7mwzmgHpmfD2MjdstLKp9aO5G2SUSJt
 BKArSUfipzQhljv0xhous0OQitKLhg83o6+KMAXw3OpUlmWGLV4=
 =U4Ke
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we introduce a bigger page size support by changing the
  internal f2fs's block size aligned to the page size. We also continue
  to improve zoned block device support regarding the power off
  recovery. As usual, there are some bug fixes regarding the error
  handling routines in compression and ioctl.

  Enhancements:
   - Support Block Size == Page Size
   - let f2fs_precache_extents() traverses in file range
   - stop iterating f2fs_map_block if hole exists
   - preload extent_cache for POSIX_FADV_WILLNEED
   - compress: fix to avoid fragment w/ OPU during f2fs_ioc_compress_file()

  Bug fixes:
   - do not return EFSCORRUPTED, but try to run online repair
   - finish previous checkpoints before returning from remount
   - fix error handling of __get_node_page and __f2fs_build_free_nids
   - clean up zones when not successfully unmounted
   - fix to initialize map.m_pblk in f2fs_precache_extents()
   - fix to drop meta_inode's page cache in f2fs_put_super()
   - set the default compress_level on ioctl
   - fix to avoid use-after-free on dic
   - fix to avoid redundant compress extension
   - do sanity check on cluster when CONFIG_F2FS_CHECK_FS is on
   - fix deadloop in f2fs_write_cache_pages()"

* tag 'f2fs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: finish previous checkpoints before returning from remount
  f2fs: fix error handling of __get_node_page
  f2fs: do not return EFSCORRUPTED, but try to run online repair
  f2fs: fix error path of __f2fs_build_free_nids
  f2fs: Clean up errors in segment.h
  f2fs: clean up zones when not successfully unmounted
  f2fs: let f2fs_precache_extents() traverses in file range
  f2fs: avoid format-overflow warning
  f2fs: fix to initialize map.m_pblk in f2fs_precache_extents()
  f2fs: Support Block Size == Page Size
  f2fs: stop iterating f2fs_map_block if hole exists
  f2fs: preload extent_cache for POSIX_FADV_WILLNEED
  f2fs: set the default compress_level on ioctl
  f2fs: compress: fix to avoid fragment w/ OPU during f2fs_ioc_compress_file()
  f2fs: fix to drop meta_inode's page cache in f2fs_put_super()
  f2fs: split initial and dynamic conditions for extent_cache
  f2fs: compress: fix to avoid redundant compress extension
  f2fs: compress: do sanity check on cluster when CONFIG_F2FS_CHECK_FS is on
  f2fs: compress: fix to avoid use-after-free on dic
  f2fs: compress: fix deadloop in f2fs_write_cache_pages()
2023-11-04 09:26:23 -10:00
Linus Torvalds
c9b93cafb6 Bunch of small fixes:
- three W=1 warning fixes: the NULL -> "" replacement isn't trivial
 but is serialized identically by the protocol layer and has been tested
 - one syzbot/KCSAN datarace annotation where we don't care about users
 messing with the fd they passed to mount -t 9p
 - removing a declaration without implementation
 - yet another race fix for trans_fd around connection close:
 the 'err' field is also used in potentially racy calls and this
 isn't complete, but it's better than what we have.
 - and finally a theorical memory leak fix on serialization failure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmVFmAcACgkQq06b7GqY
 5nASEg/9Fdtd9B//G+t/gQm0pdZIGoV+KbovWRrO4w3IIfENe8OP3A+8uZhmdij4
 p1tYUfyOPjMTxvZlmEArYjyiIkB5SsdYVMiOY6WfFRo3XEyMirdLDXa7beSnFLAs
 wuk5rAlndGIg+HwzXFD4WpKU5By5r6hnMgYeQ98bwB1bsF5ENZL+0VctFOd1H1zQ
 zwFChfsv4do3LfA2BssseTaH3PAniqs/X8gvDfslCL45YhitunzrB5C8pCOLpiA6
 j6+7iDBeh7H/S+xTG2h+NBdPh+emalex1GHckZHNaTyEWNGcnJnOG8qPy+ufNgyX
 G1Sxr80dX7oAXD74LqPfoG4IsgXsB/0sLnjFNxbfpcmGoka/GTWzV0O+9LauhN9X
 Nu3OnbOBM+VZR3VA7EpqCEbf5CAduQklWameZHGqqH8vr3wGJSPXs3eEpTnXnXmK
 yIUdFWvTL61Il/RSk1CFjlsxXh42tjNqlJobiOru3+zhmwFw3yffFnq3Mxn1Y+Mi
 4jet/E1VD7NB6nlTqRuaMWbWBC037pMgXfFe4BGbAynKLPbnBD8fjDKkbxoU66Qw
 ud5UabfuikR1ah5te8l1F956ij88HgKFuufQ6vqc98be3EqcIkXSe0/+fgXMcU55
 hPMmfLAP4g5p5VEi6kbOZdVAWqaE6EO1x6EzrQ+VCEg5hHJx0Ew=
 =ZD9A
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.7-rc1' of https://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 A bunch of small fixes:

   - three W=1 warning fixes: the NULL -> "" replacement isn't trivial
     but is serialized identically by the protocol layer and has been
     tested

   - one syzbot/KCSAN datarace annotation where we don't care about
     users messing with the fd they passed to mount -t 9p

   - removing a declaration without implementation

   - yet another race fix for trans_fd around connection close: the
     'err' field is also used in potentially racy calls and this isn't
     complete, but it's better than what we had

   - and finally a theorical memory leak fix on serialization failure"

* tag '9p-for-6.7-rc1' of https://github.com/martinetd/linux:
  9p/net: fix possible memory leak in p9_check_errors()
  9p/fs: add MODULE_DESCRIPTION
  9p/net: xen: fix false positive printf format overflow warning
  9p: v9fs_listxattr: fix %s null argument warning
  9p/trans_fd: Annotate data-racy writes to file::f_flags
  fs/9p: Remove unused function declaration v9fs_inode2stat()
  9p/trans_fd: avoid sending req to a cancelled conn
2023-11-04 09:20:04 -10:00
Linus Torvalds
766e9cf3bd 15 cifs client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmVFoW8ACgkQiiy9cAdy
 T1HFtQwAvvJH7wsCuN/QYCMPbXJ150fgmgMs+FJeq/gMRJaRjeFLpR9NfO6DAYbA
 NraRyfoinxvBkftXVQkDUKn0xSdxz1JEZodq08+kvEZyIQp5z9Q5NTsSF4loyMlg
 jtRM6UtkiFmsaRrbLL+EguZiQkhV8/SyGtojxn5jozubXUZS1+llMZk56MbOxIdY
 RRGjndlxRxtJWP/xCHLwnGF7wRWqeiQhNPml1DoEVo2LOZK7OTqZP5cckwusj0YR
 nAaSffn3U2v5tVPOpHMZIZbAcRIawRTjv6CT+hMH5/9k1JOy4vc8Y92uiO7DLx2O
 uoI4fOs9hX3yWvJCe9QExvOjvmtWaAY8uhI8dGEo2mWvRQqdPBGcrVXpUqbVY1r0
 ffb6N+CJ5VZmbgIi57+3FUU2YuJnIxyWj0uQCJr7Q6E9sZiHx2IZ2Y3Yck5yy6Bk
 ZsZLQcB5xQiosCjt2xR9UmXv/bUCH3TwCY/TUPARkf+QQCo58DQokczy0UtEffzB
 rLxedNQz
 =zu9T
 -----END PGP SIGNATURE-----

Merge tag '6.7-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client updates from Steve French:

 - use after free fixes and deadlock fix

 - symlink timestamp fix

 - hashing perf improvement

 - multichannel fixes

 - minor debugging improvements

 - fix creating fifos when using "sfu" mounts

 - NTLMSSP authentication improvement

 - minor fixes to include some missing create flags and structures from
   recently updated protocol documentation

* tag '6.7-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: force interface update before a fresh session setup
  cifs: do not reset chan_max if multichannel is not supported at mount
  cifs: reconnect helper should set reconnect for the right channel
  smb: client: fix use-after-free in smb2_query_info_compound()
  smb: client: remove extra @chan_count check in __cifs_put_smb_ses()
  cifs: add xid to query server interface call
  cifs: print server capabilities in DebugData
  smb: use crypto_shash_digest() in symlink_hash()
  smb: client: fix use-after-free bug in cifs_debug_data_proc_show()
  smb: client: fix potential deadlock when releasing mids
  smb3: fix creating FIFOs when mounting with "sfu" mount option
  Add definition for new smb3.1.1 command type
  SMB3: clarify some of the unused CreateOption flags
  cifs: Add client version details to NTLM authenticate message
  smb3: fix touch -h of symlink
2023-11-04 09:13:50 -10:00
Linus Torvalds
4c975a43fa EFI update for v6.7
- implement uid/gid mount options for efivarfs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZUV51gAKCRAwbglWLn0t
 XBQVAP9a2PAeevQ9gA29rI+2caC9tpgcNPoiAsFiod8jrIymcwEAtdZAp98T8Wsc
 egjnvwNjzd2nTvrL1aZXKl4Id8jn2Qo=
 =VM67
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI update from Ard Biesheuvel:
 "This is the only remaining EFI change, as everything else was taken
  via -tip this cycle:

   - implement uid/gid mount options for efivarfs"

* tag 'efi-next-for-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efivarfs: Add uid/gid mount options
2023-11-04 08:54:20 -10:00
Kent Overstreet
c4accde498 bcachefs: Ensure srcu lock is not held too long
The SRCU read lock that btree_trans takes exists to make it safe for
bch2_trans_relock() to deref pointers to btree nodes/key cache items we
don't have locked, but as a side effect it blocks reclaim from freeing
those items.

Thus, it's important to not hold it for too long: we need to
differentiate between bch2_trans_unlock() calls that will be only for a
short duration, and ones that will be for an unbounded duration.

This introduces bch2_trans_unlock_long(), to be used mainly by the data
move paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 14:17:11 -04:00
Kent Overstreet
6dfa10ab22 bcachefs: Fix build errors with gcc 10
gcc 10 seems to complain about array bounds in situations where gcc 11
does not - curious.

This unfortunately requires adding some casts for now; we may
investigate getting rid of our __u64 _data[] VLA in a future patch so
that our start[0] members can be VLAs.

Reported-by: John Stoffel <john@stoffel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 14:17:11 -04:00
Kent Overstreet
4db8ac8629 bcachefs: Fix MEAN_AND_VARIANCE kconfig options
Fixes:

https://lore.kernel.org/linux-bcachefs/CAMuHMdXpwMdLuoWsNGa8qacT_5Wv-vSTz0xoBR5n_fnD9cNOuQ@mail.gmail.com/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 14:17:11 -04:00
Kent Overstreet
1f7056b735 bcachefs: Ensure copygc does not spin
If copygc does no work - finds no fragmented buckets - wait for a bit of
IO to happen.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 14:17:11 -04:00
Linus Torvalds
b06f58ad8e Driver core changes for 6.7-rc1
Here is the set of driver core updates for 6.7-rc1.  Nothing major in
 here at all, just a small number of changes including:
   - minor cleanups and updates from Andy Shevchenko
   - __counted_by addition
   - firmware_loader update for aborting loads cleaner
   - other minor changes, details in the shortlog
   - documentation update
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZUTe8A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynP0QCfT2jQx3OcL22MoqCvdTuZJKPiHSIAoMxrliJF
 d4cUeICW17ywlTFzsKg8
 =nTeu
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core updates for 6.7-rc1. Nothing major in
  here at all, just a small number of changes including:

   - minor cleanups and updates from Andy Shevchenko

   - __counted_by addition

   - firmware_loader update for aborting loads cleaner

   - other minor changes, details in the shortlog

   - documentation update

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
  firmware_loader: Abort all upcoming firmware load request once reboot triggered
  firmware_loader: Refactor kill_pending_fw_fallback_reqs()
  Documentation: security-bugs.rst: linux-distros relaxed their rules
  driver core: Release all resources during unbind before updating device links
  driver core: class: remove boilerplate code
  driver core: platform: Annotate struct irq_affinity_devres with __counted_by
  resource: Constify resource crosscheck APIs
  resource: Unify next_resource() and next_resource_skip_children()
  resource: Reuse for_each_resource() macro
  PCI: Implement custom llseek for sysfs resource entries
  kernfs: sysfs: support custom llseek method for sysfs entries
  debugfs: Fix __rcu type comparison warning
  device property: Replace custom implementation of COUNT_ARGS()
  drivers: base: test: Make property entry API test modular
  driver core: Add missing parameter description to __fwnode_link_add()
  device property: Clarify usage scope of some struct fwnode_handle members
  devres: rename the first parameter of devm_add_action(_or_reset)
  driver core: platform: Unify the firmware node type check
  driver core: platform: Use temporary variable in platform_device_add()
  driver core: platform: Refactor error path in a couple places
  ...
2023-11-03 15:15:47 -10:00
Christian Brauner
56d2e2cfa2 ceph: allow idmapped mounts
Now that we converted cephfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
8a051b40ab ceph: allow idmapped atomic_open inode op
Enable ceph_atomic_open() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to 5fadbd9929 ("ceph: rely on vfs for
  setgid stripping") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
2cce72dda2 ceph: allow idmapped set_acl inode op
Enable ceph_set_acl() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
a04aff2588 ceph: allow idmapped setattr inode op
Enable __ceph_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to b27c82e12965 ("attr: port attribute
  changes to new types") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn
79c66a0c8c ceph: pass idmap to __ceph_setattr
Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
8995375fae ceph: allow idmapped permission inode op
Enable ceph_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
0513043ec4 ceph: allow idmapped getattr inode op
Enable ceph_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
09838f1bfd ceph: pass an idmapping to mknod/symlink/mkdir
Enable mknod/symlink/mkdir iops to handle idmapped mounts.
This is just a matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn
673478b6e5 ceph: add enable_unsafe_idmap module parameter
This parameter is used to decide if we allow
to perform IO on idmapped mount in case when MDS lacks
support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature.

In this case we can't properly handle MDS permission
checks and if UID/GID-based restrictions are enabled
on the MDS side then IO requests which go through an
idmapped mount may fail with -EACCESS/-EPERM.
Fortunately, for most of users it's not a case and
everything should work fine. But we put work "unsafe"
in the module parameter name to warn users about
possible problems with this feature and encourage
update of cephfs MDS.

Suggested-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner
5ccd8530dd ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.

In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmaping. If this is a request message that was triggered from an inode
operation that doens't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.

This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].

[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/

Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner
9c2df2271c ceph: stash idmapping in mdsc request
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.

If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.

In order to support idmapped mounts for cephfs we stash the idmapping
whenever they are relevant for the operation for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.

In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Alexander Mikhalitsyn
1b90344614 fs: export mnt_idmap_get/mnt_idmap_put
These helpers are required to support idmapped mounts in CephFS.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
522dc5108f libceph, ceph: move mdsmap.h to fs/ceph
The mdsmap.h is only used by CephFS, so move it to fs/ceph.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
38d46409c4 ceph: print cluster fsid and client global_id in all debug logs
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
5995d90d2d ceph: rename _to_client() to _to_fs_client()
We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
197b7d792d ceph: pass the mdsc to several helpers
We will use the 'mdsc' to get the global_id in the following commits.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Linus Torvalds
31e5f934ff Tracing updates for v6.7:
- Remove eventfs_file descriptor
 
   This is the biggest change, and the second part of making eventfs
   create its files dynamically.
 
   In 6.6 the first part was added, and that maintained a one to one
   mapping between eventfs meta descriptors and the directories and
   file inodes and dentries that were dynamically created. The
   directories were represented by a eventfs_inode and the files
   were represented by a eventfs_file.
 
   In v6.7 the eventfs_file is removed. As all events have the same
   directory make up (sched_switch has an "enable", "id", "format",
   etc files), the handing of what files are underneath each leaf
   eventfs directory is moved back to the tracing subsystem via a
   callback. When a event is added to the eventfs, it registers
   an array of evenfs_entry's. These hold the names of the files and
   the callbacks to call when the file is referenced. The callback gets
   the name so that the same callback may be used by multiple files.
   The callback then supplies the filesystem_operations structure needed
   to create this file.
 
   This has brought the memory footprint of creating multiple eventfs
   instances down by 2 megs each!
 
 - User events now has persistent events that are not associated
   to a single processes. These are privileged events that hang around
   even if no process is attached to them.
 
 - Clean up of seq_buf.
   There's talk about using seq_buf more to replace strscpy() and friends.
   But this also requires some minor modifications of seq_buf to be
   able to do this.
 
 - Expand instance ring buffers individually
   Currently if boot up creates an instance, and a trace event is
   enabled on that instance, the ring buffer for that instance and the
   top level ring buffer are expanded (1.4 MB per CPU). This wastes
   memory as this happens when nothing is using the top level instance.
 
 - Other minor clean ups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZUMrBBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quzVAQCed/kPM7X9j2QZamJVDruMf2CmVxpu
 /TOvKvSKV584GgEAxLntf5VKx1Q98bc68y3Zkg+OCi8jSgORos1ROmURhws=
 =iIgb
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Remove eventfs_file descriptor

   This is the biggest change, and the second part of making eventfs
   create its files dynamically.

   In 6.6 the first part was added, and that maintained a one to one
   mapping between eventfs meta descriptors and the directories and file
   inodes and dentries that were dynamically created. The directories
   were represented by a eventfs_inode and the files were represented by
   a eventfs_file.

   In v6.7 the eventfs_file is removed. As all events have the same
   directory make up (sched_switch has an "enable", "id", "format", etc
   files), the handing of what files are underneath each leaf eventfs
   directory is moved back to the tracing subsystem via a callback.

   When an event is added to the eventfs, it registers an array of
   evenfs_entry's. These hold the names of the files and the callbacks
   to call when the file is referenced. The callback gets the name so
   that the same callback may be used by multiple files. The callback
   then supplies the filesystem_operations structure needed to create
   this file.

   This has brought the memory footprint of creating multiple eventfs
   instances down by 2 megs each!

 - User events now has persistent events that are not associated to a
   single processes. These are privileged events that hang around even
   if no process is attached to them

 - Clean up of seq_buf

   There's talk about using seq_buf more to replace strscpy() and
   friends. But this also requires some minor modifications of seq_buf
   to be able to do this

 - Expand instance ring buffers individually

   Currently if boot up creates an instance, and a trace event is
   enabled on that instance, the ring buffer for that instance and the
   top level ring buffer are expanded (1.4 MB per CPU). This wastes
   memory as this happens when nothing is using the top level instance

 - Other minor clean ups and fixes

* tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
  seq_buf: Export seq_buf_puts()
  seq_buf: Export seq_buf_putc()
  eventfs: Use simple_recursive_removal() to clean up dentries
  eventfs: Remove special processing of dput() of events directory
  eventfs: Delete eventfs_inode when the last dentry is freed
  eventfs: Hold eventfs_mutex when calling callback functions
  eventfs: Save ownership and mode
  eventfs: Test for ei->is_freed when accessing ei->dentry
  eventfs: Have a free_ei() that just frees the eventfs_inode
  eventfs: Remove "is_freed" union with rcu head
  eventfs: Fix kerneldoc of eventfs_remove_rec()
  tracing: Have the user copy of synthetic event address use correct context
  eventfs: Remove extra dget() in eventfs_create_events_dir()
  tracing: Have trace_event_file have ref counters
  seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
  eventfs: Fix typo in eventfs_inode union comment
  eventfs: Fix WARN_ON() in create_file_dentry()
  powerpc: Remove initialisation of readpos
  tracing/histograms: Simplify last_cmd_set()
  seq_buf: fix a misleading comment
  ...
2023-11-03 07:41:18 -10:00
Yuezhang Mo
1373ca10ec exfat: fix ctime is not updated
Commit 4c72a36edd54 ("exfat: convert to new timestamp accessors")
removed attr_copy() from exfat_set_attr().
It causes xfstests generic/221 to fail. In xfstests generic/221,
it tests ctime should be updated even if futimens() update atime
only. But in this case, ctime will not be updated if attr_copy()
is removed.

attr_copy() may also update other attributes, and removing it may
cause other bugs, so this commit restores to call attr_copy() in
exfat_set_attr().

Fixes: 4c72a36edd54 ("exfat: convert to new timestamp accessors")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2023-11-03 22:24:11 +09:00
Yuezhang Mo
fc12a722e6 exfat: fix setting uninitialized time to ctime/atime
An uninitialized time is set to ctime/atime in __exfat_write_inode().
It causes xfstests generic/003 and generic/192 to fail.

And since there will be a time gap between setting ctime/atime to
the inode and writing back the inode, so ctime/atime should not be
set again when writing back the inode.

Fixes: 4c72a36edd54 ("exfat: convert to new timestamp accessors")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2023-11-03 22:24:07 +09:00
Linus Torvalds
8f6f76a6a2 As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs.
 
 The lengthier patch series are
 
 - "kdump: use generic functions to simplify crashkernel reservation in
   arch", from Baoquan He.  This is mainly cleanups and consolidation of
   the "crashkernel=" kernel parameter handling.
 
 - After much discussion, David Laight's "minmax: Relax type checks in
   min() and max()" is here.  Hopefully reduces some typecasting and the
   use of min_t() and max_t().
 
 - A group of patches from Oleg Nesterov which clean up and slightly fix
   our handling of reads from /proc/PID/task/...  and which remove
   task_struct.therad_group.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
 jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
 =On4u
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "As usual, lots of singleton and doubleton patches all over the tree
  and there's little I can say which isn't in the individual changelogs.

  The lengthier patch series are

   - 'kdump: use generic functions to simplify crashkernel reservation
     in arch', from Baoquan He. This is mainly cleanups and
     consolidation of the 'crashkernel=' kernel parameter handling

   - After much discussion, David Laight's 'minmax: Relax type checks in
     min() and max()' is here. Hopefully reduces some typecasting and
     the use of min_t() and max_t()

   - A group of patches from Oleg Nesterov which clean up and slightly
     fix our handling of reads from /proc/PID/task/... and which remove
     task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02 20:53:31 -10:00
Linus Torvalds
ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds
bc3012f4e3 This update includes the following changes:
API:
 
 - Add virtual-address based lskcipher interface.
 - Optimise ahash/shash performance in light of costly indirect calls.
 - Remove ahash alignmask attribute.
 
 Algorithms:
 
 - Improve AES/XTS performance of 6-way unrolling for ppc.
 - Remove some uses of obsolete algorithms (md4, md5, sha1).
 - Add FIPS 202 SHA-3 support in pkcs1pad.
 - Add fast path for single-page messages in adiantum.
 - Remove zlib-deflate.
 
 Drivers:
 
 - Add support for S4 in meson RNG driver.
 - Add STM32MP13x support in stm32.
 - Add hwrng interface support in qcom-rng.
 - Add support for deflate algorithm in hisilicon/zip.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmVB3vgACgkQxycdCkmx
 i6dsOBAAykbnX8BpnpnOXYywE9ZWrl98rAk51MK0N9olZNfg78zRPIv7fFxFdC20
 SDJrDSNPmn0Qvaa5e0EfoAdklsm0k2GkXL/BwPKMKWUsyIoJVYI3WrBMnjBy9xMp
 yfME+h0bKoXJCZKnYkIUSGUejmUPSyRlEylrXoFlH/VWYwAaii/x9zwreQoF+0LR
 KI24A1q8AYs6Dw9HSfndaAub9GOzrqKYs6fSaMG+77Y4UC5aoi5J9Bp2G3uVyHay
 x/0bZtIxKXS9wn+LeG/3GspX23x/I5VwBOdAoMigrYmAIaIg5qgyMszudltTAs4R
 zF1Kh7WsnM5+vpnBSeigzo+/GGOU3QTz8y3tBTg+3ZR7GWGOwQLiizhOYqCyOfAH
 pIm6c++sZw/OOHiL69Nt4HeLKzGNYYWk3s4X/B/6cqoouPfOsfBaQobZNx9zfy7q
 ZNEvSVBjrFX/L6wDSotny1LTWLUNjHbmLaMV5uQZ/SQKEtv19fp2Dl7SsLkHH+3v
 ldOAwfoJR6QcSwz3Ez02TUAvQhtP172Hnxi7u44eiZu2aUboLhCFr7aEU6kVdBCx
 1rIRVHD1oqlOEDRwPRXzhF3I8R4QDORJIxZ6UUhg7yueuI+XCGDsBNC+LqBrBmSR
 IbdjqmSDUBhJyM5yMnt1VFYhqKQ/ZzwZ3JQviwW76Es9pwEIolM=
 =IZmR
 -----END PGP SIGNATURE-----

Merge tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add virtual-address based lskcipher interface
   - Optimise ahash/shash performance in light of costly indirect calls
   - Remove ahash alignmask attribute

  Algorithms:
   - Improve AES/XTS performance of 6-way unrolling for ppc
   - Remove some uses of obsolete algorithms (md4, md5, sha1)
   - Add FIPS 202 SHA-3 support in pkcs1pad
   - Add fast path for single-page messages in adiantum
   - Remove zlib-deflate

  Drivers:
   - Add support for S4 in meson RNG driver
   - Add STM32MP13x support in stm32
   - Add hwrng interface support in qcom-rng
   - Add support for deflate algorithm in hisilicon/zip"

* tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits)
  crypto: adiantum - flush destination page before unmapping
  crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place
  Documentation/module-signing.txt: bring up to date
  module: enable automatic module signing with FIPS 202 SHA-3
  crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures
  crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support
  crypto: FIPS 202 SHA-3 register in hash info for IMA
  x509: Add OIDs for FIPS 202 SHA-3 hash and signatures
  crypto: ahash - optimize performance when wrapping shash
  crypto: ahash - check for shash type instead of not ahash type
  crypto: hash - move "ahash wrapping shash" functions to ahash.c
  crypto: talitos - stop using crypto_ahash::init
  crypto: chelsio - stop using crypto_ahash::init
  crypto: ahash - improve file comment
  crypto: ahash - remove struct ahash_request_priv
  crypto: ahash - remove crypto_ahash_alignmask
  crypto: gcm - stop using alignmask of ahash
  crypto: chacha20poly1305 - stop using alignmask of ahash
  crypto: ccm - stop using alignmask of ahash
  net: ipv6: stop checking crypto_ahash_alignmask
  ...
2023-11-02 16:15:30 -10:00
Andreas Gruenbacher
92099f0c92 gfs2: Add metapath_dibh helper
Add a metapath_dibh() helper for extracting the inode's buffer head from
a metapath.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-11-02 20:10:00 +01:00