Commit Graph

90663 Commits

Author SHA1 Message Date
Kent Overstreet
ca1e02f7e9 bcachefs: Etyzinger cleanups
Pull out eytzinger.c and kill eytzinger_cmp_fn. We now provide
eytzinger0_sort and eytzinger0_sort_r, which use the standard cmp_func_t
and cmp_r_func_t callbacks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
bdbf953b3c bcachefs: bch2_shoot_down_journal_keys()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
27fcec6c27 bcachefs: Clear recovery_passes_required as they complete without errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Linus Torvalds
c85af715ca vboxsf fixes for v6.9-1
Highlights:
 - Compiler warning fixes
 - Explicitly deny setlease attempts
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmYNemEUHGhkZWdvZWRl
 QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9wc1Af/fqfUUusaYW408D3PukjUaOVF+0wo
 6wluwCxy/DEMBxIQbGACwYoQuULHkgyK5chcEZvdB56vullqePCwOKeJUeKs75MR
 HzG9NLs2qIN9WJ6cSHTQlBzvVIK7WV64BDtauD8FH3Afa5c5ojr1JqEAxebnlonI
 cmFUm5x1TlMQryXcY8rPU9sdeaowlNiE/g7qRNqRfsjCGz2zWJdtjskf8YjOY5yB
 KqulZnye04dEb6Wp8fGuNWauUAJ6gTwSJxlcPU0oHv+fRaYebnqTZZaJrg5kKF4a
 SF4llaPM3d714udHOZP3Ro2K+SRoj5jUNSfO7jxNNk6DZ4xB47iXqNJ/Sw==
 =HDgT
 -----END PGP SIGNATURE-----

Merge tag 'vboxsf-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux

Pull vboxsf fixes from Hans de Goede:

 - Compiler warning fixes

 - Explicitly deny setlease attempts

* tag 'vboxsf-v6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
  vboxsf: explicitly deny setlease attempts
  vboxsf: Remove usage of the deprecated ida_simple_xx() API
  vboxsf: Avoid an spurious warning if load_nls_xxx() fails
  vboxsf: remove redundant variable out_len
2024-04-03 10:30:52 -07:00
Roberto Sassu
701b38995e security: Place security_path_post_mknod() where the original IMA call was
Commit 08abce60d6 ("security: Introduce path_post_mknod hook")
introduced security_path_post_mknod(), to replace the IMA-specific call
to ima_post_path_mknod().

For symmetry with security_path_mknod(), security_path_post_mknod() was
called after a successful mknod operation, for any file type, rather
than only for regular files at the time there was the IMA call.

However, as reported by VFS maintainers, successful mknod operation does
not mean that the dentry always has an inode attached to it (for
example, not for FIFOs on a SAMBA mount).

If that condition happens, the kernel crashes when
security_path_post_mknod() attempts to verify if the inode associated to
the dentry is private.

Move security_path_post_mknod() where the ima_post_path_mknod() call was,
which is obviously correct from IMA/EVM perspective. IMA/EVM are the only
in-kernel users, and only need to inspect regular files.

Reported-by: Steve French <smfrench@gmail.com>
Closes: https://lore.kernel.org/linux-kernel/CAH2r5msAVzxCUHHG8VKrMPUKQHmBpE6K9_vjhgDa1uAvwx4ppw@mail.gmail.com/
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 08abce60d6 ("security: Introduce path_post_mknod hook")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-03 10:21:32 -07:00
Jeff Layton
1ece2c43b8 vboxsf: explicitly deny setlease attempts
vboxsf does not break leases on its own, so it can't properly handle the
case where the hypervisor changes the data. Don't allow file leases on
vboxsf.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240319-setlease-v1-1-5997d67e04b3@kernel.org
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:06:39 +02:00
Christophe JAILLET
0141d68f86 vboxsf: Remove usage of the deprecated ida_simple_xx() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

This is less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/b3c057c86b73f0309a6362031d21f4d7ebb60587.1698835730.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:06:11 +02:00
Christophe JAILLET
de3f64b738 vboxsf: Avoid an spurious warning if load_nls_xxx() fails
If an load_nls_xxx() function fails a few lines above, the 'sbi->bdi_id' is
still 0.
So, in the error handling path, we will call ida_simple_remove(..., 0)
which is not allocated yet.

In order to prevent a spurious "ida_free called for id=0 which is not
allocated." message, tweak the error handling path and add a new label.

Fixes: 0fd1695766 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/d09eaaa4e2e08206c58a1a27ca9b3e81dc168773.1698835730.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 16:05:51 +02:00
Colin Ian King
0200ceed30 vboxsf: remove redundant variable out_len
The variable out_len is being used to accumulate the number of
bytes but it is not being used for any other purpose. The variable
is redundant and can be removed.

Cleans up clang scan build warning:
fs/vboxsf/utils.c:443:9: warning: variable 'out_len' set but not
used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240229225138.351909-1-colin.i.king@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-04-03 15:55:33 +02:00
Kent Overstreet
fa14b50460 bcachefs: ratelimit informational fsck errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 20:24:00 -04:00
Kent Overstreet
7ee88737ab bcachefs: Check for bad needs_discard before doing discard
In the discard worker, we were failing to validate the bucket state -
meaning a corrupt needs_discard btree could cause us to discard a bucket
that we shouldn't.

If check_alloc_info hasn't run yet we just want to bail out, otherwise
it's a filesystem inconsistent error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 20:24:00 -04:00
Kent Overstreet
e0319af2b6 bcachefs: Improve bch2_btree_update_to_text()
Print out the mode as a string, and also print out the btree and
watermark.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 17:13:46 -04:00
Guenter Roeck
97ca7c1f93 mean_and_variance: Drop always failing tests
mean_and_variance_test_2 and mean_and_variance_test_4 always fail.
The input parameters to those tests are identical to the input parameters
to tests 1 and 3, yet the expected result for tests 2 and 4 is different
for the mean and stddev tests. That will always fail.

     Expected mean_and_variance_get_mean(mv) == mean[i], but
        mean_and_variance_get_mean(mv) == 22 (0x16)
        mean[i] == 10 (0xa)

Drop the bad tests.

Fixes: 65bc410907 ("mean and variance: More tests")
Closes: https://lore.kernel.org/lkml/065b94eb-6a24-4248-b7d7-d3212efb4787@roeck-us.net/
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 14:45:08 -04:00
Boris Burkov
6e68de0bb0 btrfs: always clear PERTRANS metadata during commit
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:19:13 +02:00
Boris Burkov
3c6f0c5ecc btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around.  This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.

Helps with the pass rate of generic/269 and generic/475.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:19:08 +02:00
Boris Burkov
211de93367 btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
The transaction is only able to free PERTRANS reservations for a root
once that root has been recorded with the TRANS tag on the roots radix
tree. Therefore, until we are sure that this root will get tagged, it
isn't safe to convert. Generally, this is not an issue as *some*
transaction will likely tag the root before long and this reservation
will get freed in that transaction, but technically it could stick
around until unmount and result in a warning about leaked metadata
reservation space.

This path is most exercised by running the generic/269 fstest with
CONFIG_BTRFS_DEBUG.

Fixes: a649684967 ("btrfs: fix start transaction qgroup rsv double free")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:47 +02:00
Boris Burkov
71537e35c3 btrfs: record delayed inode root in transaction
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.

Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:33 +02:00
Boris Burkov
74e9795812 btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:

1. any failure before record_root_in_trans is called successfully
   results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
   transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:23 +02:00
Boris Burkov
141fb8cd20 btrfs: qgroup: correctly model root qgroup rsv in convert
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.

Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.

Fixes: 8287475a20 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:04 +02:00
Paulo Alcantara
93cee45ccf smb: client: serialise cifs_construct_tcon() with cifs_mount_mutex
Serialise cifs_construct_tcon() with cifs_mount_mutex to handle
parallel mounts that may end up reusing the session and tcon created
by it.

Cc: stable@vger.kernel.org # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:12:22 -05:00
Paulo Alcantara
4a5ba0e0bf smb: client: handle DFS tcons in cifs_construct_tcon()
The tcons created by cifs_construct_tcon() on multiuser mounts must
also be able to failover and refresh DFS referrals, so set the
appropriate fields in order to get a full DFS tcon.  They could be
shared among different superblocks later, too.

Cc: stable@vger.kernel.org # 6.4+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404021518.3Xu2VU4s-lkp@intel.com/
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:12:13 -05:00
Paulo Alcantara
0a05ad21d7 smb: client: refresh referral without acquiring refpath_lock
Avoid refreshing DFS referral with refpath_lock acquired as the I/O
could block for a while due to a potentially disconnected or slow DFS
root server and then making other threads - that use same @server and
don't require a DFS root server - unable to make any progress.

Cc: stable@vger.kernel.org # 6.4+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:11:43 -05:00
Paulo Alcantara
062a7f0ff4 smb: client: guarantee refcounted children from parent session
Avoid potential use-after-free bugs when walking DFS referrals,
mounting and performing DFS failover by ensuring that all children
from parent @tcon->ses are also refcounted.  They're all needed across
the entire DFS mount.  Get rid of @tcon->dfs_ses_list while we're at
it, too.

Cc: stable@vger.kernel.org # 6.4+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404021527.ZlRkIxgv-lkp@intel.com/
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 10:09:57 -05:00
David Howells
e9e62243a3 cifs: Fix caching to try to do open O_WRONLY as rdwr on server
When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this by the following: if a file is opened O_WRONLY locally, but the
mount was given the "-o fsc" flag, try first opening the remote file with
GENERIC_READ|GENERIC_WRITE and if that returns -EACCES, try dropping the
GENERIC_READ and doing the open again.  If that last succeeds, invalidate
the cache for that file as for O_DIRECT.

Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:29:55 -05:00
Namjae Jeon
5ed11af19e ksmbd: do not set SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
SMB2_GLOBAL_CAP_ENCRYPTION flag should be used only for 3.0 and
3.0.2 dialects. This flags set cause compatibility problems with
other SMB clients.

Reported-by: James Christopher Adduono <jc@adduono.com>
Tested-by: James Christopher Adduono <jc@adduono.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:21:25 -05:00
Namjae Jeon
a677ebd8ca ksmbd: validate payload size in ipc response
If installing malicious ksmbd-tools, ksmbd.mountd can return invalid ipc
response to ksmbd kernel server. ksmbd should validate payload size of
ipc response from ksmbd.mountd to avoid memory overrun or
slab-out-of-bounds. This patch validate 3 ipc response that has payload.

Cc: stable@vger.kernel.org
Reported-by: Chao Ma <machao2019@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:21:01 -05:00
Namjae Jeon
c1832f6703 ksmbd: don't send oplock break if rename fails
Don't send oplock break if rename fails. This patch fix
smb2.oplock.batch20 test.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-02 09:20:35 -05:00
Kent Overstreet
c42cd606e4 bcachefs: fix nocow lock deadlock
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-02 01:04:10 -04:00
Kent Overstreet
e2a316b3cc bcachefs: BCH_WATERMARK_interior_updates
This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.

One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 21:14:02 -04:00
Kent Overstreet
ba947ecd39 bcachefs: Fix btree node reserve
Sign error when checking the watermark - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 21:14:02 -04:00
Paulo Alcantara
24a9799aa8 smb: client: fix UAF in smb2_reconnect_server()
The UAF bug is due to smb2_reconnect_server() accessing a session that
is already being teared down by another thread that is executing
__cifs_put_smb_ses().  This can happen when (a) the client has
connection to the server but no session or (b) another thread ends up
setting @ses->ses_status again to something different than
SES_EXITING.

To fix this, we need to make sure to unconditionally set
@ses->ses_status to SES_EXITING and prevent any other threads from
setting a new status while we're still tearing it down.

The following can be reproduced by adding some delay to right after
the ipc is freed in __cifs_put_smb_ses() - which will give
smb2_reconnect_server() worker a chance to run and then accessing
@ses->ipc:

kinit ...
mount.cifs //srv/share /mnt/1 -o sec=krb5,nohandlecache,echo_interval=10
[disconnect srv]
ls /mnt/1 &>/dev/null
sleep 30
kdestroy
[reconnect srv]
sleep 10
umount /mnt/1
...
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
general protection fault, probably for non-canonical address
0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc2 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__list_del_entry_valid_or_report+0x33/0xf0
Code: 4f 08 48 85 d2 74 42 48 85 c9 74 59 48 b8 00 01 00 00 00 00 ad
de 48 39 c2 74 61 48 b8 22 01 00 00 00 00 74 69 <48> 8b 01 48 39 f8 75
7b 48 8b 72 08 48 39 c6 0f 85 88 00 00 00 b8
RSP: 0018:ffffc900001bfd70 EFLAGS: 00010a83
RAX: dead000000000122 RBX: ffff88810da53838 RCX: 6b6b6b6b6b6b6b6b
RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc02f6878 RDI: ffff88810da53800
RBP: ffff88810da53800 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88810c064000
R13: 0000000000000001 R14: ffff88810c064000 R15: ffff8881039cc000
FS: 0000000000000000(0000) GS:ffff888157c00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe3728b1000 CR3: 000000010caa4000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? die_addr+0x36/0x90
 ? exc_general_protection+0x1c1/0x3f0
 ? asm_exc_general_protection+0x26/0x30
 ? __list_del_entry_valid_or_report+0x33/0xf0
 __cifs_put_smb_ses+0x1ae/0x500 [cifs]
 smb2_reconnect_server+0x4ed/0x710 [cifs]
 process_one_work+0x205/0x6b0
 worker_thread+0x191/0x360
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xe2/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x34/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-01 13:39:20 -05:00
Andrey Albershteyn
e23d7e82b7 xfs: allow cross-linking special files without project quota
There's an issue that if special files is created before quota
project is enabled, then it's not possible to link this file. This
works fine for normal files. This happens because xfs_quota skips
special files (no ioctls to set necessary flags). The check for
having the same project ID for source and destination then fails as
source file doesn't have any ID.

mkfs.xfs -f /dev/sda
mount -o prjquota /dev/sda /mnt/test

mkdir /mnt/test/foo
mkfifo /mnt/test/foo/fifo1

xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> Setting up project 9 (path /mnt/test/foo)...
> xfs_quota: skipping special file /mnt/test/foo/fifo1
> Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).

ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link

mkfifo /mnt/test/foo/fifo2
ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link

Fix this by allowing linking of special files to the project quota
if special files doesn't have any ID set (ID = 0).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-01 11:55:49 +05:30
Kent Overstreet
b3c7fd35c0 bcachefs: On emergency shutdown, print out current journal sequence number
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 01:07:24 -04:00
Kent Overstreet
eab3a3ce2d bcachefs: Fix overlapping extent repair
overlapping extent repair was colliding with extent past end of inode
checks - don't update "extent ends at" until we know we have an extent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 01:05:50 -04:00
Kent Overstreet
8ce1db8091 bcachefs: Fix remove_dirent()
We were missing an iter_traverse().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 00:52:32 -04:00
Kent Overstreet
cecfed9b44 bcachefs: Logged op errors should be ignored
If something is wrong with a logged op, we just want to delete it -
there's nothing to repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-01 00:04:10 -04:00
Kent Overstreet
13c1e583f9 bcachefs: Improve -o norecovery; opts.recovery_pass_limit
This adds opts.recovery_pass_limit, and redoes -o norecovery to make use
of it; this fixes some issues with -o norecovery so it can be safely
used for data recovery.

Norecovery means "don't do journal replay"; it's an important data
recovery tool when we're getting stuck in journal replay.

When using it this way we need to make sure we don't free journal keys
after startup, so we continue to overlay them: thus it needs to imply
retain_recovery_info, as well as nochanges.

recovery_pass_limit is an explicit option for telling recovery to exit
after a specific recovery pass; this is a much cleaner way of
implementing -o norecovery, as well as being a useful debug feature in
its own right.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
060ff30a85 bcachefs: bch2_run_explicit_recovery_pass_persistent()
Flag that we need to run a recovery pass and run it - persistenly, so if
we crash it'll still get run.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
0a34c058fc bcachefs: Ensure bch_sb_field_ext always exists
This makes bch_sb_field_ext more consistent with the rest of -o
nochanges - we don't want to be varying other codepaths based on -o
nochanges, since it's used for testing in dry run mode; also fixes some
potential null ptr derefs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
4fe0eeeae4 bcachefs: Flush journal immediately after replay if we did early repair
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
af855a5f5e bcachefs: Resume logged ops after fsck
Finishing logged ops requires the filesystem to be in a reasonably
consistent state - and other fsck passes don't require it to have
completed, so just run it last.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:12 -04:00
Kent Overstreet
e5aa804641 bcachefs: Add error messages to logged ops fns
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
d2554263ad bcachefs: Split out recovery_passes.c
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
11d5568d3e bcachefs: fix backpointer for missing alloc key msg
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
7f9e508036 bcachefs: Fix bch2_btree_increase_depth()
When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
47d2080e30 bcachefs: Kill bch2_bkey_ptr_data_type()
Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
dcc1c04587 bcachefs: Fix use after free in check_root_trans()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
83bb585390 bcachefs: Fix repair path for missing indirect extents
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
6f5869ffd9 bcachefs: Fix use after free in bch2_check_fix_ptrs()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
812a929793 bcachefs: Fix btree node keys accounting in topology repair path
When dropping keys now outside a now because we're changing the node
min/max, we need to redo the node's accounting as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
805b535a8a bcachefs: Check btree ptr min_key in .invalid
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
zhuxiaohui
bb66009958 bcachefs: add REQ_SYNC and REQ_IDLE in write dio
when writing file with direct_IO on bcachefs, then performance is
much lower than other fs due to write back throttle in block layer:

        wbt_wait+1
        __rq_qos_throttle+32
        blk_mq_submit_bio+394
        submit_bio_noacct_nocheck+649
        bch2_submit_wbio_replicas+538
        __bch2_write+2539
        bch2_direct_write+1663
        bch2_write_iter+318
        aio_write+355
        io_submit_one+1224
        __x64_sys_io_submit+169
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+110

add set REQ_SYNC and REQ_IDLE in bio->bi_opf as standard dirct-io

Signed-off-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
79032b0781 bcachefs: Improved topology repair checks
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().

This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.

Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.

Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.

In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
40cb26233a bcachefs: Be careful about btree node splits during journal replay
Don't pick a pivot that's going to be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
048f47e83f bcachefs: btree_and_journal_iter now respects trans->journal_replay_not_finished
btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Hongbo Li
36f9ef109b bcachefs: fix trans->mem realloc in __bch2_trans_kmalloc
The old code doesn't consider the mem alloced from mempool when call
krealloc on trans->mem. Also in bch2_trans_put, using mempool_free to
free trans->mem by condition "trans->mem_bytes == BTREE_TRANS_MEM_MAX"
is inaccurate when trans->mem was allocated by krealloc function.
Instead, we use used_mempool stuff to record the situation, and realloc
or free the trans->mem in elegant way.

Also, after krealloc failed in __bch2_trans_kmalloc, the old data
should be copied to the new buffer when alloc from mempool_alloc.

Fixes: 31403dca5b ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
57339b24a0 bcachefs: Don't do extent merging before journal replay is finished
We don't normally do extent updates this early in recovery, but some of
the repair paths have to and when we do, we don't want to do anything
that requires the snapshots table.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
ec9cc18fc2 bcachefs: Add checks for invalid snapshot IDs
Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.

This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
63332394c7 bcachefs: Move snapshot table size to struct snapshot_table
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
aa6e130e3c bcachefs: Add an assertion for trying to evict btree root
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
4bd02d3fb3 bcachefs: fix mount error path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Thomas Bertschinger
688d750d10 bcachefs: fix misplaced newline in __bch2_inode_unpacked_to_text()
before:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0

  bi_version=0bi_atime=227064388944
  ...

after:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0
  bi_version=0
  bi_atime=227064388944
  ...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Kent Overstreet
8aad8e1f65 bcachefs: Fix journal pins in btree write buffer
btree write buffer flush has two phases
 - in natural key order, which is more efficient but may fail
 - then in journal order

The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Kent Overstreet
a5e3dce493 bcachefs: Fix assert in bch2_backpointer_invalid()
Backpointers that point to invalid devices are caught by fsck, not
.key_invalid; so .key_invalid needs to check for them instead of hitting
asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:10 -04:00
Justin Stitt
ebd9779683 smb: client: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

In cifssmb.c:
Using strncpy with a length argument equal to strlen(src) is generally
dangerous because it can cause string buffers to not be NUL-terminated.
In this case, however, there was extra effort made to ensure the buffer
was NUL-terminated via a manual NUL-byte assignment. In an effort to rid
the kernel of strncpy() use, let's swap over to using strscpy() which
guarantees NUL-termination on the destination buffer.

To handle the case where ea_name is NULL, let's use the ?: operator to
substitute in an empty string, thereby allowing strscpy to still
NUL-terminate the destintation string.

Interesting note: this flex array buffer may go on to also have some
value encoded after the NUL-termination:
|	if (ea_value_len)
|		memcpy(parm_data->list.name + name_len + 1,
|			ea_value, ea_value_len);

Now for smb2ops.c and smb2transport.c:
Both of these cases are simple, strncpy() is used to copy string
literals which have a length less than the destination buffer's size. We
can simply swap in the new 2-argument version of strscpy() introduced in
Commit e6584c3964 ("string: Allow 2-argument strscpy()").

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-31 17:35:14 -05:00
Linus Torvalds
7e40c2100c Kbuild fixes for v6.9
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
 
  - Fix unselectable choice entry in MIPS Kconfig, and forbid this
    structure
 
  - Remove unused include/asm-generic/export.h
 
  - Fix a NULL pointer dereference bug in modpost
 
  - Enable -Woverride-init warning consistently with W=1
 
  - Drop KCSAN flags from *.mod.c files
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
 1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
 x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
 dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
 qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
 Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
 OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
 mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
 CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
 gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
 MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
 zYygkhvvtiH9/STu
 =tx+2
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Deduplicate Kconfig entries for CONFIG_CXL_PMU

 - Fix unselectable choice entry in MIPS Kconfig, and forbid this
   structure

 - Remove unused include/asm-generic/export.h

 - Fix a NULL pointer dereference bug in modpost

 - Enable -Woverride-init warning consistently with W=1

 - Drop KCSAN flags from *.mod.c files

* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: Fix typo HEIGTH to HEIGHT
  Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
  kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
  kbuild: make -Woverride-init warnings more consistent
  modpost: do not make find_tosym() return NULL
  export.h: remove include/asm-generic/export.h
  kconfig: do not reparent the menu inside a choice block
  MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
  cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
2024-03-31 11:23:51 -07:00
Arnd Bergmann
c40845e319 kbuild: make -Woverride-init warnings more consistent
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.

Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.

There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.

Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.

Fixes: 2cd3271b7a ("kbuild: avoid duplicate warning options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-31 11:32:26 +09:00
Linus Torvalds
712e14250d Bug fixes for 6.9-rc2:
* Allow stripe unit/width value passed via mount option to be written over
    existing values in the super block.
  * Do not set current->journal_info to avoid its value from being miused by
    another filesystem context.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZgKa+AAKCRAH7y4RirJu
 9IL1APwPBMzSowijBI/rCD5BGlISn7mCRlZwvyXE1avmRmbQPAEApU5yRhBHWi62
 629azfSr1I5m678xM7WQKh6X3/VUDAg=
 =pqNH
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Allow stripe unit/width value passed via mount option to be written
   over existing values in the super block

 - Do not set current->journal_info to avoid its value from being miused
   by another filesystem context

* tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't use current->journal_info
  xfs: allow sunit mount option to repair bad primary sb stripe values
2024-03-30 13:51:58 -07:00
Linus Torvalds
091619baac 2 cifs.ko changesets
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYG6B4ACgkQiiy9cAdy
 T1H8vAv/ULRl/PQreMq6FaXSeW0S5w7RiMeeZ1+314C7W/iLLbd8n6fmOshAS1nn
 d8vC6qx5ulTNRwFR7eYexolWgqG7bisKTiGzp+UZelo01LlPU90cJQvPZfaETVPR
 XpqRUwrQE/sU9n48PGvcGARqd2sz1WC3l/oeN1P0QXJ7xWKJJNcLqtzT0eeNgAto
 u/EKw9SZ5D20V1GrGYxHY57M9L0rRcndUoWZDn9s0NCkOGnibOOiTxpQV8sepJ35
 6CRKxLTatzAJahUaQ/C5uvM0cSfRcTQdp3u4B01+gCSPhP7mwIS0o5EDzw2lWk6W
 9zQz9de2PPw98Q0sbZn1fuherbLSKDXZIIUKEH+oZ2WXJfva5jIVBpp5EIR92VY4
 A86DznmRJC5br3q+FuBOc1NPyslV01l80ZmhWppeX6/YA8egkm4KWgFB/02n7Agh
 Ug2Fe8MNGVPiwmijXBswf9CSjn3ctAH9cRvqs0QXj3aPkduvu+jDVGwbO/zWdvlY
 c1gs/eF7
 =ai8C
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Add missing trace point (noticed when debugging the recent mknod LSM
   regression)

 - fscache fix

* tag '6.9-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix duplicate fscache cookie warnings
  smb3: add trace event for mknod
2024-03-29 12:06:09 -07:00
Linus Torvalds
d8e8fbec00 nfsd-6.9 fixes:
- Address three recently introduced regressions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYFtQkACgkQM2qzM29m
 f5fSbA//S0YiPCU+bVwN+mK47vhai1Dzaw1kd2rzlfnvTCyBFVocfQU85bFV8uns
 ZMcHhO6jPjRg1o/qkCzeNf+n6AVGwvgOn1jC628jTP1wJOS8WPsZtdwz8vHX/xw5
 qnMwEUtfFPiDn85VyK6z6mJBWRcF/vStPc022Ie5ba5ydxD0rFmaeIc6zYX8csUe
 8tbHlulYtRR97mXjxgtXC/H0scvag07MA1fRKiZRNk0jwUdtcAsHktiBfWPV2KXO
 ew8saA+BLyBBdE01cq1CoMH/3wt/a5rpaiV5ePPtYkmlPVT5IEmSLZUwUVZAjMuu
 qlvKF9xrH7x8/7g/SwOJdH8po0IjiRdW9EKID+6PPqXz+YYKyTe7JqZtvEzt/4x9
 tGbIfyHxpUPPTZ6VedlXYW95uWG+lrKzur+zHQwsNf+U0d5Xila8euEKqvB4kDtk
 riHUlvs8O6WroeHcOQ5pK72l7x8gELdVqHP6n+E0t5+2VKp/Vcqf8WlWV0suoDfa
 D2EYGIzxn8NrNuivGESHfzafmHKKMn51UGrm1Sl0vF5w2LlqMfL4Yo4NIT+e69VR
 E3h6wIhK7Jor9px5/tBD3Y6o9k8wT9upjA+qsrOUhxyU3T9r+s88I64gvGyABTIM
 kZJHLJYROoNO6A5FSRySSH+bk/UU+wJ0s6avgHL1A2yYnyrQUzU=
 =YV5O
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Address three recently introduced regressions

* tag 'nfsd-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
  SUNRPC: Revert 561141dd49
  nfsd: Fix error cleanup path in nfsd_rename()
2024-03-28 14:35:32 -07:00
Jeff Layton
7a84602297
9p: explicitly deny setlease attempts
9p is a remote network protocol, and it doesn't support asynchronous
notifications from the server. Ensure that we don't hand out any leases
since we can't guarantee they'll be broken when a file's contents
change.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 19:52:55 +00:00
Joakim Sindholt
4e5d208cc9
fs/9p: fix the cache always being enabled on files with qid flags
I'm not sure why this check was ever here. After updating to 6.6 I
suddenly found caching had been turned on by default and neither
cache=none nor the new directio would turn it off. After walking through
the new code very manually I realized that it's because the caching has
to be, in effect, turned off explicitly by setting P9L_DIRECT and
whenever a file has a flag, in my case QTAPPEND, it doesn't get set.

Setting aside QTDIR which seems to ignore the new fid->mode entirely,
the rest of these either should be subject to the same cache rules as
every other QTFILE or perhaps very explicitly not cached in the case of
QTAUTH.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:29 +00:00
Joakim Sindholt
87de39e705
fs/9p: translate O_TRUNC into OTRUNC
This one hits both 9P2000 and .u as it appears v9fs has never translated
the O_TRUNC flag.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:28 +00:00
Joakim Sindholt
cd25e15e57
fs/9p: only translate RWX permissions for plain 9P2000
Garbage in plain 9P2000's perm bits is allowed through, which causes it
to be able to set (among others) the suid bit. This was presumably not
the intent since the unix extended bits are handled explicitly and
conditionally on .u.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 13:59:23 +00:00
Linus Torvalds
8d025e2092 Changes since last update:
- Add a new reviewer Sandeep Dhavale to build a healthier community;
 
  - Drop experimental warning for FSDAX.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmYE1ucRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qoSnw//Q35Sc7fIQb6YfuOOQ2VQX7TVE4jTrLej
 +3hIQUz3BA8wRwrBYdeJLcvjda4y+0gNxlpw2ycNS6S4vxAiVEAuwPJYO6oAC0Il
 ETa6opc1vpTMeXgwwtbXC7ACnWas9EQC61Z4E8W5zeVcNqQZnZyInMJ9Rkjqs/iJ
 VjJH2wWR5MIgWJdEPqchPx/28nqbBOcztc7ARJqpujyZEvu+OBVIPv8P/7N0a5yG
 bpHkDrzoelBMkpktpuvrkv84ymyCDC7LH9mq+Wk6dY5wRFxa2BtoZVh6YcpcjrpR
 75GZVg3BN3Ph41JCwYHqxyRGpoLO11dSYi6DNDVngxOGtkTNRVGrJ1FYEapWV+o8
 1MnEbl0vZSHUkjrIFbfZTFSqpvW2XSfEOa3heNDFknmzT/ISobSGENULp9cggcYI
 jhS6wtVG4bl5bCiCKCZluByr8/J8TCQc/5t5f5bQLy2MWqlyjaSx82uWuDpzO1Hh
 +q+p+MB+ketMUwxaIUAuTNgzhFPFT4Na/ni9WqP7Ri3GJY6pdjDUUtrtoIBK4oPQ
 ajUWhPlOk5zMwLq9Jl4MiG1ostBI9P37ZerjsdaLDZYElGhTjwjPu/xlh7p26Inq
 Ufq3QaQH2wai+oAVS6Sli3dJfb399XVJhmT2WFMH+0DmQW6JzvsGTdUX2fMgv5sb
 I7dVfuceTs0=
 =+ZW9
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:

 - Add a new reviewer Sandeep Dhavale to build a healthier community

 - Drop experimental warning for FSDAX

* tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  MAINTAINERS: erofs: add myself as reviewer
  erofs: drop experimental warning for FSDAX
2024-03-27 20:24:09 -07:00
Linus Torvalds
4076fa1612 fs/9p: fix errors found in 6.9-rc1
Two of these fix syzbot reported issues, and
 the other fixes a unused variable in some configurations.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYEjyAACgkQiP/V+0pf
 /5jrDhAAkrEkFry9Zp7txMNG7fyRVRW8BPkvCUbORBdmLm01lswTnvHDT5pbSYcf
 Z3KxSpo0DvO2G3XEHQHQN24Sbx78Xg8XssXGFc0j1/hQbGpwIyXM/NTa3VVnfmwH
 lV2ysLa5zCR81k0hu8QzGe8DSOFsyE9oz3ABE7nCjmQsOI0zgyOvn7Uy1rwT5B11
 lBYN4rotl2Ie1wPoVTf8PNbeAWtuVR3FN9GTSvZuNDFbYjYEv0zFangxXIBFhn2g
 pUJ22C6CYenhLzTKBRVW/CcSxPVVS4jFok1nws07MmlXozrbklPYHuN6XQYFetLt
 wtnZxmUmy1u4pmpG2xPziUOGb7wdm6cDC1aIrcYPlbk/9U2iYHuVgQz0WOcEIbA3
 g5LrgCpUZO8UPxEvtFYDq9oUAhxhjf6BI7/rR0fNWp8JgJ0scz3d/e+Q/srWWr23
 Pej+10N5JijTEuD39BUnWy/yTz0WVBg1nP/C44buB0zSWJbxiPOJd0imSW6ipoIk
 Onr3FmhnRSiIfuUKEH/jI+QsqKzZTdBe9e3+SYBEg4KXvM3e1ltJgDs+BIdWbo00
 ih/jkZL7iX+uzkOdVR3o/1KagkwHTcU8P+4i8wELD7+CquQXvsedfqeTmz5JvNIR
 9Df1Gu1xCeINn6jnlSaVAonnCi1awJoaxFW+azCb+q5MP8lP6IU=
 =1ev6
 -----END PGP SIGNATURE-----

Merge tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
 "Two of these fix syzbot reported issues, and the other fixes a unused
  variable in some configurations"

* tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: fix uninitialized values during inode evict
  fs/9p: remove redundant pointer v9ses
  fs/9p: fix uaf in in v9fs_stat2inode_dotl
2024-03-27 14:53:56 -07:00
Linus Torvalds
400dd456bd for-6.9-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYEfJYACgkQxWXV+ddt
 WDucIg/+IupuqdLKnj6bepxX/VufnFAjecD3sRgZQQLIMfm3MQX3TzbNoPYEiAGU
 tNG6jxYgkGRoyhN3aIQnsJmRFje5epYjNA5+ueUNT2/KfyKonnS2TIKQt6u7XBls
 fl4SCTSNRX7w/QUNUWwyY5/86yzV4F8w19X5nVOKcp7Nz3hUBdeDZWAmMlYyHuFW
 N2YRyNdCxB4Y0U9g1vgI63wFjOac0F+7RTHGsDH7ueOZ2dbtDM38lHBoCbdX3jmy
 5nG7wVJZp2H/zCmzrVQJ897CMfr3h9r9Kxx8EE3JDJaJ5sMMaRh361rgsZTaGsjz
 SwUzT6Z0u0hsBANSTOUZixhfX5sqArmemG+XpFu6Rq+732DqS+c4vWRSu7c8Rc8i
 +4HIQNsjJqm/d1u2IyxXfuqSbaULLnyYQ8rdEx3o2AM37JnuTvOWoB+v/JqPb9TI
 aG+bOPvg7GM9Sl3IoM5sR+j3bEebranZbUF+UiDEujZJiY+uiw3vMbFvyOBRWaUU
 ODTpNoyCmz94mWg79hyosOjM9A/NCEkRH4oSc+YeqOvzTIBG3V+D3HxN/DX4FVTy
 VDxMdptu0aPIEkUQ3nvsj4t3OKgj1w9rxZFpRYH33zJVvRqZ8VrgtC9V6zgPv3h7
 suQL4s4i4EiIgAk2Z0OR23wDwg1TwXVWLGErfQVHslvhl/a2Qb4=
 =Lhhp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix race when reading extent buffer and 'uptodate' status is missed
   by one thread (introduced in 6.5)

 - do additional validation of devices using major:minor numbers

 - zoned mode fixes:
     - use zone-aware super block access during scrub
     - fix use-after-free during device replace (found by KASAN)
     - also delete zones that are 100% unusable to reclaim space

 - extent unpinning fixes:
     - fix extent map leak after error handling
     - print correct range in error message

 - error code and message updates

* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race in read_extent_buffer_pages()
  btrfs: return accurate error code on open failure in open_fs_devices()
  btrfs: zoned: don't skip block groups with 100% zone unusable
  btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
  btrfs: fix message not properly printing interval when adding extent map
  btrfs: fix warning messages not printing interval at unpin_extent_range()
  btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
  btrfs: validate device maj:min during open
  btrfs: zoned: fix use-after-free in do_zone_finish()
  btrfs: zoned: use zone aware sb location for scrub
2024-03-27 13:56:41 -07:00
Chuck Lever
99dc2ef039 NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies
There are one or two cases where CREATE_SESSION returns
NFS4ERR_DELAY in order to force the client to wait a bit and try
CREATE_SESSION again. However, after commit e4469c6cc6 ("NFSD: Fix
the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in
the CREATE_SESSION slot. Thus, when the client resends the
CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY
response rather than actually executing the request and properly
recording its outcome. This blocks the client from making further
progress.

RFC 8881 Section 15.1.1.3 says:
> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.

Neither the original NFSD code nor the discussion in section 18.36.4
refer explicitly to this important requirement, so I missed it.

Note also that not only must the server not cache NFS4ERR_DELAY, but
it has to not advance the CREATE_SESSION slot sequence number so
that it can properly recognize and accept the client's retry.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Fixes: e4469c6cc6 ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-27 13:19:47 -04:00
David Howells
8876a37277 cifs: Fix duplicate fscache cookie warnings
fscache emits a lot of duplicate cookie warnings with cifs because the
index key for the fscache cookies does not include everything that the
cifs_find_inode() function does.  The latter is used with iget5_locked() to
distinguish between inodes in the local inode cache.

Fix this by adding the creation time and file type to the fscache cookie
key.

Additionally, add a couple of comments to note that if one is changed the
other must be also.

Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-27 12:04:06 -05:00
Linus Torvalds
f4a432914a execve fixes for v6.9-rc2
- Fix selftests to conform to the TAP output format (Muhammad Usama Anjum)
 
 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
 
 - Replace deprecated strncpy usage (Justin Stitt)
 
 - Replace another /bin/sh instance in selftests
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYDT3sWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjxlD/49PYpA4hMReEJ/01UkMn7IT2DP
 QWV9IfaPTodj9tjQngalhcF7r6O5guRR7MRfZxyaXriq4aJNzOLm2STmwSG1cOgP
 hP9D0HnMSc5CrqMJ2kSTr3ETK0a2mTivWl375TUgGdW+QJo7YYInHYaH2THhme1Z
 MkLHqSkruHw6YVvSvzoWiwZ4taiia7op8HbAEvJQiwnJdiVeCLIYbf2AxXNop2xv
 xcmoGkSh6KSiQ0XQ7VXs4LC3v/ElHBINSbChoXPBDY5kBWZybyxRwYCVt8mJftgF
 mVGXBFFpnaLU/gDayPg/Pyq9sW1bLpi8w0BBu419BVfAQ475K+YZ/V8nj4fm95e3
 gIWm3x1O48r0OxdzmPb5re/s7lG5uNLzzFEWIus18NmqgA8S1CyFveRB3Zh8LlXB
 9UEt4mlcgp/CLAo1Zv6IBe6UDcAf4AR4Tq+d+etmORTqHmM7n399XivNuft9myyB
 9ObLCfKvOa71uF0n714XLHc5STk2KTK70Me2L/H5gitSqjIEKFNQ5SOaSbsGImDv
 i4YPnptCJFTQumE0Tu5hna8uyjOXFIxq/zkfDmzc1wP8FcijwRx3UPoO6WlQsdfx
 5cmJSaIX1bhFC+4gxAoEHUDWPh/f4kLeDpIXX6NPH28Do1wxLnri3ryvkfgkw5Vj
 /1E03LXfcnnSbjQAPQ==
 =Siss
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix selftests to conform to the TAP output format (Muhammad Usama
   Anjum)

 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)

 - Replace deprecated strncpy usage (Justin Stitt)

 - Replace another /bin/sh instance in selftests

* tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt: replace deprecated strncpy
  exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
  selftests/exec: Convert remaining /bin/sh to /bin/bash
  selftests/exec: execveat: Improve debug reporting
  selftests/exec: recursion-depth: conform test to TAP format output
  selftests/exec: load_address: conform test to TAP format output
  selftests/exec: binfmt_script: Add the overall result line according to TAP
2024-03-27 09:57:30 -07:00
Christian Brauner
22650a9982
fs,block: yield devices early
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.

Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 13:17:15 +01:00
Steve French
e9e9fbeb83 smb3: add trace event for mknod
Add trace points to help debug mknod and mkfifo:

   smb3_mknod_done
   smb3_mknod_enter
   smb3_mknod_err

Example output:

      TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
         | |         |   |||||     |         |
    mkfifo-6163    [003] .....   960.425558: smb3_mknod_enter: xid=12 sid=0xb55130f6 tid=0x46e6241c path=\fifo1
    mkfifo-6163    [003] .....   960.432719: smb3_mknod_done: xid=12 sid=0xb55130f6 tid=0x46e6241c

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-26 14:56:36 -05:00
Tavian Barnes
ef1e68236b btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning.  Unfortunately, there is
a racey interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |    (1)   |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |    (2)   |
             |          |    (1)

When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress.  This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256
    block=8550954455682405139 owner mismatch, have 11858205567642294356
    expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Anand Jain
2f1aeab9fc btrfs: return accurate error code on open failure in open_fs_devices()
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.

In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.

Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Johannes Thumshirn
a8b70c7f86 btrfs: zoned: don't skip block groups with 100% zone unusable
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.

So if a block group is 100% zone_unusable it is skipped from the deletion
step.

In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.

Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
2133460061 btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
379c872393 btrfs: fix message not properly printing interval when adding extent map
At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
4dc1d69c2b btrfs: fix warning messages not printing interval at unpin_extent_range()
At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.

Fixes: 00deaf04df ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Filipe Manana
8a565ec04d btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.

Fixes: c03c89f821 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Anand Jain
9f7eb8405d btrfs: validate device maj:min during open
Boris managed to create a device capable of changing its maj:min without
altering its device path.

Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.

Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.

To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.

CC: stable@vger.kernel.org # 6.7+
Fixes: a5b8a5f9f8 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <boris@bur.io>
Co-developed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>#
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Johannes Thumshirn
1ec17ef591 btrfs: zoned: fix use-after-free in do_zone_finish()
Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.

 BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
 ==================================================================
 BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
 Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007

 CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
 Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x90
  print_report+0xcf/0x670
  ? __virt_addr_valid+0x200/0x3e0
  kasan_report+0xd8/0x110
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  do_zone_finish+0x91a/0xb90 [btrfs]
  btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
  ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
  ? btrfs_put_root+0x2d/0x220 [btrfs]
  ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
  cleaner_kthread+0x21e/0x380 [btrfs]
  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
  kthread+0x2e3/0x3c0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x70
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

 Allocated by task 3493983:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  __kasan_kmalloc+0xaa/0xb0
  btrfs_alloc_device+0xb3/0x4e0 [btrfs]
  device_list_add.constprop.0+0x993/0x1630 [btrfs]
  btrfs_scan_one_device+0x219/0x3d0 [btrfs]
  btrfs_control_ioctl+0x26e/0x310 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 Freed by task 3494056:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  kasan_save_free_info+0x3f/0x60
  poison_slab_object+0x102/0x170
  __kasan_slab_free+0x32/0x70
  kfree+0x11b/0x320
  btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
  btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
  btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
  btrfs_ioctl+0xb27/0x57d0 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 The buggy address belongs to the object at ffff8881543c8000
  which belongs to the cache kmalloc-1k of size 1024
 The buggy address is located 96 bytes inside of
  freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)

 The buggy address belongs to the physical page:
 page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
 head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
 page_type: 0xffffffff()
 raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().

The sequence of events is as follows:

btrfs_dev_replace_start
  btrfs_scrub_dev
   btrfs_dev_replace_finishing
    btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
    btrfs_rm_dev_replace_free_srcdev
     btrfs_free_device                              <-- device freed

cleaner_kthread
 btrfs_delete_unused_bgs
  btrfs_zone_finish
   do_zone_finish              <-- refers the freed device

The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.

The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.

Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.

Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.

Many thanks to Shinichiro for analyzing the bug.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: stable@vger.kernel.org # 6.8
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:41:01 +01:00
Linus Torvalds
928a87efa4 gfs2 fix
- Fix boundary check in punch_hole
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmYBaM0UHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrzqw/9GpK71h1dIA8vYqInumdrUabksLKy
 jRMR2ZxfzBKLdAfgn9AS3nrWNos72vjAxbjYCi/fbY9uvIK1/zzq7Ef7601kCetM
 NzxShY8AwLJa9mO8O5yReLL7O/61gjlcdD6rSjkYwphWuobd5vpudKkibgpdJyH8
 bn6U1/2K5ASFtWyTRbudOIsz4AqPUE6ZB4KxSuCDx7uFiQjnuh6sk8wfg48pdig7
 GAsNPmBFfWAQXClPnI/WFG0hpkuRIK1hk9ITWx1ybu2JqaNeVXRBqGoRZbEkPYju
 qEkp4oT3j/1siBz1sMOjC5tfmAzhLvAeL61pD2EOcm5Bpd3iKJibYt/uCIpYFHM0
 WfRcUmqEduN1zhDuSR4KSe49JQ5dFXVf83YqUgbtrHFiHHXNBYYqFNUVfcDAB1p7
 IH9AlNd82zyxJ3fsBX7VpEbGC2qNa3K8hYO7px8DNVrPGzW7AhPF1Lsh0OE9GlZU
 H5f70Nryi98iwadbePBUchTrx0S3iYjk2TQgLGf5L/lAl6J/MRNG31kittDtehri
 cct/JBr8sUAK014TS5NxPbpxqDnVot3UsYk7h6s7WdmM1svfs7j5f1mo3ovMEGqX
 io5Z6pFEE7n1ce5hbieDKr3JFh6LxP1ArUSY8oz5rR0shE2XHMcIdq3J26Vfi0Q0
 4VjdBic/7rUUBXI=
 =QXEK
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Fix boundary check in punch_hole

* tag 'gfs2-v6.8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix invalid metadata access in punch_hole
2024-03-25 10:53:39 -07:00
Eric Van Hensbergen
6630036b7c
fs/9p: fix uninitialized values during inode evict
If an iget fails due to not being able to retrieve information
from the server then the inode structure is only partially
initialized.  When the inode gets evicted, references to
uninitialized structures (like fscache cookies) were being
made.

This patch checks for a bad_inode before doing anything other
than clearing the inode from the cache.  Since the inode is
bad, it shouldn't have any state associated with it that needs
to be written back (and there really isn't a way to complete
those anyways).

Reported-by: syzbot+eb83fe1cce5833cd66a0@syzkaller.appspotmail.com
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-25 14:16:06 +00:00
Dave Chinner
f2e812c152 xfs: don't use current->journal_info
syzbot reported an ext4 panic during a page fault where found a
journal handle when it didn't expect to find one. The structure
it tripped over had a value of 'TRAN' in the first entry in the
structure, and that indicates it tripped over a struct xfs_trans
instead of a jbd2 handle.

The reason for this is that the page fault was taken during a
copy-out to a user buffer from an xfs bulkstat operation. XFS uses
an "empty" transaction context for bulkstat to do automated metadata
buffer cleanup, and so the transaction context is valid across the
copyout of the bulkstat info into the user buffer.

We are using empty transaction contexts like this in XFS to reduce
the risk of failing to release objects we reference during the
operation, especially during error handling. Hence we really need to
ensure that we can take page faults from these contexts without
leaving landmines for the code processing the page fault to trip
over.

However, this same behaviour could happen from any other filesystem
that triggers a page fault or any other exception that is handled
on-stack from within a task context that has current->journal_info
set.  Having a page fault from some other filesystem bounce into XFS
where we have to run a transaction isn't a bug at all, but the usage
of current->journal_info means that this could result corruption of
the outer task's journal_info structure.

The problem is purely that we now have two different contexts that
now think they own current->journal_info. IOWs, no filesystem can
allow page faults or on-stack exceptions while current->journal_info
is set by the filesystem because the exception processing might use
current->journal_info itself.

If we end up with nested XFS transactions whilst holding an empty
transaction, then it isn't an issue as the outer transaction does
not hold a log reservation. If we ignore the current->journal_info
usage, then the only problem that might occur is a deadlock if the
exception tries to take the same locks the upper context holds.
That, however, is not a problem that setting current->journal_info
would solve, so it's largely an irrelevant concern here.

IOWs, we really only use current->journal_info for a warning check
in xfs_vm_writepages() to ensure we aren't doing writeback from a
transaction context. Writeback might need to do allocation, so it
can need to run transactions itself. Hence it's a debug check to
warn us that we've done something silly, and largely it is not all
that useful.

So let's just remove all the use of current->journal_info in XFS and
get rid of all the potential issues from nested contexts where
current->journal_info might get misused by another filesystem
context.

Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25 10:21:01 +05:30
Dave Chinner
15922f5dbf xfs: allow sunit mount option to repair bad primary sb stripe values
If a filesystem has a busted stripe alignment configuration on disk
(e.g. because broken RAID firmware told mkfs that swidth was smaller
than sunit), then the filesystem will refuse to mount due to the
stripe validation failing. This failure is triggering during distro
upgrades from old kernels lacking this check to newer kernels with
this check, and currently the only way to fix it is with offline
xfs_db surgery.

This runtime validity checking occurs when we read the superblock
for the first time and causes the mount to fail immediately. This
prevents the rewrite of stripe unit/width via
mount options that occurs later in the mount process. Hence there is
no way to recover this situation without resorting to offline xfs_db
rewrite of the values.

However, we parse the mount options long before we read the
superblock, and we know if the mount has been asked to re-write the
stripe alignment configuration when we are reading the superblock
and verifying it for the first time. Hence we can conditionally
ignore stripe verification failures if the mount options specified
will correct the issue.

We validate that the new stripe unit/width are valid before we
overwrite the superblock values, so we can ignore the invalid config
at verification and fail the mount later if the new values are not
valid. This, at least, gives users the chance of correcting the
issue after a kernel upgrade without having to resort to xfs-db
hacks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25 10:17:18 +05:30
Gao Xiang
a97b59ed79 erofs: drop experimental warning for FSDAX
As EXT4/XFS filesystems, FSDAX functionality is considered to be stable.
Let's drop this warning.

Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240325005116.106351-1-hsiangkao@linux.alibaba.com
2024-03-25 10:48:15 +08:00
Colin Ian King
10211b4a23
fs/9p: remove redundant pointer v9ses
Pointer v9ses is being assigned the value from the return of inlined
function v9fs_inode2v9ses (which just returns inode->i_sb->s_fs_info).
The pointer is not used after the assignment, so the variable is
redundant and can be removed.

Cleans up clang scan warnings such as:
fs/9p/vfs_inode_dotl.c:300:28: warning: variable 'v9ses' set but not
used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-25 00:34:35 +00:00
Lizhi Xu
11763a8598
fs/9p: fix uaf in in v9fs_stat2inode_dotl
The incorrect logical order of accessing the st object code in v9fs_fid_iget_dotl
is causing this uaf.

Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-and-tested-by: syzbot+7a3d75905ea1a830dbe5@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-25 00:34:35 +00:00
Linus Torvalds
ff9c18e435 A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
 clients performing reads and size-extending writes at the same time.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmX9xDETHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3HzCACJYjTUKq4v8/LkhzyJM0WTYXKQ+Orz
 BnwgFHGIEiihQKko/7Ks+fcuEGpdy97Rsn9mtmkN0UCKfbcCHwGwflaoYfkkIA4t
 V9pNX0xRwDTyKaENtiI5GVjC/nYcfotbRK4BfURRYKb1xYHq8lO0mOxXwvt5weqQ
 CISWACp7k7eMcX0R0fKT9LemfBDDu2Pxi5ZnDNSdI6Z87Bwdv96jOaCaJ93Azo1W
 Mjr9ddMmaaqsrmaUE3jp58b56nxTrcOUGR7XQUZtjNjEy5h91WazydD4TJFaEQrF
 CQsV5nXHSRT8E4ROUZk8fa7amLs6FGrx307fkOKW02exQPBF7ij/SAt4
 =STLZ
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A patch to minimize blockage when processing very large batches of
  dirty caps and two fixes to better handle EOF in the face of multiple
  clients performing reads and size-extending writes at the same time"

* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
  ceph: set correct cap mask for getattr request for read
  ceph: stop copying to iter at EOF on sync reads
  ceph: remove SLAB_MEM_SPREAD flag usage
  ceph: break the check delayed cap loop every 5s
2024-03-22 11:15:45 -07:00
Linus Torvalds
6f6efce52d Bug fixes for 6.9:
* Fix invalid pointer dereference by initializing xmbuf before tracepoint
   function is invoked.
 * Use memalloc_nofs_save() when inserting into quota radix tree.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZffInwAKCRAH7y4RirJu
 9IyEAP9h8KMNtDRXyBxFe8vCtjoSj7fwwIijWLa/y2NH1oWQowEAk83m14akH1KH
 J0HInearcoRv8L16oe/tcNxxPPBuSQQ=
 =emhS
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix invalid pointer dereference by initializing xmbuf before
   tracepoint function is invoked

 - Use memalloc_nofs_save() when inserting into quota radix tree

* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: quota radix tree allocations need to be NOFS on insert
  xfs: fix dev_t usage in xmbuf tracepoints
2024-03-22 11:12:21 -07:00
Jan Kara
9fe6e9e7b5 nfsd: Fix error cleanup path in nfsd_rename()
Commit a8b0026847 ("rename(): avoid a deadlock in the case of parents
having no common ancestor") added an error bail out path. However this
path does not drop the remount protection that has been acquired. Fix
the cleanup path to properly drop the remount protection.

Fixes: a8b0026847 ("rename(): avoid a deadlock in the case of parents having no common ancestor")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-22 09:52:00 -04:00
Justin Stitt
5248f40973 binfmt: replace deprecated strncpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

There is a _nearly_ identical implementation of fill_psinfo present in
binfmt_elf.c -- except that one uses get_task_comm over strncpy(). Let's
mirror that in binfmt_elf_fdpic.c

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc:  <linux-hardening@vger.kernel.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240321-strncpy-fs-binfmt_elf_fdpic-c-v2-1-0b6daec6cc56@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-21 20:20:52 -07:00
Linus Torvalds
8e938e3986 9 cifs.ko changesets
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmX83AYACgkQiiy9cAdy
 T1GxyQwApHv/GjrvWXZAHoF2w6BOZSdirHaFPHL+9hqwmxDQ8vcQntqDqyzBtP7R
 aanJdyH8h56e09Dhie46zo6vndHzqD7gqTQtvyHxM3YyXNfW6AH0/NB/hT/UP4s2
 /IFzrWvuH9vobi/UyqjbukVQo3Gix53E1SlkSvERDWvi8ynsHUt4SxeVS9PBid4H
 cZrMeb9FjRmaGLQE3kDEmASsnMFcoGjNiWkfu3TRX0LDFk9TMClzBWGWrWUtNHE+
 QNm6BN7I/mEJP8+W5MSKy20UnWrTH6GkDVuB2E/hJsB4Eo2ggAWTz8X5MYjbx4/A
 f0zx/TbeEiDCGLZpUeySPpz0BjoGVRJkrq1wAXw1H6VfvfwWTQxB+wB57p4SwfVO
 u1By3h5DsUszz4haL34wLSLhLkuMBuO7yMoa5Fnv5gb5NBu6U2IOtMG9WcmPZr6M
 ZUwyxSFZ/l1poinvyI3pQ7pplI3RO57qCUdfFnracVF56qvSby2KXqZ5X907Kgw3
 q3NzTHUc
 =aohZ
 -----END PGP SIGNATURE-----

Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Various get_inode_info_fixes

 - Fix for querying xattrs of cached dirs

 - Four minor cleanup fixes (including adding some header corrections
   and a missing flag)

 - Performance improvement for deferred close

 - Two query interface fixes

* tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  smb311: additional compression flag defined in updated protocol spec
  smb311: correct incorrect offset field in compression header
  cifs: Move some extern decls from .c files to .h
  cifs: remove redundant variable assignment
  cifs: fixes for get_inode_info
  cifs: open_cached_dir(): add FILE_READ_EA to desired access
  cifs: reduce warning log level for server not advertising interfaces
  cifs: make sure server interfaces are requested only for SMB3+
  cifs: defer close file handles having RH lease
2024-03-21 19:14:28 -07:00
Linus Torvalds
85a79128c4 This pull request contains updates for UBI and UBIFS:
UBI:
         - Add Zhihao Cheng as reviewer
 	- Attach via device tree
 	- Add NVMEM layer
 	- Various fastmap related fixes
 
 UBIFS:
         - Add Zhihao Cheng as reviewer
 	- Convert to folios
 	- Various fixes (memory leaks in error paths, function prototypes)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmX8kjUWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSUcD/sFJyv3oD9qqt+OZJUI2b84nHdk
 7EXC4vAd1ioTZzQS0txWx8rPPrhi/XKKGIea71qkDpHyi3foT0n2MlELHNpIZaoH
 r8F50LeMzxBC7NEdGMaU4JYR5FOhNrLJanF5H1MEiiN+IaovhPWrA0V9ViWvS8tM
 e3WDA3tEPo2bbpkzgstjow7YxIAD4OcXhgkFxqb0j299zZzO9GmhLqTlyaidBFne
 VJIjurHd4ixgFEBRJGxAxcAdST5ONwx5RmlTy+9/lubn326jRz5VTRj6pkcugjvn
 odyPeLHc3jEXGP+6qvtyuL2jy6AqyRksXQvZYgP5iL8m2+ga0Edj8/zfoiGPnjRN
 ukYIFI2l9Qv4jUsByHX/klSdILL2L5gK2G5u9LrgDameOTnBcQH/i/TBb1MWzPCA
 O48XJo8T0XvwOLCbgHOuQ7+yKKaI49C9AtM2cbrMRL1gJJKjUsXcC5YZu+3a9+Fi
 TO0o0Y61GKS893mmMznhQqTMMr+5JMMlHJ6C7F6pXdt90twThwABZidWQz1uZc2h
 s+KWo7ts5itxBLW4XP8oue4aBsRdVTQ0IbYcB7j+EXE3EjY7CEge2SNHY6/7eiEK
 Y86M75svkMkQdbLNgV+iSUrn7Uddozm14eHL6wIrWv8Pe9bx0OFlCTFsXzhM37hK
 EK3aNxhyIHk5EFkGHA==
 =70g8
 -----END PGP SIGNATURE-----

Merge tag 'ubifs-for-linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI and UBIFS updates from Richard Weinberger:
 "UBI:
   - Add Zhihao Cheng as reviewer
   - Attach via device tree
   - Add NVMEM layer
   - Various fastmap related fixes

  UBIFS:
   - Add Zhihao Cheng as reviewer
   - Convert to folios
   - Various fixes (memory leaks in error paths, function prototypes)"

* tag 'ubifs-for-linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: (34 commits)
  mtd: ubi: fix NVMEM over UBI volumes on 32-bit systems
  mtd: ubi: provide NVMEM layer over UBI volumes
  mtd: ubi: populate ubi volume fwnode
  mtd: ubi: introduce pre-removal notification for UBI volumes
  mtd: ubi: attach from device tree
  mtd: ubi: block: use notifier to create ubiblock from parameter
  dt-bindings: mtd: ubi-volume: allow UBI volumes to provide NVMEM
  dt-bindings: mtd: add basic bindings for UBI
  ubifs: Queue up space reservation tasks if retrying many times
  ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
  ubifs: dbg_check_idx_size: Fix kmemleak if loading znode failed
  ubi: Correct the number of PEBs after a volume resize failure
  ubi: fix slab-out-of-bounds in ubi_eba_get_ldesc+0xfb/0x130
  ubi: correct the calculation of fastmap size
  ubifs: Remove unreachable code in dbg_check_ltab_lnum
  ubifs: fix function pointer cast warnings
  ubifs: fix sort function prototype
  ubi: Check for too small LEB size in VTBL code
  MAINTAINERS: Add Zhihao Cheng as UBI/UBIFS reviewer
  ubifs: Convert populate_page() to take a folio
  ...
2024-03-21 15:09:29 -07:00
Linus Torvalds
241590e5a1 Driver core changes for 6.9-rc1
Here is the "big" set of driver core and kernfs changes for 6.9-rc1.
 
 Nothing all that crazy here, just some good updates that include:
   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)
   - kobject lock contention fixes from Eric Dumazet
   - driver core cleanups from Andy
   - kernfs rcu work from Tejun
   - fw_devlink changes to resolve some reported issues
   - other minor changes, all details in the shortlog
 
 All of these have been in linux-next for a long time with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwsHg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynT4ACePcNRAsYrINlOPPKPHimJtyP01yEAn0pZYnj2
 0/UpqIqf3HVPu7zsLKTa
 =vR9S
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core and kernfs changes for 6.9-rc1.

  Nothing all that crazy here, just some good updates that include:

   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)

   - kobject lock contention fixes from Eric Dumazet

   - driver core cleanups from Andy

   - kernfs rcu work from Tejun

   - fw_devlink changes to resolve some reported issues

   - other minor changes, all details in the shortlog

  All of these have been in linux-next for a long time with no reported
  issues"

* tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
  device: core: Log warning for devices pending deferred probe on timeout
  driver: core: Use dev_* instead of pr_* so device metadata is added
  driver: core: Log probe failure as error and with device metadata
  of: property: fw_devlink: Add support for "post-init-providers" property
  driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link
  driver core: Adds flags param to fwnode_link_add()
  debugfs: fix wait/cancellation handling during remove
  device property: Don't use "proxy" headers
  device property: Move enum dev_dma_attr to fwnode.h
  driver core: Move fw_devlink stuff to where it belongs
  driver core: Drop unneeded 'extern' keyword in fwnode.h
  firmware_loader: Suppress warning on FW_OPT_NO_WARN flag
  sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group.
  firmware_loader: introduce __free() cleanup hanler
  platform-msi: Remove usage of the deprecated ida_simple_xx() API
  sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE()
  sysfs: Document new "group visible" helpers
  sysfs: Fix crash on empty group attributes array
  sysfs: Introduce a mechanism to hide static attribute_groups
  sysfs: Introduce a mechanism to hide static attribute_groups
  ...
2024-03-21 13:34:15 -07:00
Max Filippov
2aea94ac14 exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
In NOMMU kernel the value of linux_binprm::p is the offset inside the
temporary program arguments array maintained in separate pages in the
linux_binprm::page. linux_binprm::exec being a copy of linux_binprm::p
thus must be adjusted when that array is copied to the user stack.
Without that adjustment the value passed by the NOMMU kernel to the ELF
program in the AT_EXECFN entry of the aux array doesn't make any sense
and it may break programs that try to access memory pointed to by that
entry.

Adjust linux_binprm::exec before the successful return from the
transfer_args_to_stack().

Cc: <stable@vger.kernel.org>
Fixes: b6a2fea393 ("mm: variable length argument support")
Fixes: 5edc2a5123 ("binfmt_elf_fdpic: wire up AT_EXECFD, AT_EXECFN, AT_SECURE")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20240320182607.1472887-1-jcmvbkbc@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-21 10:05:47 -07:00
Linus Torvalds
7b65c810a1 for-6.9-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmX8X6gACgkQxWXV+ddt
 WDs9fQ/+NdeEjgwmYD6EdGNB7dZZKz0XvZ+/2SyDTaQHRazPSmMQh7cHyCzyckmW
 ORRcluGWo+m7cg2bdW9c1KMFdvCaEfuDnDxzyMGvE1iy9gY6KuLbqFIZ1jQe7I6t
 X8voNKLYhF8W3nrBP+PR/PSi61Op2jzrpgzPKdQZFvV4UlTnsNQJyG77IgDB/FFu
 vvnQBtW1xsufnxzYX7+S2066GpCcxqQlaWcgbBRzDcWvJBtzte2hnX2A/wX19evO
 nOqSaWcvALYCPBXfJ4VQ9Znjd3dnU0p2Gf4bp/eTClNB9h5QiDtMCr54/OKT2O8W
 bdqg6RqqWTHtKTWW9MCrWN2qLT8aPoRQJTFv91D2njSelemLVGrBnxXWZeDpB6kX
 0GC4Iqld+F1AM1lOd91D6V7ICQAwf1msp3YE/cokCLZozssKHN+4wrk3lyngWgDT
 AnvRRFTC9TOqLavobI6Upfc/jxP3ZkrSuacgJCuIILvptCLOyVmUqNhQyXx5GVEm
 TeARUeLnNrvnmaXWiW3tSRNZd52VGsoGqW81N/Uefa0zG5HGnUEJbnHb3HXLgH17
 AxyXKDwPnRnOj3fNl0fjZaTFtSgB3noFMMKZ2j6gRiyh5iG3/f8pZgtxesqcwT1e
 vCdq6b3sYEU8bYkSXeaFvi5WBrUIwquej4k0t/F4I/1hmzk7tTg=
 =B5bp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "Fix a problem found in 6.7 after adding the temp-fsid feature which
  changed device tracking in memory and broke grub-probe. This is used
  on initrd-less systems. There were several iterations of the fix and
  it took longer than expected"

* tag 'for-6.9-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: do not skip re-registration for the mounted device
2024-03-21 09:54:28 -07:00
Linus Torvalds
1b3e251373 Description for this pull request:
- Improve dirsync performance by syncing on a dentry-set rather than
     on a per-directory entry
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmX653kWHGxpbmtpbmpl
 b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCMEZD/9bpRN1V5YezmQUUh928yC0OWai
 aEDZqmzEsrJ7wXKUYmwjCzxO9/h3CwdtWTaJMctz2XlfJ9L62hiGXki4Cc27vfgs
 RGDV7fHpRmRq+JxgZN+UEnfrJx6kA0xrOaoyrbvT0t8pTCyK8yOY28YsltbI8wKd
 yQbWS4u4r1/Gugfry7PeGA5x6fxGcP/kbjB81Q1+/ilJetqELcUH9INOGSwvfzOh
 k9lgF+ujJVauzP1Pbw4fSZhcfXGYu0x4rbwcUAJUvuc3NXbotKEAW/ICJW1bH5En
 nN7IjiCYwMjEpK7H4zaZ61zrTIfe/MYgKLsq5XWYrvAmSL8QIlJiB4gbSpwd95gc
 7AK4d4mgYrF3oZPjYvb7kpbPtrNywQzIhef2W67E+fGifAKSsuI8gf+LYA2a7m5A
 HqTeL+3z1DnCwEGfLkNtEBy4xbv039dftjDcR8qzPs3WgwWU7G1nhwuQttcXC1qB
 p5eYxL1RvCbDXER9K6chdYQg7KkxCThlIGuG6shqQt4ybApyGbngmw7s3laUHMZQ
 3R3eD1UqtDb3vr04/PmvQ8NbLy0js8Q2lmjIowznyO/ahpCw/++OgX7T4Q+Sse6h
 xfGj8YkyeXNF7t6NwXhErdygvovoXQ4ADZIcxlhSc3NKnZUzL7RDEx5yMUzKggHu
 BlZlCJZoOoAm5Gk9Cw==
 =0fPf
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - Improve dirsync performance by syncing on a dentry-set rather than on
   a per-directory entry

* tag 'exfat-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: remove duplicate update parent dir
  exfat: do not sync parent dir if just update timestamp
  exfat: remove unused functions
  exfat: convert exfat_find_empty_entry() to use dentry cache
  exfat: convert exfat_init_ext_entry() to use dentry cache
  exfat: move free cluster out of exfat_init_ext_entry()
  exfat: convert exfat_remove_entries() to use dentry cache
  exfat: convert exfat_add_entry() to use dentry cache
  exfat: add exfat_get_empty_dentry_set() helper
  exfat: add __exfat_get_dentry_set() helper
2024-03-21 09:47:12 -07:00
Linus Torvalds
2395690004 9 ksmbd changesets
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmX6Yr8ACgkQiiy9cAdy
 T1GOEwv/fOi8lx3BdIUoJI/AhlnCHze87MikdD3WAmhFaheTiCanVz+0o26ugDoZ
 Tzq3razDbK4Z6sFGHLI4i4XKkaWpj9RP1UAVckNMbz2IGqR5vR+mKk0Q6BHnViPs
 CbIsqo1Ya84IYBeCyoe99FZxBtDWxx+H16UtFsMro6leyP1bUGSIwfh4fH21hnXt
 OSrR2eMKSbEAQYzFiCnacbhh2ssWw2blnbh4eqxOyuKrU37GXpgfLA9OvlAXmnNa
 DBLgp9fdpng+q+zFOfLPSUburEJBndsvIMX/rZvpKJETL64WWNmjqnNcIuoovEAi
 sMLt4W5MWupcRCpArT/ZmjFLUhsjGUMmAbHaMFLe76Pi4ooeZhWEfakR4g/4g4kF
 Y/XXNEkwrdUQKam9CkkXzCB0DGAwfu5XLhpSTnOTQ0ECbGF5AYKdbChxioTSmxU8
 alz4wviviMY8OxTbJa75MpSac2eDSW6bprO7Q++cVYk1DzR3I36Wts3FH0+uz9fW
 4mEoMnS2
 =Zyn6
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server updates from Steve French:

 - add support for durable file handles (an important data integrity
   feature)

 - fixes for potential out of bounds issues

 - fix possible null dereference in close

 - getattr fixes

 - trivial typo fix and minor cleanup

* tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: remove module version
  ksmbd: fix potencial out-of-bounds when buffer offset is invalid
  ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16()
  ksmbd: Fix spelling mistake "connction" -> "connection"
  ksmbd: fix possible null-deref in smb_lazy_parent_lease_break_close
  ksmbd: add support for durable handles v1/v2
  ksmbd: mark SMB2_SESSION_EXPIRED to session when destroying previous session
  ksmbd: retrieve number of blocks using vfs_getattr in set_file_allocation_info
  ksmbd: replace generic_fillattr with vfs_getattr
2024-03-20 16:42:47 -07:00
Steve French
e56bc745fa smb311: additional compression flag defined in updated protocol spec
Added new compression flag that was recently documented, in
addition fix some typos and clarify the sid_attr_data struct
definition.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-20 11:49:44 -05:00
Steve French
68c5818a27 smb311: correct incorrect offset field in compression header
The offset field in the compression header is 32 bits not 16.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reported-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-20 11:49:44 -05:00
David Howells
5b142b37c7 cifs: Move some extern decls from .c files to .h
Move the following:

        extern mempool_t *cifs_sm_req_poolp;
        extern mempool_t *cifs_req_poolp;
        extern mempool_t *cifs_mid_poolp;
        extern bool disable_legacy_dialects;

from various .c files to cifsglob.h.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-20 11:49:24 -05:00
Linus Torvalds
a4145ce1e7 bcachefs fixes for 6.9-rc1
Assorted bugfixes.
 
 Most are fixes for simple assertion pops; the most significant fix is
 for a deadlock in recovery when we have to rewrite large numbers of
 btree nodes to fix errors. This was incorrectly running out of the same
 workqueue as the core interior btree update path - we now give it its
 own single threaded workqueue.
 
 This was visible to users as "bch2_btree_update_start(): error:
 BCH_ERR_journal_reclaim_would_deadlock" - and then recovery hanging.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmX6CoYACgkQE6szbY3K
 bnZp2hAAwAw8haQKeR0+0aAaqTavvBcjcloeKlQhRl+OxV1rAgxcjKGai5txZ9rI
 d4FVOOo7MqHq1oN9Ydsy1+0R70eCFzhDxhT1Ph5MhIzc7nd8lC0GQjO0atx23cni
 4UZgSxi6quEP401MTVhvVbCPLmvfPJLpIBzptJUDS/eysxSZpS4A10gEzipoNjPv
 DOdrsvoo8nQX53tERJ/IxtroFL44p4y8OyZK65NILFF9xZosKz1P9ktrWufmRVoY
 /Hl8SUfhSNJDFW5pIMPOmoG/+RG+hJK4BaiNWPXLaSvO+3PmQskJ2tvHQVNjHQYt
 dMYWcy4hN47XtYvrHG9xmaQP+lZCDijdBrhmik4brqfZbloH43MVdDFysjfIPhUm
 qk+zzb0uE0ZhwRvQOjnYEQpHjXmj7Bm80+dhfNuuiKlhz4bOeDz8UZykJOzgD0zH
 n4cd+nbCxuogkukzLLQMbFv1+MCsCZpStkXP3GQXCK0k+H2briPGALuA74sxfAhH
 ajHLNr6qMU+uB6Ce0oM7e+9dPLfV/NalEwWW7aR/4TamxPBt575Hpjp0BV//BRfD
 IxdEKrMNdbKBJDUj1s5aTwcSF6ae6zHtyQXuKr93mWQqNvVXvX5/FPQYr70uA1VP
 iieBkde7aSTGCbTdTEcY9NcXdT2X/91aobsPvwGeq1z5Y1JJ0nU=
 =J/hu
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Assorted bugfixes.

  Most are fixes for simple assertion pops; the most significant fix is
  for a deadlock in recovery when we have to rewrite large numbers of
  btree nodes to fix errors. This was incorrectly running out of the
  same workqueue as the core interior btree update path - we now give it
  its own single threaded workqueue.

  This was visible to users as "bch2_btree_update_start(): error:
  BCH_ERR_journal_reclaim_would_deadlock" - and then recovery hanging"

* tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: Fix lost wakeup on journal shutdown
  bcachefs; Fix deadlock in bch2_btree_update_start()
  bcachefs: ratelimit errors from async_btree_node_rewrite
  bcachefs: Run check_topology() first
  bcachefs: Improve bch2_fatal_error()
  bcachefs: Fix lost transaction restart error
  bcachefs: Don't corrupt journal keys gap buffer when dropping alloc info
  bcachefs: fix for building in userspace
  bcachefs: bch2_snapshot_is_ancestor() now safe to call in early recovery
  bcachefs: Fix nested transaction restart handling in bch2_bucket_gens_init()
  bcachefs: Improve sysfs internal/btree_updates
  bcachefs: Split out btree_node_rewrite_worker
  bcachefs: Fix locking in bch2_alloc_write_key()
  bcachefs: Avoid extent entry type assertions in .invalid()
  bcachefs: Fix spurious -BCH_ERR_transaction_restart_nested
  bcachefs: Fix check_key_has_snapshot() call
  bcachefs: Change "accounting overran journal reservation" to a warning
2024-03-19 17:27:25 -07:00
Xiubo Li
825b82f6b8 ceph: set correct cap mask for getattr request for read
In case of hitting the file EOF, ceph_read_iter() needs to retrieve the
file size from MDS, and Fr caps aren't neccessary.

[ idryomov: fold into existing retry_op == READ_INLINE branch ]

Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-19 14:35:55 +01:00
Xiubo Li
1065da21e5 ceph: stop copying to iter at EOF on sync reads
If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read.  Then, when retrying, the remainder of the range may be
skipped incorrectly.

Ensure that the "to" iter is advanced only until EOF.

[ idryomov: changelog ]

Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-19 14:35:55 +01:00
Yuezhang Mo
dc38fdc51b exfat: remove duplicate update parent dir
For renaming, the directory only needs to be updated once if it
is in the same directory.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:56:10 +09:00
Yuezhang Mo
96cf51accc exfat: do not sync parent dir if just update timestamp
When sync or dir_sync is enabled, there is no need to sync the
parent directory's inode if only for updating its timestamp.

1. If an unexpected power failure occurs, the timestamp of the
   parent directory is not updated to the storage, which has no
   impact on the user.

2. The number of writes will be greatly reduced, which can not
   only improve performance, but also prolong device life.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:56:05 +09:00
Yuezhang Mo
4d71455976 exfat: remove unused functions
exfat_count_ext_entries() is no longer called, remove it.
exfat_update_dir_chksum() is no longer called, remove it and
rename exfat_update_dir_chksum_with_entry_set() to it.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:56:01 +09:00
Yuezhang Mo
af02c72d0b exfat: convert exfat_find_empty_entry() to use dentry cache
Before this conversion, each dentry traversed needs to be read
from the storage device or page cache. There are at least 16
dentries in a sector. This will result in frequent page cache
searches.

After this conversion, if all directory entries in a sector are
used, the sector only needs to be read once.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:54 +09:00
Yuezhang Mo
d97e060673 exfat: convert exfat_init_ext_entry() to use dentry cache
Before this conversion, in exfat_init_ext_entry(), to init
the dentries in a dentry set, the sync times is equals the
dentry number if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.

After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:49 +09:00
Yuezhang Mo
4e1aa22fea exfat: move free cluster out of exfat_init_ext_entry()
exfat_init_ext_entry() is an init function, it's a bit strange
to free cluster in it. And the argument 'inode' will be removed
from exfat_init_ext_entry(). So this commit changes to free the
cluster in exfat_remove_entries().

Code refinement, no functional changes.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:45 +09:00
Yuezhang Mo
ff4343da02 exfat: convert exfat_remove_entries() to use dentry cache
Before this conversion, in exfat_remove_entries(), to mark the
dentries in a dentry set as deleted, the sync times is equals
the dentry numbers if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.

After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:40 +09:00
Yuezhang Mo
cf8663fa99 exfat: convert exfat_add_entry() to use dentry cache
After this conversion, if "dirsync" or "sync" is enabled, the
number of synchronized dentries in exfat_add_entry() will change
from 2 to 1.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:36 +09:00
Yuezhang Mo
01da3a5176 exfat: add exfat_get_empty_dentry_set() helper
This helper is used to lookup empty dentry set. If there are
no enough empty dentries at the input location, this helper will
return the number of dentries that need to be skipped for the
next lookup.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:33 +09:00
Yuezhang Mo
7b6bab2359 exfat: add __exfat_get_dentry_set() helper
Since exfat_get_dentry_set() invokes the validate functions of
exfat_validate_entry(), it only supports getting a directory
entry set of an existing file, doesn't support getting an empty
entry set.

To remove the limitation, add this helper.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-03-19 20:55:28 +09:00
Kent Overstreet
2e92d26b25 bcachefs: Fix lost wakeup on journal shutdown
We need to check for journal shutdown first in __journal_res_get() -
after the journal is shutdown, j->watermark won't be changing anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 23:35:42 -04:00
Kent Overstreet
c502b5b878 bcachefs; Fix deadlock in bch2_btree_update_start()
BCH_TRANS_COMMIT_journal_reclaim with watermark != BCH_WATERMARK_reclaim
means nonblocking, and we need the journal_res_get() in
btree_update_start() to respect that.

In a future refactoring we'll be deleting
BCH_TRANS_COMMIT_journal_reclaim and replacing it with an explicit
BCH_TRANS_COMMIT_nonblocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 23:35:42 -04:00
Namjae Jeon
def30e72d8 ksmbd: remove module version
ksmbd module version marking is not needed. Since there is a
Linux kernel version, there is no point in increasing it anymore.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-18 21:21:38 -05:00
Namjae Jeon
c6cd2e8d2d ksmbd: fix potencial out-of-bounds when buffer offset is invalid
I found potencial out-of-bounds when buffer offset fields of a few requests
is invalid. This patch set the minimum value of buffer offset field to
->Buffer offset to validate buffer length.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-18 21:21:33 -05:00
Linus Torvalds
b3603fcb79 dlm for 6.9
- Fix mistaken variable assignment that caused a refcounting problem.
 - Revert a recent change that began using atomic counters where they
   were not needed (for lkb wait_count.)
 - Add comments around forced state reset for waiting lock operations
   during recovery.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmX4raoACgkQOBtzx/yA
 aapyThAAtLcTZXOa9MuZDvLtaQKX4c2MDlqiAhdL0YOYnz3+DAveA8HF1FRbVwL0
 74lA1O/GX0t2TdCrLiq75u+N/Sm2ACtbZEr8z6VeEoxxtOwCVbGKjA0CwDgvhdSe
 hUv5beO4mlguc16l4+u88z1Ta6GylXmWHRL6l2q4dPKmO4qVX6wn9JUT4JHJSQy/
 ACJ3+Lu7ndREBzCmqb4cR4TcHAhBynYmV7IIE3LQprgkCKiX2A3boeOIk+lEhUn5
 aqmwNNF2WDjJ1D5QVKbXu07MraD71rnyZBDuHzjprP01OhgXfUHLIcgdi7GzK8aN
 KnQ9S5hQWHzTiWA/kYgrUq/S5124plm2pMRyh1WDG6g3dhBxh7XsOHUxtgbLaurJ
 LmMxdQgH0lhJ3f+LSm3w8e3m45KxTeCYC2NUVg/icjOGUjAsVx1xMDXzMxoABoWO
 GGVED4i4CesjOyijMuRO9G/0MRb/lIyZkfoZgtHgL20yphmtv0B5XIIz062N28Wf
 PqmsYUz4ESYkxR4u/5VPBey5aYYdhugnOSERC6yH4QQJXyRgGWQn/CSuRrEmJJS2
 CurprPKx99XJZjZE7RJNlvpUrSBcD9Y7R6I3vo6RyrUCNwPJ0Y+Qvydvc9FoMN3R
 tn7fJe7tDfEEsukhGkwp90vK3MLbW5iKv7IaAxyALdSW12A23WM=
 =6RCz
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:

 - Fix mistaken variable assignment that caused a refcounting problem

 - Revert a recent change that began using atomic counters where they
   were not needed (for lkb wait_count)

 - Add comments around forced state reset for waiting lock operations
   during recovery

* tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: add comments about forced waiters reset
  dlm: revert atomic_t lkb_wait_count
  dlm: fix user space lkb refcounting
2024-03-18 15:39:48 -07:00
Linus Torvalds
ad584d73a2 Tracing updates for 6.9:
Main user visible change:
 
 - User events can now have "multi formats"
 
   The current user events have a single format. If another event is created
   with a different format, it will fail to be created. That is, once an
   event name is used, it cannot be used again with a different format. This
   can cause issues if a library is using an event and updates its format.
   An application using the older format will prevent an application using
   the new library from registering its event.
 
   A task could also DOS another application if it knows the event names, and
   it creates events with different formats.
 
   The multi-format event is in a different name space from the single
   format. Both the event name and its format are the unique identifier.
   This will allow two different applications to use the same user event name
   but with different payloads.
 
 - Added support to have ftrace_dump_on_oops dump out instances and
   not just the main top level tracing buffer.
 
 Other changes:
 
 - Add eventfs_root_inode
 
   Only the root inode has a dentry that is static (never goes away) and
   stores it upon creation. There's no reason that the thousands of other
   eventfs inodes should have a pointer that never gets set in its
   descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode
   descriptor and a dentry pointer, and only the root inode will use this.
 
 - Added WARN_ON()s in eventfs
 
   There's some conditionals remaining in eventfs that should never be hit,
   but instead of removing them, add WARN_ON() around them to make sure that
   they are never hit.
 
 - Have saved_cmdlines allocation also include the map_cmdline_to_pid array
 
   The saved_cmdlines structure allocates a large amount of data to hold its
   mappings. Within it, it has three arrays. Two are already apart of it:
   map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by
   also including the map_cmdline_to_pid[] array as well.
 
 - Restructure __string() and __assign_str() macros used in TRACE_EVENT().
 
   Dynamic strings in TRACE_EVENT() are declared with:
 
       __string(name, source)
 
   And assigned with:
 
      __assign_str(name, source)
 
   In the tracepoint callback of the event, the __string() is used to get the
   size needed to allocate on the ring buffer and __assign_str() is used to
   copy the string into the ring buffer. There's a helper structure that is
   created in the TRACE_EVENT() macro logic that will hold the string length
   and its position in the ring buffer which is created by __string().
 
   There are several trace events that have a function to create the string
   to save. This function is executed twice. Once for __string() and again
   for __assign_str(). There's no reason for this. The helper structure could
   also save the string it used in __string() and simply copy that into
   __assign_str() (it also already has its length).
 
   By using the structure to store the source string for the assignment, it
   means that the second argument to __assign_str() is no longer needed.
 
   It will be removed in the next merge window, but for now add a warning if
   the source string given to __string() is different than the source string
   given to __assign_str(), as the source to __assign_str() isn't even used
   and will be going away.
 
 - Added checks to make sure that the source of __string() is also the
   source of __assign_str() so that it can be safely removed in the next
   merge window.
 
   Included fixes that the above check found.
 
 - Other minor clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfhbUBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrhJAP9bfnYO7tfNGZVNPmTT7Fz0z4zCU1Pb
 P8M+24yiFTeFWwD/aIPlMFZONVkTdFAlLdffl6kJOKxZ7vW4XzUjfNWb6wo=
 =z/D6
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:
 "Main user visible change:

   - User events can now have "multi formats"

     The current user events have a single format. If another event is
     created with a different format, it will fail to be created. That
     is, once an event name is used, it cannot be used again with a
     different format. This can cause issues if a library is using an
     event and updates its format. An application using the older format
     will prevent an application using the new library from registering
     its event.

     A task could also DOS another application if it knows the event
     names, and it creates events with different formats.

     The multi-format event is in a different name space from the single
     format. Both the event name and its format are the unique
     identifier. This will allow two different applications to use the
     same user event name but with different payloads.

   - Added support to have ftrace_dump_on_oops dump out instances and
     not just the main top level tracing buffer.

  Other changes:

   - Add eventfs_root_inode

     Only the root inode has a dentry that is static (never goes away)
     and stores it upon creation. There's no reason that the thousands
     of other eventfs inodes should have a pointer that never gets set
     in its descriptor. Create a eventfs_root_inode desciptor that has a
     eventfs_inode descriptor and a dentry pointer, and only the root
     inode will use this.

   - Added WARN_ON()s in eventfs

     There's some conditionals remaining in eventfs that should never be
     hit, but instead of removing them, add WARN_ON() around them to
     make sure that they are never hit.

   - Have saved_cmdlines allocation also include the map_cmdline_to_pid
     array

     The saved_cmdlines structure allocates a large amount of data to
     hold its mappings. Within it, it has three arrays. Two are already
     apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory
     can be saved by also including the map_cmdline_to_pid[] array as
     well.

   - Restructure __string() and __assign_str() macros used in
     TRACE_EVENT()

     Dynamic strings in TRACE_EVENT() are declared with:

         __string(name, source)

     And assigned with:

        __assign_str(name, source)

     In the tracepoint callback of the event, the __string() is used to
     get the size needed to allocate on the ring buffer and
     __assign_str() is used to copy the string into the ring buffer.
     There's a helper structure that is created in the TRACE_EVENT()
     macro logic that will hold the string length and its position in
     the ring buffer which is created by __string().

     There are several trace events that have a function to create the
     string to save. This function is executed twice. Once for
     __string() and again for __assign_str(). There's no reason for
     this. The helper structure could also save the string it used in
     __string() and simply copy that into __assign_str() (it also
     already has its length).

     By using the structure to store the source string for the
     assignment, it means that the second argument to __assign_str() is
     no longer needed.

     It will be removed in the next merge window, but for now add a
     warning if the source string given to __string() is different than
     the source string given to __assign_str(), as the source to
     __assign_str() isn't even used and will be going away.

   - Added checks to make sure that the source of __string() is also the
     source of __assign_str() so that it can be safely removed in the
     next merge window.

     Included fixes that the above check found.

   - Other minor clean ups and fixes"

* tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
  tracing: Add __string_src() helper to help compilers not to get confused
  tracing: Use strcmp() in __assign_str() WARN_ON() check
  tracepoints: Use WARN() and not WARN_ON() for warnings
  tracing: Use div64_u64() instead of do_div()
  tracing: Support to dump instance traces by ftrace_dump_on_oops
  tracing: Remove second parameter to __assign_rel_str()
  tracing: Add warning if string in __assign_str() does not match __string()
  tracing: Add __string_len() example
  tracing: Remove __assign_str_len()
  ftrace: Fix most kernel-doc warnings
  tracing: Decrement the snapshot if the snapshot trigger fails to register
  tracing: Fix snapshot counter going between two tracers that use it
  tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
  tracing: Use ? : shortcut in trace macros
  tracing: Do not calculate strlen() twice for __string() fields
  tracing: Rework __assign_str() and __string() to not duplicate getting the string
  cxl/trace: Properly initialize cxl_poison region name
  net: hns3: tracing: fix hclgevf trace event strings
  drm/i915: Add missing ; to __assign_str() macros in tracepoint code
  NFSD: Fix nfsd_clid_class use of __string_len() macro
  ...
2024-03-18 15:11:44 -07:00
Chengming Zhou
a8922f7967 ceph: remove SLAB_MEM_SPREAD flag usage
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series [1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-18 22:03:29 +01:00
Xiubo Li
09927e7ef1 ceph: break the check delayed cap loop every 5s
In some cases this may take a long time and will block renewing
the caps to MDS.

[ idryomov: massage comment ]

Link: https://tracker.ceph.com/issues/50223#note-21
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-18 22:03:29 +01:00
Linus Torvalds
bf3a69c686 One fix, one cleanup...
Fix:
 Julia Lawall pointed out a null pointer dereference.
 
 Cleanup:
 Vlastimil Babka sent me a patch to remove some SLAB related code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAmXSUcsACgkQz0QOqevO
 Db5VmxAAiQlsxX2Ki3q00rMgaXFS4yTwSPsGsTrC5eQlyJfq4xEGMTWJGjjRS56k
 L2FGioP7OmIlo2VrzCc9Ms4ve/NyQjXpaoDMnsEUWSUfd8OHSJkBrpeVUWWfcHHk
 zLEmxNb2AETcJupAgoJOWOoSb59ggCKpCqmLezYoSZmlnb9qg6lhFbnWtkVC6q+p
 AREOfByoLIrJUtVh4Bmexo4nO5w3F84cfAV2WAmLMnXKjGnyFLGkqQvy8yXW0sA1
 hsZW+VjmoRaCG78M6OX+Sl3ok0V8i2AcOkguWPj+dkCPb8XLkxhGwjnqLziJ54Z5
 aFrtSzeZiNQOqy7b6cj6+x2KcWE5FhphAKjX/psEZrZNa0e6ZvNfby3yJ4TzNWaN
 eajtOtcq+Ec9IruWXt/WCsm0zYwW1HumUhga5QCHjQRRjOt36ua4QC02iCx2sYuX
 SBnsBCgQo1xxAta3uOMj2sG38lUwYoH0U5wlPsqrGh1nsbGbc49Ok7BYX/wWF8os
 CYnT5t2KR9yUvblV+dH9XTj2EwqgINMRYBW7uBjZqY9gq2v/RKrtQCjnedAAA+yx
 B6UUob/naV5VpXfhwpXiw2oJrQy/kqQuwOEcgY/a6cwmENd9RuwGXBiBI6hrAOzl
 ftxiUcByW8/hS13G04qJ7pGTACs4njteMvvg+Y68nWUmPWMZTio=
 =/gAC
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "One fix, one cleanup...

  Fix: Julia Lawall pointed out a null pointer dereference.

  Cleanup: Vlastimil Babka sent me a patch to remove some SLAB related
  code"

* tag 'for-linus-6.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  Julia Lawall reported this null pointer dereference, this should fix it.
  fs/orangefs: remove ORANGEFS_CACHE_CREATE_FLAGS
2024-03-18 12:15:19 -07:00
Linus Torvalds
c5d9ab85eb f2fs update for 6.9-rc1
In this round, there are a number of updates on mainly two areas: Zoned block
 device support and Per-file compression. For example, we've found several issues
 to support Zoned block device especially having large sections regarding to GC
 and file pinning used for Android devices. In compression side, we've fixed many
 corner race conditions that had broken the design assumption.
 
 Enhancement:
  - Support file pinning for Zoned block device having large section
  - Enhance the data recovery after sudden power cut on Zoned block device
  - Add more error injection cases to easily detect the kernel panics
  - add a proc entry show the entire disk layout
  - Improve various error paths paniced by BUG_ON in block allocation and GC
  - support SEEK_DATA and SEEK_HOLE for compression files
 
 Bug fix:
  - fix to avoid use-after-free issue in f2fs_filemap_fault
  - fix some race conditions to break the atomic write design assumption
  - fix to truncate meta inode pages forcely
  - resolve various per-file compression issues wrt the space management and
    compression policies
  - fix some swap-related bugs
 
 In addition, we removed deprecated codes such as io_bits and heap_allocation,
 and also fixed minor error handling routines with neat debugging messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmX4gS0ACgkQQBSofoJI
 UNLmgBAAg4mvbWjmJ5VbXs4zGLOgLRJYcY1sZRO5Ufg4LhWzoGRxL1Dru+TELw0t
 1Ck2EQvP91XZ5weA5AZOfWbxcijy4+8L3P8L7ohOShudfACci0wQsx6IaUUWWylC
 ILA4+DkovpZrlu6th12Gj9QAM6TN9gdy3V1VLT5O/KmE1x6Pekwp2hQoIvVJRH5L
 I3KxOf5fTe3oWLvEN6m7yCz/8qGqz8+w0ae90UG0fqi0wVEuZJ99zsVPnuhu6uBo
 riFm2A6ra0I/JqoPyqn2QM6ApItM867ULo9EoyQVgq56Q1w31ENOJXsU9N7N4Wxt
 olgujH1SijkWk9ni57iKtMhR68e3Rs+pVsuNFmJuOPq0HASoggB66QRrVvCgM9JG
 z3D//CB2ONtX2XiKJMiTcX9VqIqrMw6L1eVxEZu0P96C3CS70MoBU69mdSR9Og2S
 5nQXja3yzFhdk3thp6+wAJ3I04ZQkf3qoHZB+0chU2Xl1pV+5NIkBgBsSw8g/TY3
 EIHMfK+TX0SBSNCvkUDEJ+Z8ZRID6tcbAquTSsBr6wxB+F9mq7onEvI8O7xwyH9W
 DU8xhymOE2QUoluNtyW7ww6HK913ripXIenI9LaYJnuj0XeDAcMIoPsgR7AGU5UG
 hshvirFdUdWRMTfXxNNUrvhOWI0qurQSVx+VV6Qb62DGqR5ofOw=
 =Qpvy
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs update from Jaegeuk Kim:
 "In this round, there are a number of updates on mainly two areas:
  Zoned block device support and Per-file compression. For example,
  we've found several issues to support Zoned block device especially
  having large sections regarding to GC and file pinning used for
  Android devices. In compression side, we've fixed many corner race
  conditions that had broken the design assumption.

  Enhancements:
   - Support file pinning for Zoned block device having large section
   - Enhance the data recovery after sudden power cut on Zoned block
     device
   - Add more error injection cases to easily detect the kernel panics
   - add a proc entry show the entire disk layout
   - Improve various error paths paniced by BUG_ON in block allocation
     and GC
   - support SEEK_DATA and SEEK_HOLE for compression files

  Bug fixes:
   - avoid use-after-free issue in f2fs_filemap_fault
   - fix some race conditions to break the atomic write design
     assumption
   - fix to truncate meta inode pages forcely
   - resolve various per-file compression issues wrt the space
     management and compression policies
   - fix some swap-related bugs

  In addition, we removed deprecated codes such as io_bits and
  heap_allocation, and also fixed minor error handling routines with
  neat debugging messages"

* tag 'f2fs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (60 commits)
  f2fs: fix to avoid use-after-free issue in f2fs_filemap_fault
  f2fs: truncate page cache before clearing flags when aborting atomic write
  f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag
  f2fs: prevent atomic write on pinned file
  f2fs: fix to handle error paths of {new,change}_curseg()
  f2fs: unify the error handling of f2fs_is_valid_blkaddr
  f2fs: zone: fix to remove pow2 check condition for zoned block device
  f2fs: fix to truncate meta inode pages forcely
  f2fs: compress: fix reserve_cblocks counting error when out of space
  f2fs: compress: relocate some judgments in f2fs_reserve_compress_blocks
  f2fs: add a proc entry show disk layout
  f2fs: introduce SEGS_TO_BLKS/BLKS_TO_SEGS for cleanup
  f2fs: fix to check return value of f2fs_gc_range
  f2fs: fix to check return value __allocate_new_segment
  f2fs: fix to do sanity check in update_sit_entry
  f2fs: fix to reset fields for unloaded curseg
  f2fs: clean up new_curseg()
  f2fs: relocate f2fs_precache_extents() in f2fs_swap_activate()
  f2fs: fix blkofs_end correctly in f2fs_migrate_blocks()
  f2fs: ro: don't start discard thread for readonly image
  ...
2024-03-18 11:26:00 -07:00
Anand Jain
d565fffa68 btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.

This looks like the device name changed in the output of
/proc/self/mountinfo:

6.5-rc5 working

  18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...

6.7 not working:

  17 1 0:15 / / rw,noatime - btrfs /dev/root ...

and "update-grub" shows this error:

  /usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)

This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.

The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).

The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.

This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.

This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.

The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.

In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.

This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.

This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.

Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.

The fix was verified by users whose systems were affected.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-18 19:16:50 +01:00
Linus Torvalds
0d7ca657df overlayfs fixes for 6.9-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmX4AWYACgkQEVvVyTe/
 1WpUXA/4r0qYfDU3M1NYtBhcRcXe44WvfdMxUPgc5IqffHof058FWC6cRhRnKB9U
 R1Wb09xIguX2hWsseyq1MWBuoZUq3nWUNRfTkld0uUX1I97ggV70YqRY2SSUWqCq
 eLMYwR6e5Ar1/noMkHYMLupJknTJsyOOFTXZrAvl5aZDCKvAQUQoVWY772B71/4j
 odH7wpgVX4Ah6U5Zk3anEdn5TVcvl4XXW4U1o4kJzUjIgxI9ZyrBpN6EmoSgpoQj
 zRfbQ6OrT2FNAPc/y3Dwn9fAoUZT2F3A/R4dVddBAPyLY96Qj+p0fv+y+0RY9efV
 dcf0bvoj4xXT+9KDVVp2rWszmFnr0niu+dE6bCkzW/lC2Uz/NIghjKcgwOGIjqpb
 9o8CzqGOF6B07A7SoVtcQlZR+7iUn+SUPEIk8JpvDkDLCLBOqd6rc58MnlMwJwrs
 OsMKH8IQKb6bGO0wfxmeyWj8637k9dtc0cF16hHhCxlRoextU3guNw0uBlta29QQ
 J+QCkFR6F2xQH2UbKEpOTsQo+2x385mEByXZdSUGkCM7ABb3cictrIMXyWJMNoWq
 un5tnJ5d/E3N+FKc5r9UWGHguwdYKiucvx7NQC6lK0y3b8aZMyVioiCVYX+lPKHB
 vdSp6F5vBrL6PHVsD9oDjiqLl8gQk7eIc4ne6zs/T9M5h8yHSg==
 =cMqg
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs fixes from Amir Goldstein:
 "Only minor fixes:

   - Fix uncalled for WARN_ON from v6.8-rc1

   - Fix the overlayfs MAINTAINERS entry"

* tag 'ovl-fixes-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: relax WARN_ON in ovl_verify_area()
  MAINTAINERS: update overlayfs git tree
2024-03-18 11:15:58 -07:00
Linus Torvalds
0a7b0acece vfs-6.9-rc1.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZfglxgAKCRCRxhvAZXjc
 ovK9APsF7/TMFhNbtW+JsghSyrEk0cOVPizi8JkRDDWNW3qY+wEAxtydhbmWpbKq
 MpIjMHqwjPx3zXBL8Ec/b4vAoJqpJwQ=
 =NgvO
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes for this merge window:

   - Undo the hiding of silly-rename files in afs. If they're hidden
     they can't be deleted by rm manually anymore causing regressions

   - Avoid caching the preferred address for an afs server to avoid
     accidently overriding an explicitly specified preferred server
     address

   - Fix bad stat() and rmdir() interaction in afs

   - Take a passive reference on the superblock when opening a block
     device so the holder is available to concurrent callers from the
     block layer

   - Clear private data pointer in fscache_begin_operation() to avoid it
     being falsely treated as valid"

* tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fscache: Fix error handling in fscache_begin_operation()
  fs,block: get holder during claim
  afs: Fix occasional rmdir-then-VNOVNODE with generic/011
  afs: Don't cache preferred address
  afs: Revert "afs: Hide silly-rename files from userspace"
2024-03-18 09:15:50 -07:00
Steven Rostedt (Google)
c759e60903 tracing: Remove __assign_str_len()
Now that __assign_str() gets the length from the __string() (and
__string_len()) macros, there's no reason to have a separate
__assign_str_len() macro as __assign_str() can get the length of the
string needed.

Also remove __assign_rel_str() although it had no users anyway.

Link: https://lore.kernel.org/linux-trace-kernel/20240223152206.0b650659@gandalf.local.home

Cc: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:05 -04:00
Steven Rostedt (Google)
9388a2aa45 NFSD: Fix nfsd_clid_class use of __string_len() macro
I'm working on restructuring the __string* macros so that it doesn't need
to recalculate the string twice. That is, it will save it off when
processing __string() and the __assign_str() will not need to do the work
again as it currently does.

Currently __string_len(item, src, len) doesn't actually use "src", but my
changes will require src to be correct as that is where the __assign_str()
will get its value from.

The event class nfsd_clid_class has:

  __string_len(name, name, clp->cl_name.len)

But the second "name" does not exist and causes my changes to fail to
build. That second parameter should be: clp->cl_name.data.

Link: https://lore.kernel.org/linux-trace-kernel/20240222122828.3d8d213c@gandalf.local.home

Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: stable@vger.kernel.org
Fixes: d27b74a867 ("NFSD: Use new __string_len C macros for nfsd_clid_class")
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:17:41 -04:00
David Howells
449ac55146
fscache: Fix error handling in fscache_begin_operation()
Fix fscache_begin_operation() to clear cres->cache_priv on error, otherwise
fscache_resources_valid() will report it as being valid.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3933237.1710514106@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-18 10:33:48 +01:00
Christian Brauner
59a55a63c2
fs,block: get holder during claim
Now that we open block devices as files we need to deal with the
realities that closing is a deferred operation. An operation on the
block device such as e.g., freeze, thaw, or removal that runs
concurrently with umount, tries to acquire a stable reference on the
holder. The holder might already be gone though. Make that reliable by
grabbing a passive reference to the holder during bdev_open() and
releasing it during bdev_release().

Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com
Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-18 10:32:44 +01:00
Kent Overstreet
b38114dde0 bcachefs: ratelimit errors from async_btree_node_rewrite
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 00:24:24 -04:00
Kent Overstreet
8d347a5545 bcachefs: Run check_topology() first
check_topology() doesn't actually require alloc info - and running it
first means other passes don't have to catch btree read errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 00:24:24 -04:00
Kent Overstreet
3ed94062e3 bcachefs: Improve bch2_fatal_error()
error messages should always include __func__

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 00:24:24 -04:00
Kent Overstreet
ec35b30481 bcachefs: Fix lost transaction restart error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-18 00:24:23 -04:00
Kent Overstreet
a586036841 bcachefs: Don't corrupt journal keys gap buffer when dropping alloc info
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 21:17:38 -04:00
Kent Overstreet
f3589bfa7e bcachefs: fix for building in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Kent Overstreet
1c31b83a4e bcachefs: bch2_snapshot_is_ancestor() now safe to call in early recovery
this fixes an assertion pop in
  bch2_check_snapshot_trees() ->
  check_snapshot_tree() ->
  bch2_snapshot_tree_master_subvol() ->
  bch2_snapshot_is_ancestor()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Kent Overstreet
1ba6f48f09 bcachefs: Fix nested transaction restart handling in bch2_bucket_gens_init()
Nested transaction restart handling is typically best avoided; when the
inner context handles a transaction restart it invalidates the outer
transaction context, so we need to make sure to return a
transaction_restart_nested error.

This code wasn't doing that, and hit the assertion in
for_each_btree_key() that checks for that via trans->restart_count.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Kent Overstreet
3becdd4850 bcachefs: Improve sysfs internal/btree_updates
Print out the function that launched the btree update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Kent Overstreet
a0a466ea98 bcachefs: Split out btree_node_rewrite_worker
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Kent Overstreet
37bb9c9572 bcachefs: Fix locking in bch2_alloc_write_key()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:11 -04:00
Kent Overstreet
264b501f8f bcachefs: Avoid extent entry type assertions in .invalid()
After keys have passed bkey_ops.key_invalid we should never see invalid
extent entry types - but .key_invalid itself needs to cope with them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:11 -04:00
Kent Overstreet
109ea419cf bcachefs: Fix spurious -BCH_ERR_transaction_restart_nested
We only need to return transaction_restart_nested when we're inside a
context that's handling transaction restarts.

Also, add a missing check_subdir_count() call.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:11 -04:00
Kent Overstreet
3ff3475611 bcachefs: Fix check_key_has_snapshot() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:11 -04:00
Kent Overstreet
62f35024b2 bcachefs: Change "accounting overran journal reservation" to a warning
This doesn't need to be a BUG_ON(); the actual serious "things break"
condition is if the whole journal write overruns the available space,
and that has a fatal error, not a BUG_ON(). This check indicates we
screwed something up, but it should be a warning.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:11 -04:00
Namjae Jeon
a80a486d72 ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16()
If ->NameOffset of smb2_create_req is smaller than Buffer offset of
smb2_create_req, slab-out-of-bounds read can happen from smb2_open.
This patch set the minimum value of the name offset to the buffer offset
to validate name length of smb2_create_req().

Cc: stable@vger.kernel.org
Reported-by: Xuanzhe Yu <yuxuanzhe@outlook.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-17 12:09:01 -05:00
Amir Goldstein
77a28aa476 ovl: relax WARN_ON in ovl_verify_area()
syzbot hit an assertion in copy up data loop which looks like it is
the result of a lower file whose size is being changed underneath
overlayfs.

This type of use case is documented to cause undefined behavior, so
returning EIO error for the copy up makes sense, but it should not be
causing a WARN_ON assertion.

Reported-and-tested-by: syzbot+3abd99031b42acf367ef@syzkaller.appspotmail.com
Fixes: ca7ab48240 ("ovl: add permission hooks outside of do_splice_direct()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-03-17 15:59:41 +02:00
Steven Rostedt (Google)
c3137ab631 eventfs: Create eventfs_root_inode to store dentry
Only the root "events" directory stores a dentry. There's no reason to
hold a dentry pointer for every eventfs_inode as it is never set except
for the root "events" eventfs_inode.

Create a eventfs_root_inode structure that holds the events_dir dentry.
The "events" eventfs_inode *is* special, let it have its own descriptor.

Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.658992558@goodmis.org

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:52 -04:00
Steven Rostedt (Google)
04204cd9b0 eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()
There's a couple of if statements in eventfs_root_lookup() that should
never be true. Instead of removing them, add WARN_ON_ONCE() around them.

  One is a tracefs_inode not being for eventfs.

  The other is a child being freed but still on the parent's children
  list. When a child is freed, it is removed from the list under the
  same mutex that is held during the iteration.

Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
Link: https://lore.kernel.org/linux-trace-kernel/20240201123346.724afa46@gandalf.local.home

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:52 -04:00
Linus Torvalds
c1f10ac840 NFS client updates for Linux 6.9
Highlights include:
 
 Bugfixes:
  - Fix for an Oops in the NFSv4.2 listxattr handler
  - Correct an incorrect buffer size in listxattr
  - Fix for an Oops in the pNFS flexfiles layout
  - Fix a refcount leak in NFS O_DIRECT writes
  - Fix missing locking in NFS O_DIRECT
  - Avoid an infinite loop in pnfs_update_layout
  - Fix an overflow in the RPC waitqueue queue length counter
  - Ensure that pNFS I/O is also protected by TLS when xprtsec
    is specified by the mount options
  - Fix a leaked folio lock in the netfs read code
  - Fix a potential deadlock in fscache
  - Allow setting the fscache uniquifier in NFSv4
  - Fix an off by one in root_nfs_cat()
  - Fix another off by one in rpc_sockaddr2uaddr()
  - nfs4_do_open() can incorrectly trigger state recovery.
  - Various fixes for connection shutdown
 
 Features and cleanups:
  - Ensure that containers only see their own RPC and NFS stats
  - Enable nconnect for RDMA
  - Remove dead code from nfs_writepage_locked()
  - Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO, and
    mount options.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmX14K0ACgkQZwvnipYK
 APLCeg/7Bdah7158TdNxSQAHPo3jzDqZmc933eZC0H8C9whNlu6XIa9fyT6ZrsQr
 qkQ/ztSwsB6yp6vLPSnVdDh5KsndwrInTB874H8y6+8x+KwwuhSQ7Uy8epg5wrO0
 kgiaRYSH7HB7EgUdNY14fHNXkA/DMLHz1F1aw2NVGCYmVCMg7kGV4wYCOH6bI2Ea
 Wu8amZce6D1AbktbdSZcEz2ricR3lGXjCUPMnzRCaSpUmdd2t7d/rsnjTeKU1gb4
 p9zLlOZs9Xe2vMT0ZQI8SEI+Scze82LBy7ykSKyhOjOt4AurVpzQFAvK+3dFZoIq
 lzIHJwabBGNui26CR1k90ZqERLkkk+24i3ccT28HwhTqe5eM/qDCKOVQmuP0F1F8
 QYsnIM+NnmPZveSGAMdOQwlGFQTyJbT5Na1blHTW2R2rjqBzgvfn8fR0vV4L5P7B
 0J8ShmZKVkvb7mtJJhaaI4LF41ciCF8+I5zwpnYQi0tsX370XPNNFbzS3BmPUVFL
 k0uEMVfNy69PkaH4DJWQT9GoE3qiAamkO+EdAlPad6b8QMdJJZxXOmaUzL8YsCHV
 sX5ugsih/Hf5/+QFBCbHEy7G3oeeHsT80yO8nvGT+yy94bv4F+WcM/tviyRbKrls
 t5audBDNRfrAeUlqAQkXfFmAyqP2CGNr29oL62cXL2muFG7d7ys=
 =5n+X
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Bugfixes:
   - Fix for an Oops in the NFSv4.2 listxattr handler
   - Correct an incorrect buffer size in listxattr
   - Fix for an Oops in the pNFS flexfiles layout
   - Fix a refcount leak in NFS O_DIRECT writes
   - Fix missing locking in NFS O_DIRECT
   - Avoid an infinite loop in pnfs_update_layout
   - Fix an overflow in the RPC waitqueue queue length counter
   - Ensure that pNFS I/O is also protected by TLS when xprtsec is
     specified by the mount options
   - Fix a leaked folio lock in the netfs read code
   - Fix a potential deadlock in fscache
   - Allow setting the fscache uniquifier in NFSv4
   - Fix an off by one in root_nfs_cat()
   - Fix another off by one in rpc_sockaddr2uaddr()
   - nfs4_do_open() can incorrectly trigger state recovery
   - Various fixes for connection shutdown

  Features and cleanups:
   - Ensure that containers only see their own RPC and NFS stats
   - Enable nconnect for RDMA
   - Remove dead code from nfs_writepage_locked()
   - Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO,
     and mount options"

* tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits)
  nfs: fix panic when nfs4_ff_layout_prepare_ds() fails
  NFS: trace the uniquifier of fscache
  NFS: Read unlock folio on nfs_page_create_from_folio() error
  NFS: remove unused variable nfs_rpcstat
  nfs: fix UAF in direct writes
  nfs: properly protect nfs_direct_req fields
  NFS: enable nconnect for RDMA
  NFSv4: nfs4_do_open() is incorrectly triggering state recovery
  NFS: avoid infinite loop in pnfs_update_layout.
  NFS: remove sync_mode test from nfs_writepage_locked()
  NFSv4.1/pnfs: fix NFS with TLS in pnfs
  NFS: Fix an off by one in root_nfs_cat()
  nfs: make the rpc_stat per net namespace
  nfs: expose /proc/net/sunrpc/nfs in net namespaces
  sunrpc: add a struct rpc_stats arg to rpc_create_args
  nfs: remove unused NFS_CALL macro
  NFSv4.1: add tracepoint to trunked nfs4_exchange_id calls
  NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt
  SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned int
  nfs: fix regression in handling of fsc= option in NFSv4
  ...
2024-03-16 11:44:00 -07:00
Josef Bacik
719fcafe07 nfs: fix panic when nfs4_ff_layout_prepare_ds() fails
We've been seeing the following panic in production

BUG: kernel NULL pointer dereference, address: 0000000000000065
PGD 2f485f067 P4D 2f485f067 PUD 2cc5d8067 PMD 0
RIP: 0010:ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
Call Trace:
 <TASK>
 ? __die+0x78/0xc0
 ? page_fault_oops+0x286/0x380
 ? __rpc_execute+0x2c3/0x470 [sunrpc]
 ? rpc_new_task+0x42/0x1c0 [sunrpc]
 ? exc_page_fault+0x5d/0x110
 ? asm_exc_page_fault+0x22/0x30
 ? ff_layout_free_layoutreturn+0x110/0x110 [nfs_layout_flexfiles]
 ? ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
 ? ff_layout_cancel_io+0x6f/0x90 [nfs_layout_flexfiles]
 pnfs_mark_matching_lsegs_return+0x1b0/0x360 [nfsv4]
 pnfs_error_mark_layout_for_return+0x9e/0x110 [nfsv4]
 ? ff_layout_send_layouterror+0x50/0x160 [nfs_layout_flexfiles]
 nfs4_ff_layout_prepare_ds+0x11f/0x290 [nfs_layout_flexfiles]
 ff_layout_pg_init_write+0xf0/0x1f0 [nfs_layout_flexfiles]
 __nfs_pageio_add_request+0x154/0x6c0 [nfs]
 nfs_pageio_add_request+0x26b/0x380 [nfs]
 nfs_do_writepage+0x111/0x1e0 [nfs]
 nfs_writepages_callback+0xf/0x30 [nfs]
 write_cache_pages+0x17f/0x380
 ? nfs_pageio_init_write+0x50/0x50 [nfs]
 ? nfs_writepages+0x6d/0x210 [nfs]
 ? nfs_writepages+0x6d/0x210 [nfs]
 nfs_writepages+0x125/0x210 [nfs]
 do_writepages+0x67/0x220
 ? generic_perform_write+0x14b/0x210
 filemap_fdatawrite_wbc+0x5b/0x80
 file_write_and_wait_range+0x6d/0xc0
 nfs_file_fsync+0x81/0x170 [nfs]
 ? nfs_file_mmap+0x60/0x60 [nfs]
 __x64_sys_fsync+0x53/0x90
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Inspecting the core with drgn I was able to pull this

  >>> prog.crashed_thread().stack_trace()[0]
  #0 at 0xffffffffa079657a (ff_layout_cancel_io+0x3a/0x84) in ff_layout_cancel_io at fs/nfs/flexfilelayout/flexfilelayout.c:2021:27
  >>> prog.crashed_thread().stack_trace()[0]['idx']
  (u32)1
  >>> prog.crashed_thread().stack_trace()[0]['flseg'].mirror_array[1].mirror_ds
  (struct nfs4_ff_layout_ds *)0xffffffffffffffed

This is clear from the stack trace, we call nfs4_ff_layout_prepare_ds()
which could error out initializing the mirror_ds, and then we go to
clean it all up and our check is only for if (!mirror->mirror_ds).  This
is inconsistent with the rest of the users of mirror_ds, which have

  if (IS_ERR_OR_NULL(mirror_ds))

to keep from tripping over this exact scenario.  Fix this up in
ff_layout_cancel_io() to make sure we don't panic when we get an error.
I also spot checked all the other instances of checking mirror_ds and we
appear to be doing the correct checks everywhere, only unconditionally
dereferencing mirror_ds when we know it would be valid.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Fixes: b739a5bd9d ("NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-16 12:41:45 -04:00
David Teigland
c53309b912 dlm: add comments about forced waiters reset
When a lock is waiting for a reply for a remote operation, and recovery
interrupts this "waiters" state, the remote operation is voided by the
recovery, and no reply will be processed.  The lkb waiters state for the
remote operation is forcibly reset/cleared, so that the lock operation
can be restarted after recovery.  Improve the comments describing this.

Signed-off-by: David Teigland <teigland@redhat.com>
2024-03-15 17:11:39 -05:00
David Teigland
484b4f90c2 dlm: revert atomic_t lkb_wait_count
Revert "fs: dlm: handle lkb wait count as atomic_t"
This reverts commit 75a7d60134.

This counter does not need to be atomic.  As the comment in
the reverted commit mentions, the counter is protected by
the rsb lock.

Signed-off-by: David Teigland <teigland@redhat.com>
2024-03-15 14:39:28 -05:00
Linus Torvalds
c442a42363 fs/9p changes for the 6.9 merge window
This pull request includes a number of patches
 addressing improvements in the cache portions of the 9p
 client.
 
 The biggest improvements have to do with fixing handling
 of inodes and eliminating duplicate structures and unnecessary
 allocation/release of inode structures and many associated
 unnecessary protocol traffic.  This also dramatically
 reduced code complexity across the code and sets us up to add
 proper temporal cache capabilities.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmX0XX0ACgkQiP/V+0pf
 /5jdgg//UCY+RTB9UNr/B1I5sj1m36l0BfFB/jaAMApH5/bQPF2mgtjD1jQL2qk2
 gbAvR25erDdKiggRAbNU6OGVBIKXHiUPnL5d6UNgeyQQu1wSzStCejTfoQ/mwgTl
 1N2Uge63zz5L0XgJm31gPcKSlJAD1r70UgsPw0ldkimEPqiBQjQu5VQqUWKgrkBe
 hcdXbuZbmXBLfZv9i0XI9BmwIeaSRwY2qx/JUr2eCoAXrDRNfwBRVMAfqpJ6L6I0
 56C5Q2EDjipUnYp1Z3fqg80IUa+MnoF3rbMyMj54JBWxAPyMspamUBltYlZb5pRv
 zvjSJk3Uof8Wzc+beVVbyV2NmC6gC/cm2MnSYZBTX1KWTMl46eiloSrFz+Pv1/O1
 8rcZFQuw8uAox9/4qc1lPCAC0hZVUFelj8NpCPN8aHS3cF0eN/MCQ3qfBkFgo8um
 jMYTUiai1tjdxWUWrUSpSKFHzBcL986+1bnE7qwI3qPXrxrnB9kOhzSilDJSuHnV
 Ei6zinvMEUBmKYLdNaCQ9N1KTKPr0bqXIlEvk4LDaGLGkZbI7eKuN5mCxZB0DXCc
 65WSc5G9NYux9QjobR+TUwwLdiQH84w/3zYS41r/as7VqABcj/X/2b2ZfQ6/WBuF
 I0CmSWz2gfcpQQc642dzZJk7UP4wOVtsW3F2JAQ1GPq/D5UOmWU=
 =1LU1
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p updates from Eric Van Hensbergen:
 "This includes a number of patches addressing improvements in the cache
  portions of the 9p client.

  The biggest improvements have to do with fixing handling of inodes and
  eliminating duplicate structures and unnecessary allocation/release of
  inode structures and many associated unnecessary protocol traffic.
  This also dramatically reduced code complexity across the code and
  sets us up to add proper temporal cache capabilities"

* tag '9p-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: fix dups even in uncached mode
  fs/9p: simplify iget to remove unnecessary paths
  fs/9p: rework qid2ino logic
  fs/9p: Eliminate now unused v9fs_get_inode
  fs/9p: Eliminate redundant non-cache path in mknod
  fs/9p: remove walk and inode allocation from symlink
  fs/9p: convert mkdir to use get_new_inode
  fs/9p: switch vfsmount to use v9fs_get_new_inode
2024-03-15 10:10:31 -07:00
Linus Torvalds
6ce8b2ce0d fuse update for 6.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZfLjeQAKCRDh3BK/laaZ
 PBYQAQDqYZzq91Kn5jdvjaSd+6I/+x7MDLOIP5hPX0HJLuBxWAEAqENoo4Of0GTC
 ltW7DKrQy9E3CMp6VKSLVJPN4BYP9gk=
 =GvOE
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Add passthrough mode for regular file I/O.

   This allows performing read and write (also via memory maps) on a
   backing file without incurring the overhead of roundtrips to
   userspace. For now this is only allowed to privileged servers, but
   this limitation will go away in the future (Amir Goldstein)

 - Fix interaction of direct I/O mode with memory maps (Bernd Schubert)

 - Export filesystem tags through sysfs for virtiofs (Stefan Hajnoczi)

 - Allow resending queued requests for server crash recovery (Zhao Chen)

 - Misc fixes and cleanups

* tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (38 commits)
  fuse: get rid of ff->readdir.lock
  fuse: remove unneeded lock which protecting update of congestion_threshold
  fuse: Fix missing FOLL_PIN for direct-io
  fuse: remove an unnecessary if statement
  fuse: Track process write operations in both direct and writethrough modes
  fuse: Use the high bit of request ID for indicating resend requests
  fuse: Introduce a new notification type for resend pending requests
  fuse: add support for explicit export disabling
  fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode()
  fuse: fix typo for fuse_permission comment
  fuse: Convert fuse_writepage_locked to take a folio
  fuse: Remove fuse_writepage
  virtio_fs: remove duplicate check if queue is broken
  fuse: use FUSE_ROOT_ID in fuse_get_root_inode()
  fuse: don't unhash root
  fuse: fix root lookup with nonzero generation
  fuse: replace remaining make_bad_inode() with fuse_make_bad()
  virtiofs: drop __exit from virtio_fs_sysfs_exit()
  fuse: implement passthrough for mmap
  fuse: implement splice read/write passthrough
  ...
2024-03-15 09:47:14 -07:00
Linus Torvalds
68bf6bfdcf Ext4 bug fixes and cleanups for 6.9-rc1, plus some additional kunit
tests.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmXydHkACgkQ8vlZVpUN
 gaPFcQf/e1DcEw7dITXoOJ16Sz3pI3ykFEae3aIp1C0DoBL6ncnx4NrKJlbKVmfG
 CvYwwaPIILps0W5gwRll0wG8G9wrx+QY+xx5elsFKlfLsiRmkvXEIFPELYvtblcG
 u6fXumpArtH2dbjsmxw+gxEuborl3aeOIWW62dVvarEpfdvFlEwMAfBYlJ/E4HKM
 z74CmR09sr51XuQZTKaUNioyS6qNR/HIBoelJ50Xt6qLZrpfyIxtU/wHbN1GAM5+
 pBXCYxlBaiSJHb8p9R99DT5JqVD5zwrqWscbajEhOJo4QQQacJGGvIOHz6b6FMRV
 +dPnTBh79t8DAktqT6LAf83bmiRCWQ==
 =4/t9
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Ext4 bug fixes and cleanups, plus some additional kunit tests"

* tag 'ext4_for_linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: initialize sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter before use in kunit test
  ext4: hold group lock in ext4 kunit test
  ext4: alloc test super block from sget
  ext4: kunit: use dynamic inode allocation
  ext4: enable meta_bg only when new desc blocks are needed
  ext4: remove unused parameter biop in ext4_issue_discard()
  ext4: remove SLAB_MEM_SPREAD flag usage
  ext4: verify s_clusters_per_group even without bigalloc
  ext4: fix corruption during on-line resize
  ext4: don't report EOPNOTSUPP errors from discard
  ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()
  ext4: fold quota accounting into ext4_xattr_inode_lookup_create()
  ext4: correct best extent lstart adjustment logic
  ext4: forbid commit inconsistent quota data when errors=remount-ro
  ext4: add a hint for block bitmap corrupt state in mb_groups
  ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks()
  ext4: improve error msg for ext4_mb_seq_groups_show
  ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show
  ext4: Add unit test for ext4_mb_mark_diskspace_used
  ext4: Add unit test for mb_free_blocks
  ...
2024-03-15 09:20:30 -07:00
Linus Torvalds
32a50540c3 bcachefs updates for 6.9
- Subvolume children btree; this is needed for providing a userspace
    interface for walking subvolumes, which will come later
  - Lots of improvements to directory structure checking
  - Improved journal pipelining, significantly improving performance on
    high iodepth write workloads
  - Discard path improvements: the discard path is more efficient, and no
    longer flushes the journal unnecessarily
  - Buffered write path can now avoid taking the inode lock
  - new mm helper: memalloc_flags_{save|restore}
  - mempool now does kvmalloc mempools
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXycEcACgkQE6szbY3K
 bnYUTg/+K4Nv2EdAqOCyHRTKaF2OgJDUb25ZDmbGpfT1XyPrNB7/+CxHqSdEP7/e
 FVuhtP61vnQAImDv82u9iZiab/TnuCZPUrjSobFEvrWYoGRtP9Bm9MyYB28NzmMa
 AXGmS4yJGVwtxxrFNxZP98IbiHYiHSoYbkqxX2E5VgLag8Ru8peb7oD0Ro3zw0rb
 z+6UM/seJ7on5i/9IJEMKKXFVEoZC2J5DAVoe1TghG2kgOw3cKu5OUdltLPOY5jL
 jkm5J5wa6Ep46nufHat92yiMxXIQrf4U9LkXxzTi5ThoSmt+Af2qXcBjqTTVqd2D
 1dGxj+UG8iu4DCCbQC6EA7J5EMvxfJM0+9lk1ULUgxUs3X69co6nlI6XH1fwEMqk
 KpIqd35+Y/IYgogt9ioXI0dtXyL7dbaTVt6NZhc9SaPGPX+C2V0+l4bqToFdNaPH
 0KATjjyQaJRE4ZFIjr6GliYOtKWDLi/HPEyoBivniUn7cF5vjSvti+cSQwNDSPpa
 6jOd5Y923Iq9ZqDAPM3+mvTH8nNaaf2T2fmbPNrc5pdWbha9bGwOU71zvKHNFGm/
 66ZsnwhKSk+uwglTMZHPKSkJJXUYAHESw3slQtEWHZVlliArc55+pBHwE00bvRt7
 KHUUqkqXBUPzbp/kdZGylMAdH9+8j9TE5QJ2RaoryFm/eCfexmI=
 =6xnj
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs updates from Kent Overstreet:

 - Subvolume children btree; this is needed for providing a userspace
   interface for walking subvolumes, which will come later

 - Lots of improvements to directory structure checking

 - Improved journal pipelining, significantly improving performance on
   high iodepth write workloads

 - Discard path improvements: the discard path is more efficient, and no
   longer flushes the journal unnecessarily

 - Buffered write path can now avoid taking the inode lock

 - new mm helper: memalloc_flags_{save|restore}

 - mempool now does kvmalloc mempools

* tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits)
  bcachefs: time_stats: shrink time_stat_buffer for better alignment
  bcachefs: time_stats: split stats-with-quantiles into a separate structure
  bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet
  bcachefs: time_stats: add larger units
  bcachefs: pull out time_stats.[ch]
  bcachefs: reconstruct_alloc cleanup
  bcachefs: fix bch_folio_sector padding
  bcachefs: Fix btree key cache coherency during replay
  bcachefs: Always flush write buffer in delete_dead_inodes()
  bcachefs: Fix order of gc_done passes
  bcachefs: fix deletion of indirect extents in btree_gc
  bcachefs: Prefer struct_size over open coded arithmetic
  bcachefs: Kill unused flags argument to btree_split()
  bcachefs: Check for writing superblocks with nonsense member seq fields
  bcachefs: fix bch2_journal_buf_to_text()
  lib/generic-radix-tree.c: Make nodes more reasonably sized
  bcachefs: copy_(to|from)_user_errcode()
  bcachefs: Split out bkey_types.h
  bcachefs: fix lost journal buf wakeup due to improved pipelining
  bcachefs: intercept mountoption value for bool type
  ...
2024-03-15 09:00:09 -07:00
Dave Chinner
0c6ca06aad xfs: quota radix tree allocations need to be NOFS on insert
In converting the XFS code from GFP_NOFS to scoped contexts, we
converted the quota radix tree to GFP_KERNEL. Unfortunately, it was
not clearly documented that this set was because there is a
dependency on the quotainfo->qi_tree_lock being taken in memory
reclaim to remove dquots from the radix tree.

In hindsight this is obvious, but the radix tree allocations on
insert are not immediately obvious, and we avoid this for the inode
cache radix trees by using preloading and hence completely avoiding
the radix tree node allocation under tree lock constraints.

Hence there are a few solutions here. The first is to reinstate
GFP_NOFS for the radix tree and add a comment explaining why
GFP_NOFS is used. The second is to use memalloc_nofs_save() on the
radix tree insert context, which makes it obvious that the radix
tree insert runs under GFP_NOFS constraints. The third option is to
simply replace the radix tree and it's lock with an xarray which can
do memory allocation safely in an insert context.

The first is OK, but not really the direction we want to head. The
second is my preferred short term solution. The third - converting
XFS radix trees to xarray - is the longer term solution.

Hence to fix the regression here, we take option 2 as it moves us in
the direction we want to head with memory allocation and GFP_NOFS
removal.

Reported-by: syzbot+8fdff861a781522bda4d@syzkaller.appspotmail.com
Reported-by: syzbot+d247769793ec169e4bf9@syzkaller.appspotmail.com
Fixes: 94a69db236 ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15 10:30:23 +05:30
Darrick J. Wong
215b2bf72a xfs: fix dev_t usage in xmbuf tracepoints
Fix some inconsistencies in the xmbuf tracepoints -- they should be
reporting the major/minor of the filesystem that they're associated
with, so that we have some clue on whose behalf the xmbuf was created.
Fix the xmbuf_free tracepoint to report the same.

Don't call the trace function until the xmbuf is fully initialized.

Fixes: 5076a6040c ("xfs: support in-memory buffer cache target")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15 10:30:23 +05:30
Linus Torvalds
e5eb28f6d1 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
 
 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".
 
 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits.  The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".
 
 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".
 
 - Ryusuke Konishi continues nilfs2 maintenance work in the series
 
 	"nilfs2: eliminate kmap and kmap_atomic calls"
 	"nilfs2: fix kernel bug at submit_bh_wbc()"
 
 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".
 
 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".
 
 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".
 
 Plus the usual shower of singleton patches in various parts of the tree.
 Please see the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA
 jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh
 tKVmvihFxfAhdDthseXcIf1nBjMALwY=
 =8rVc
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
   heap optimizations".

 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".

 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits. The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".

 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".

 - Ryusuke Konishi continues nilfs2 maintenance work in the series

	"nilfs2: eliminate kmap and kmap_atomic calls"
	"nilfs2: fix kernel bug at submit_bh_wbc()"

 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".

 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".

 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".

Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.

* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  nilfs2: prevent kernel bug at submit_bh_wbc()
  nilfs2: fix failure to detect DAT corruption in btree and direct mappings
  ocfs2: enable ocfs2_listxattr for special files
  ocfs2: remove SLAB_MEM_SPREAD flag usage
  assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
  buildid: use kmap_local_page()
  watchdog/core: remove sysctl handlers from public header
  nilfs2: use div64_ul() instead of do_div()
  mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
  kexec: copy only happens before uchunk goes to zero
  get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
  get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
  get_signal: don't abuse ksig->info.si_signo and ksig->sig
  const_structs.checkpatch: add device_type
  Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
  dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
  list: leverage list_is_head() for list_entry_is_head()
  nilfs2: MAINTAINERS: drop unreachable project mirror site
  smp: make __smp_processor_id() 0-argument macro
  fat: fix uninitialized field in nostale filehandles
  ...
2024-03-14 18:03:09 -07:00
Johannes Thumshirn
74098a989b btrfs: zoned: use zone aware sb location for scrub
At the moment scrub_supers() doesn't grab the super block's location via
the zoned device aware btrfs_sb_log_location() but via btrfs_sb_offset().

This leads to checksum errors on 'scrub' as we're not accessing the
correct location of the super block.

So use btrfs_sb_log_location() for getting the super blocks location on
scrub.

Reported-by: WA AM <waautomata@gmail.com>
Link: http://lore.kernel.org/linux-btrfs/CANU2Z0EvUzfYxczLgGUiREoMndE9WdQnbaawV5Fv5gNXptPUKw@mail.gmail.com
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-15 02:00:26 +01:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Colin Ian King
e758fa6956 ksmbd: Fix spelling mistake "connction" -> "connection"
There is a spelling mistake in a ksmbd_debug debug message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 18:51:32 -05:00
Marios Makassikis
5fb282ba4f ksmbd: fix possible null-deref in smb_lazy_parent_lease_break_close
rcu_dereference can return NULL, so make sure we check against that.

Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 18:51:32 -05:00
Bharath SM
2760161d14 cifs: remove redundant variable assignment
This removes an unnecessary variable assignment. The assigned
value will be overwritten by cifs_fattr_to_inode before it
is accessed, making the line redundant.

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:27 -05:00
Meetakshi Setiya
fc20c52321 cifs: fixes for get_inode_info
Fix potential memory leaks, add error checking, remove unnecessary
initialisation of status_file_deleted and do not use cifs_iget() to get
inode in reparse_info_to_fattr since fattrs may not be fully set.

Fixes: ffceb7640c ("smb: client: do not defer close open handles to deleted files")
Reported-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:27 -05:00
Eugene Korenevsky
f1b8224b4e cifs: open_cached_dir(): add FILE_READ_EA to desired access
Since smb2_query_eas() reads EA and uses cached directory,
open_cached_dir() should request FILE_READ_EA access.

Otherwise listxattr() and getxattr() will fail with EACCES
(0xc0000022 STATUS_ACCESS_DENIED SMB status).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218543
Cc: stable@vger.kernel.org
Signed-off-by: Eugene Korenevsky <ekorenevsky@astralinux.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:27 -05:00
Shyam Prasad N
16a57d7681 cifs: reduce warning log level for server not advertising interfaces
Several users have reported this log getting dumped too regularly to
kernel log. The likely root cause has been identified, and it suggests
that this situation is expected for some configurations
(for example SMB2.1).

Since the function returns appropriately even for such cases, it is
fairly harmless to make this a debug log. When needed, the verbosity
can be increased to capture this log.

Cc: stable@vger.kernel.org
Reported-by: Jan Čermák <sairon@sairon.cz>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:27 -05:00
Shyam Prasad N
13c0a74747 cifs: make sure server interfaces are requested only for SMB3+
Some code paths for querying server interfaces make a false
assumption that it will only get called for SMB3+. Since this
function now can get called from a generic code paths, the correct
thing to do is to have specific handler for this functionality
per SMB dialect, and call this handler.

This change adds such a handler and implements this handler only
for SMB 3.0 and 3.1.1.

Cc: stable@vger.kernel.org
Cc: Jan Čermák <sairon@sairon.cz>
Reported-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:27 -05:00
Bharath SM
dc528770ed cifs: defer close file handles having RH lease
Previously we only deferred closing file handles with RHW
lease. To enhance performance benefits from deferred closes,
we now include handles with RH leases as well.

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-03-14 13:54:16 -05:00
Ryusuke Konishi
269cdf353b nilfs2: prevent kernel bug at submit_bh_wbc()
Fix a bug where nilfs_get_block() returns a successful status when
searching and inserting the specified block both fail inconsistently.  If
this inconsistent behavior is not due to a previously fixed bug, then an
unexpected race is occurring, so return a temporary error -EAGAIN instead.

This prevents callers such as __block_write_begin_int() from requesting a
read into a buffer that is not mapped, which would cause the BUG_ON check
for the BH_Mapped flag in submit_bh_wbc() to fail.

Link: https://lkml.kernel.org/r/20240313105827.5296-3-konishi.ryusuke@gmail.com
Fixes: 1f5abe7e7d ("nilfs2: replace BUG_ON and BUG calls triggerable from ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-14 09:17:30 -07:00
Ryusuke Konishi
f2f26b4a84 nilfs2: fix failure to detect DAT corruption in btree and direct mappings
Patch series "nilfs2: fix kernel bug at submit_bh_wbc()".

This resolves a kernel BUG reported by syzbot.  Since there are two
flaws involved, I've made each one a separate patch.

The first patch alone resolves the syzbot-reported bug, but I think
both fixes should be sent to stable, so I've tagged them as such.


This patch (of 2):

Syzbot has reported a kernel bug in submit_bh_wbc() when writing file data
to a nilfs2 file system whose metadata is corrupted.

There are two flaws involved in this issue.

The first flaw is that when nilfs_get_block() locates a data block using
btree or direct mapping, if the disk address translation routine
nilfs_dat_translate() fails with internal code -ENOENT due to DAT metadata
corruption, it can be passed back to nilfs_get_block().  This causes
nilfs_get_block() to misidentify an existing block as non-existent,
causing both data block lookup and insertion to fail inconsistently.

The second flaw is that nilfs_get_block() returns a successful status in
this inconsistent state.  This causes the caller __block_write_begin_int()
or others to request a read even though the buffer is not mapped,
resulting in a BUG_ON check for the BH_Mapped flag in submit_bh_wbc()
failing.

This fixes the first issue by changing the return value to code -EINVAL
when a conversion using DAT fails with code -ENOENT, avoiding the
conflicting condition that leads to the kernel bug described above.  Here,
code -EINVAL indicates that metadata corruption was detected during the
block lookup, which will be properly handled as a file system error and
converted to -EIO when passing through the nilfs2 bmap layer.

Link: https://lkml.kernel.org/r/20240313105827.5296-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240313105827.5296-2-konishi.ryusuke@gmail.com
Fixes: c3a7abf06c ("nilfs2: support contiguous lookup of blocks")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+cfed5b56649bddf80d6e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cfed5b56649bddf80d6e
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-14 09:17:29 -07:00
Su Yue
41e296f69f ocfs2: enable ocfs2_listxattr for special files
For special files in S_IFBLK/S_IFCHR/S_IFIFO type, we already have
ocfs2_setattr and ocfs2_getattr enabled.  It's confusing for user space if
it can use setattr/getattr to control one attribute appointed but can not
list attributes using listxattr for above type files:

$ mknod /mnt/b b 0 0
$ setfattr -h -n trusted.name -v 0xbabe /mnt/b
$ getfattr -n trusted.name  /mnt/b
getfattr: Removing leading '/' from absolute path names
trusted.name=0sur4=

$ getfattr -m trusted  /mnt/b
$

Fix it by enabling ocfs2_listxattr for ocfs2_special_file_iops.  After the
commit, fstests/generic/062 will pass.

Link: https://lkml.kernel.org/r/20240312042908.8889-1-l@damenly.org
Signed-off-by: Su Yue <glass.su@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-14 09:17:29 -07:00
Chengming Zhou
46bd944946 ocfs2: remove SLAB_MEM_SPREAD flag usage
The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove
its usage so we can delete it from slab. No functional change.

Link: https://lkml.kernel.org/r/20240224135008.829878-1-chengming.zhou@linux.dev
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-14 09:17:29 -07:00
Chao Yu
eb70d5a6c9 f2fs: fix to avoid use-after-free issue in f2fs_filemap_fault
syzbot reports a f2fs bug as below:

BUG: KASAN: slab-use-after-free in f2fs_filemap_fault+0xd1/0x2c0 fs/f2fs/file.c:49
Read of size 8 at addr ffff88807bb22680 by task syz-executor184/5058

CPU: 0 PID: 5058 Comm: syz-executor184 Not tainted 6.7.0-syzkaller-09928-g052d534373b7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x163/0x540 mm/kasan/report.c:488
 kasan_report+0x142/0x170 mm/kasan/report.c:601
 f2fs_filemap_fault+0xd1/0x2c0 fs/f2fs/file.c:49
 __do_fault+0x131/0x450 mm/memory.c:4376
 do_shared_fault mm/memory.c:4798 [inline]
 do_fault mm/memory.c:4872 [inline]
 do_pte_missing mm/memory.c:3745 [inline]
 handle_pte_fault mm/memory.c:5144 [inline]
 __handle_mm_fault+0x23b7/0x72b0 mm/memory.c:5285
 handle_mm_fault+0x27e/0x770 mm/memory.c:5450
 do_user_addr_fault arch/x86/mm/fault.c:1364 [inline]
 handle_page_fault arch/x86/mm/fault.c:1507 [inline]
 exc_page_fault+0x456/0x870 arch/x86/mm/fault.c:1563
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:570

The root cause is: in f2fs_filemap_fault(), vmf->vma may be not alive after
filemap_fault(), so it may cause use-after-free issue when accessing
vmf->vma->vm_flags in trace_f2fs_filemap_fault(). So it needs to keep vm_flags
in separated temporary variable for tracepoint use.

Fixes: 87f3afd366 ("f2fs: add tracepoint for f2fs_vm_page_mkwrite()")
Reported-and-tested-by: syzbot+763afad57075d3f862f2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000e8222b060f00db3b@google.com
Cc: Ed Tsai <Ed.Tsai@mediatek.com>
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-03-14 09:14:53 -07:00
Sunmin Jeong
74b0ebcbdd f2fs: truncate page cache before clearing flags when aborting atomic write
In f2fs_do_write_data_page, FI_ATOMIC_FILE flag selects the target inode
between the original inode and COW inode. When aborting atomic write and
writeback occur simultaneously, invalid data can be written to original
inode if the FI_ATOMIC_FILE flag is cleared meanwhile.

To prevent the problem, let's truncate all pages before clearing the flag

Atomic write thread              Writeback thread
  f2fs_abort_atomic_write
    clear_inode_flag(inode, FI_ATOMIC_FILE)
                                  __writeback_single_inode
                                    do_writepages
                                      f2fs_do_write_data_page
                                        - use dn of original inode
    truncate_inode_pages_final

Fixes: 3db1de0e58 ("f2fs: change the current atomic write way")
Cc: stable@vger.kernel.org #v5.19+
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-03-14 09:08:43 -07:00
Sunmin Jeong
4bf7832234 f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag
In f2fs_update_inode, i_size of the atomic file isn't updated until
FI_ATOMIC_COMMITTED flag is set. When committing atomic write right
after the writeback of the inode, i_size of the raw inode will not be
updated. It can cause the atomicity corruption due to a mismatch between
old file size and new data.

To prevent the problem, let's mark inode dirty for FI_ATOMIC_COMMITTED

Atomic write thread                   Writeback thread
                                        __writeback_single_inode
                                          write_inode
                                            f2fs_update_inode
                                              - skip i_size update
  f2fs_ioc_commit_atomic_write
    f2fs_commit_atomic_write
      set_inode_flag(inode, FI_ATOMIC_COMMITTED)
    f2fs_do_sync_file
      f2fs_fsync_node_pages
        - skip f2fs_update_inode since the inode is clean

Fixes: 3db1de0e58 ("f2fs: change the current atomic write way")
Cc: stable@vger.kernel.org #v5.19+
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-03-14 09:08:39 -07:00
David Howells
b74c02a379 afs: Fix occasional rmdir-then-VNOVNODE with generic/011
Sometimes generic/011 causes kafs to follow up an FS.RemoveDir RPC call by
spending around a second sending a slew of FS.FetchStatus RPC calls to the
directory just deleted that then abort with VNOVNODE, indicating deletion
of the target directory.

This seems to stem from userspace attempting to stat the directory or
something in it:

    afs_select_fileserver+0x46d/0xaa2
    afs_wait_for_operation+0x12/0x17e
    afs_fetch_status+0x56/0x75
    afs_validate+0xfb/0x240
    afs_permission+0xef/0x1b0
    inode_permission+0x90/0x139
    link_path_walk.part.0.constprop.0+0x6f/0x2f0
    path_lookupat+0x4c/0xfa
    filename_lookup+0x63/0xd7
    vfs_statx+0x62/0x13f
    vfs_fstatat+0x72/0x8a

The issue appears to be that afs_dir_remove_subdir() marks the callback
promise as being cancelled by setting the expiry time to AFS_NO_CB_PROMISE
- which then confuses afs_validate() which sends the FetchStatus to try and
get a new one before it checks for the AFS_VNODE_DELETED flag which
indicates that we know the directory got deleted.

Fix this by:

 (1) Make afs_check_validity() return true if AFS_VNODE_DELETED is set, and
     then tweak the return from afs_validate() if the DELETED flag is set.

 (2) Move the AFS_VNODE_DELETED check in afs_validate() up above the
     expiration check to immediately after we've grabbed the validate_lock.

Fixes: 453924de62 ("afs: Overhaul invalidation handling to better support RO volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240313081505.3060173-3-dhowells@redhat.com
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-14 12:13:21 +01:00
David Howells
83505bde45 afs: Don't cache preferred address
In the AFS fileserver rotation algorithm, don't cache the preferred address
for the server as that will override the explicit preference if a
non-preferred address responds first.

Fixes: 495f2ae9e3 ("afs: Fix fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240313081505.3060173-2-dhowells@redhat.com
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-14 12:13:21 +01:00
David Howells
0aec3847d0 afs: Revert "afs: Hide silly-rename files from userspace"
This reverts commit 57e9d49c54.

This undoes the hiding of .__afsXXXX silly-rename files.  The problem with
hiding them is that rm can't then manually delete them.

This also reverts commit 5f7a076466 ("afs: Fix
endless loop in directory parsing") as that's a bugfix for the above.

Fixes: 57e9d49c54 ("afs: Hide silly-rename files from userspace")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008102.html
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3085695.1710328121@warthog.procyon.org.uk
Reviewed-by: Jeffrey E Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-14 11:51:08 +01:00
Darrick J. Wong
be28368b2c bcachefs: time_stats: shrink time_stat_buffer for better alignment
Shrink this percpu object by one array element so that the object size
becomes exactly 512 bytes.  This will lead to more efficient memory use,
hopefully.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:38:03 -04:00
Darrick J. Wong
273960b8f3 bcachefs: time_stats: split stats-with-quantiles into a separate structure
Currently, struct time_stats has the optional ability to quantize the
information that it collects.  This is /probably/ useful for callers who
want to see quantized information, but it more than doubles the size of
the structure from 224 bytes to 464.  For users who don't care about
that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
counter, split the two into separate pieces.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:38:01 -04:00
Darrick J. Wong
4b4f0876ab bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet
The only caller of this code (time_stats) always knows the weights and
whether or not any information has been collected.  Pass this
information into the mean and variance code so that it doesn't have to
store that information.  This reduces the structure size from 24 to 16
bytes, which shrinks each time_stats counter to 192 bytes from 208.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:37:58 -04:00
Darrick J. Wong
cdbfa228a5 bcachefs: time_stats: add larger units
Filesystems can stay mounted for a very long time, so add some larger
units.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:37:54 -04:00
Kent Overstreet
f1ca1abfb0 bcachefs: pull out time_stats.[ch]
prep work for lifting out of fs/bcachefs/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:30:35 -04:00
Kent Overstreet
cdce109431 bcachefs: reconstruct_alloc cleanup
Now that we've got the errors_silent mechanism, we don't have to check
if the reconstruct_alloc option is set all over the place.

Also - users no longer have to explicitly select fsck and fix_errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
3bbed37214 bcachefs: fix bch_folio_sector padding
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
b3f8e71117 bcachefs: Fix btree key cache coherency during replay
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
5d04409a62 bcachefs: Always flush write buffer in delete_dead_inodes()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
b6fc661f09 bcachefs: Fix order of gc_done passes
gc_stripes_done() and gc_reflink_done() may do alloc btree updates (i.e.
when deleting an indirect extent) - we need bucket gens to be fixed by
then.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
06ebc48306 bcachefs: fix deletion of indirect extents in btree_gc
we need to run the normal extent update path on deletion -
bch2_bkey_make_mut() is incorrect when key type is changing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Erick Archer
3e48999816 bcachefs: Prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "op" variable is a pointer to "struct promote_op" and this
structure ends in a flexible array:

struct promote_op {
	[...]
	struct bio_vec bi_inline_vecs[];
};

and the "t" variable is a pointer to "struct journal_seq_blacklist_table"
and this structure also ends in a flexible array:

struct journal_seq_blacklist_table {
	[...]
	struct journal_seq_blacklist_table_entry {
		u64		start;
		u64		end;
		bool		dirty;
	}			entries[];
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the argument "size + size * count" in the
kzalloc() functions.

This way, the code is more readable and safer.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
1fdb9685ed bcachefs: Kill unused flags argument to btree_split()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
c42006458b bcachefs: Check for writing superblocks with nonsense member seq fields
We're seeing some unmountable filesystems due to split brain detection
going awry; it seems we somehow wrote out superblocks where we updated
the superblock seq without updating any member seq fields.

A given device's superblock should always have the main seq equal to
it's member seq field, so this is easy to check for.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
5e105fb806 bcachefs: fix bch2_journal_buf_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
d64547999c bcachefs: copy_(to|from)_user_errcode()
we've got some helpers that return errors sanely, move them to a more
common location for use in fs-ioctl.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
ba81523eaa bcachefs: Split out bkey_types.h
We're going to need bkey_types.h in bcachefs_ioctl.h in a future patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Brian Foster
ada02c207c bcachefs: fix lost journal buf wakeup due to improved pipelining
The journal_write_done() handler was reworked into a loop in commit
746a33c96b7a ("bcachefs: better journal pipelining"). As part of this,
the journal buffer wake was factored into a post-loop branch that
executes if at least one journal buffer has completed.

The journal buffer processing loop iterates on the journal buffer
pointer, however. This means that w refers to the last buffer processed
by the loop, which may or may not be done. This also means that if
multiple buffers are processed by the loop, only the last is awoken.
This lost wakeup behavior has lead to stalling problems in various CI
and fstests, such as generic/703.

Lift the wake into the loop so each done buffer sees a wake call as
it is processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Hongbo Li
2a68d611a1 bcachefs: intercept mountoption value for bool type
For mount option with bool type, the value must be 0 or 1 (See
bch2_opt_parse). But this seems does not well intercepted cause
for other value(like 2...), it returns the unexpect return code
with error message printed.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Hongbo Li
7e23c1746b bcachefs: avoid returning private error code in bch2_xattr_bcachefs_set
Avoid the private error code return to caller. The error code
should be transformed into genernal error code.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
7e64c86cdc bcachefs: Buffered write path now can avoid the inode lock
Non append, non extending buffered writes can now avoid taking the inode
lock.

To ensure atomicity of writes w.r.t. other writes, we lock every folio
that we'll be writing to, and if this fails we fall back to taking the
inode lock.

Extensive comments are provided as to corner cases.

Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
66a67c860c fs: file_remove_privs_flags()
Rename and export __file_remove_privs(); for a buffered write path that
doesn't take the inode lock we need to be able to check if the operation
needs to do work first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
2024-03-13 21:22:25 -04:00
Kent Overstreet
7efa287526 bcachefs: Fix bch2_journal_noflush_seq()
Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.

Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be flush.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Hongbo Li
79162e829b bcachefs: fix the error code when mounting with incorrect options.
When mount with incorrect options such as:
"mount -t bcachefs -o errors=back /dev/loop1 /mnt/bcachefs/".
It rebacks the error "mount: /mnt/bcachefs: permission denied."
 cause bch2_parse_mount_opts returns -1 and bch2_mount throws
it up. This is unreasonable.

The real error message should be like this:
"mount: /mnt/bcachefs: wrong fs type, bad option, bad
superblock on /dev/loop1, missing codepage or helper program,
or other error."

Adding three private error codes for mounting error. Here are:
  - BCH_ERR_mount_option as the parent class for option error.
  - BCH_ERR_option_name represents the invalid option name.
  - BCH_ERR_option_value represents the invalid option value.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
2cce3752ce bcachefs: split out ignore_blacklisted, ignore_not_dirty
prep work for replaying the journal backwards

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
69426613cd bcachefs: improve move_gap()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
95ffc7fb8c bcachefs: journal_keys now uses darray helpers
nice bit of code cleanup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
894d062254 bcachefs: Rename journal_keys.d -> journal_keys.data
This will let us use some darray helpers in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
0b5961b0d8 bcachefs: jset_entry for loops declare loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
eb386617be bcachefs: Errcode tracepoint, documentation
Add a tracepoint for downcasting private errors to standard errors, so
they can be recovered even when not logged; also, add some
documentation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Colin Ian King
150194cdcb bcachefs: remove redundant assignment to variable ret
Variable ret is being assigned a value that is never read, it is
being re-assigned a couple of statements later on. The assignment
is redundant and can be removed.

Cleans up clang scan build warning:
fs/bcachefs/super-io.c:806:2: warning: Value stored to 'ret' is
never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Calvin Owens
c7cad231e8 bcachefs: Silence gcc warnings about arm arch ABI drift
32-bit arm builds emit a lot of spam like this:

    fs/bcachefs/backpointers.c: In function ‘extent_matches_bp’:
    fs/bcachefs/backpointers.c:15:13: note: parameter passing for argument of type ‘struct bch_backpointer’ changed in GCC 9.1

Apply the change from commit ebcc5928c5 ("arm64: Silence gcc warnings
about arch ABI drift") to fs/bcachefs/ to silence them.

Signed-off-by: Calvin Owens <jcalvinowens@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
90aa35c4c9 bcachefs: Add journal.blocked to journal_debug_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
d9290c9931 bcachefs: Fix journal_buf bitfield accesses
All jounal_buf bitfield updates must happen under the journal lock -
perhaps we should just switch these to atomic bit flags.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
a393f33123 bcachefs: Split out discard fastpath
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.

Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.

We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.

So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
06d493fee4 bcachefs: improve bch2_journal_buf_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
29e11f9699 bcachefs: Drop redundant btree_path_downgrade()s
If a path doesn't have any active references, we shouldn't downgrade it;
it'll either be reused, possibly with intent refs again, or dropped at
bch2_trans_begin() time.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Daniel Hill
ba78af9e56 bcachefs: rebalance_status now shows correct units
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
3235e04afe bcachefs: more informative write path error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
74406f66ad bcachefs: check_path() now only needs to walk up to subvolume root
Now that checking subvolume structure is a separate pass, the main
check_directory_connectivity() pass only needs to walk up to a given
inode's subvolume root.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
663db5a554 bcachefs: bch2_check_subvolume_structure()
Now that we've got bch_subvolume.fs_path_parent, it's easy to write
subvolume

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Thomas Bertschinger
b07ce72626 bcachefs: omit alignment attribute on big endian struct bkey
This is needed for building Rust bindings on big endian architectures
like s390x. Currently this is only done in userspace, but it might
happen in-kernel in the future. When creating a Rust binding for struct
bkey, the "packed" attribute is needed to get a type with the correct
member offsets in the big endian case. However, rustc does not allow
types to have both a "packed" and "align" attribute. Thus, in order to
get a Rust type compatible with the C type, we must omit the "aligned"
attribute in C.

This does not affect the struct's size or member offsets, only its
toplevel alignment, which should be an acceptable impact.

The little endian version can have the "align" attribute because the
"packed" attr is redundant, and rust-bindgen will omit the "packed" attr
when an "align" attr is present and it can do so without changing a
type's layout

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
6e9d0558b1 bcachefs: bch2_trigger_alloc() handles state changes better
bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
e.g. triggering the bucket discard worker and the invalidate worker.

We've observed the discard worker running too often - most runs it
doesn't do any work, according to the tracepoint - so clearly, we're
kicking it off too often.

This adds an explicit statechange() macro to make these checks more
precise.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
b63570f747 bcachefs: bch2_print_opts()
Make sure early error messages get redirected, for
kernel-fsck-from-userland.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
130d229ff5 bcachefs: Improve error messages in device remove path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
5ca8ff157d bcachefs: Use kvzalloc() when dynamically allocating btree paths
THis silences a mm/page_alloc.c warning about allocating more than a
page with GFP_NOFAIL - and there's no reason for this to not have a
vmalloc fallback anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
83bd5985fa bcachefs: Track iter->ip_allocated at bch2_trans_copy_iter()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
3254c1b0e5 bcachefs: Save key_cache_path in peek_slot()
When bch2_btree_iter_peek_slot() clones the iterator to search for the
next key, and then discovers that the key from the cloned iterator is
the key we want to return - we also want to save the
iter->key_cache_path as well, for the update path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
91dcad18d3 bcachefs: Pin btree cache in ram for random access in fsck
Various phases of fsck involve checking references from one btree to
another: this means doing a sequential scan of one btree, and then
mostly random access into the second.

This is particularly painful for checking extents <-> backpointers; we
can prefetch btree node access on the sequential scan, but not on the
random access portion, and this is particularly painful on spinning
rust, where we'd like to keep the pipeline fairly full of btree node
reads so that the elevator can reduce seeking.

This patch implements prefetching and pinning of the portion of the
btree that we'll be doing random access to. We already calculate how
much of the random access btree will fit in memory so it's a fairly
straightforward change.

This will put more pressure on system memory usage, so we introduce a
new option, fsck_memory_usage_percent, which is the percentage of total
system ram that fsck is allowed to pin.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
835cd3e147 bcachefs: Check for subvolume children when deleting subvolumes
Recursively destroying subvolumes isn't allowed yet.

Fixes: https://github.com/koverstreet/bcachefs/issues/634
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
b26d79147f bcachefs: BTREE_ID_subvolume_children
Add a btree to record a parent -> child subvolume relationships,
according to the filesystem heirarchy.

The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.

This will be used for efficiently listing subvolumes, as well as
recursive deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
b8628a2529 bcachefs: bch_subvolume::fs_path_parent
Record the filesystem path heirarchy for subvolumes in bch_subvolume

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
e07c28ab92 bcachefs: bch2_btree_bit_mod()
Provide a non-write buffer version of bch2_btree_bit_mod_buffered(), for
the subvolume children btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
506b187603 bcachefs: bch2_btree_bit_mod -> bch2_btree_bit_mod_buffered
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
56e230473d bcachefs: Correctly reattach subvolumes
Subvolumes need special handling to reattach - we always reattach them
in the root subvolume's lost+found, and they need a slightly different
kind of dirent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
3a136177f3 bcachefs: check_path() now prints full inode when reattaching
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
688a769409 bcachefs: Pass inode bkey to check_path()
prep work for improving logging/error messages

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
f5d58d0c72 bcachefs: Fix path where dirent -> subvol missing and we don't fix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00