83943 Commits

Author SHA1 Message Date
Linus Torvalds
42aadec8c7 stat: remove no-longer-used helper macros
The choose_32_64() macros were added to deal with an odd inconsistency
between the 32-bit and 64-bit layout of 'struct stat' way back when in
commit a52dd971f947 ("vfs: de-crapify "cp_new_stat()" function").

Then a decade later Mikulas noticed that said inconsistency had been a
mistake in the early x86-64 port, and shouldn't have existed in the
first place.  So commit 932aba1e1690 ("stat: fix inconsistency between
struct stat and struct compat_stat") removed the uses of the helpers.

But the helpers remained around, unused.

Get rid of them.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-17 10:46:12 -07:00
Linus Torvalds
45c3c62722 three small SMB3 client fixes, one to improve a null check and two minor cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUGFSgACgkQiiy9cAdy
 T1H8ZQwAuJhiLsTJK0lnWWxZC+KsIvTXlNKqx3VUqhJeYdxAc1tNCVjHTgdm63QA
 gRA0Htt8UhUoVVIMiipW2/PHA4rrNU7i0ULXSasAL6d8pPuZfeCzoehSfFo4u2ra
 bVDjfQUDtRakSU//Aj+Bv2sO77UWz0pQ5y0v2LCpPQ9Ks5TmLgxT+40uXCXf/LAe
 3aBbvrgLOlt0JMXaIEaQoecMitUqajmuuq/5SVQ7lz0xvn7cCLKgk22LehtwHR0W
 Ae8GdCkfFipdq+gp76CZPHO9evmRCsjmF95z56/++HdLrftYln5W/TDfjTlOZM9V
 tP99hK/2EjsWL7TMCOG59w21sKuaOdBA7AV7blgWxZAbKsrBgtMEXgQxSZMiK+Vm
 lKR5lGLWoujQLcnzWRh+WL7XP0ZxzitTlrlLeFxciPSGP843GRx+0oINLKL8CInr
 9mTwkzlzODNKA+83yRs5+Q3i0mq161IugsRrk1NHRUsr7oXiWWIxhcqCy5N5+R2S
 SfB16ql5
 =WtnH
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Three small SMB3 client fixes, one to improve a null check and two
  minor cleanups"

* tag '6.6-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix some minor typos and repeated words
  smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP
  smb3: move server check earlier when setting channel sequence number
2023-09-17 10:41:42 -07:00
Linus Torvalds
39e0c8afdc two ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUGEycACgkQiiy9cAdy
 T1FJQwv/XxgRr8cCkGvoxJW+P969Aw/PMy4BhhL1s26zm85wEudfUldV19VeJad0
 4Jcc7w608pCAhR9Xsl3MLB437ttOID9jOK9G10RvqwAJSaCqO6whpXRaJTGe4g4n
 N71/18R2uPW3HswBe5KSk0pC6WtYaGDTDpJ7hV1zAwiQWQywMA9FgzOspQCRaBEf
 JqH9F8tnzpfT/lDTe64Q3mRXC1ppO/XdJHhxzgRv8l41bc/0PWCPk2GuxpMQNhCo
 BediiKyIa/kUPEDID9k02VVxoW+aitbcw5kYUfMO55V6IkstuDbjq5k7k+r0BKfK
 AM8YE/LyRM5izwaV73tS2mSVZlEQLSlfwAuAY5uvcnanUIegFypCHEclnNmkS3Qx
 dXAonMWGD4+8N/aywNg5Zm5ql3HzLzS4uCIVJbyeOLqd1GljaYjvWsGkXvY9NnyT
 ED5ya4jTFZeEbONEdnPcgmOEZifs93VnklCsaGMFRJbv1gnKsBOt75EooeB7+44j
 TyRaeNNe
 =MOeW
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc1-ksmbd' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Two ksmbd server fixes"

* tag '6.6-rc1-ksmbd' of git://git.samba.org/ksmbd:
  ksmbd: fix passing freed memory 'aux_payload_buf'
  ksmbd: remove unneeded mark_inode_dirty in set_info_sec()
2023-09-17 10:38:01 -07:00
Linus Torvalds
3fde3003ca Regression and bug fixes for ext4.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmUGh1YACgkQ8vlZVpUN
 gaN9lQgAqmMWu3xLwOERgVbK3CYT8WMcv0m9/by+vSwghCoPVDWWENgEgAzo4YpK
 Lsp4q62wHaWs6AzvJEaJ8ryedo7e4FUHxcvp2f6dCuOPadOEZZZTa4G5fAr0kYXS
 TIoaFtv6F2QVnGU6Y5lhtfYzmgLRdLL0B6MfSTYGO2MSREqxapvfxyGBQdkOuXfO
 UEtrUUEqQ2GdDcKp+FRRnaUvNaTPEESY8d5eVwrMmyUhQWUQL/N2BPbFkk1TP6RG
 MLDNsUZpdhZvLs6qLuR7dvO5wa2fshvRJIXlPINM0R0as5LmHqVL/ifCNkCn4W+k
 ZNvdSPhqew68KHHq3sYFtm9rbZ3YOA==
 =DopS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Regression and bug fixes for ext4"

* tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix rec_len verify error
  ext4: do not let fstrim block system suspend
  ext4: move setting of trimmed bit into ext4_try_to_trim_range()
  jbd2: Fix memory leak in journal_init_common()
  jbd2: Remove page size assumptions
  buffer: Make bh_offset() work for compound pages
2023-09-17 10:33:53 -07:00
Linus Torvalds
d8d7cd6563 nfsd-6.6 fixes:
- Use correct order when encoding NFSv4 RENAME change_info
 - Fix a potential oops during NFSD shutdown
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmUEf90ACgkQM2qzM29m
 f5dcrhAAjz3SLoz4nAlo441JMATp/auwqTBKlWvNUWbK/ZgYjUcTXH8RfPbBQ/Yl
 5NfsThubmPTSr/bNvV+C963C4c/25d45SYwabAdLoZWKIpWf1mCDNhE1JkGQoM83
 KyZcrLnEknh/fN6jAxueFhWPCaavGDK8eI8H6Ls33GvX0cG4S7IYseGZ1x2dMWpE
 XFnV8oAWSD3cjKrTGwPYja2W0ShISGn4UMnj+lPWEpfCDydnQzKh87dhWzvlAxCf
 4Ckikmv3vcCfm15Ja0JZX3K1cvh0A3oISbgN/QNqGpzDzQ1qyugsx9uNltDa76N8
 NzGENnvO2/WURkw4DHJYIsiNt82uB04NYSAL6mLat3GU4DeAKOl3r+9W0jZPFS17
 7mOLh9dqg3ubSxOr6BpeY4I6+Uq2enUTwh1+VkvC7hQ906ZY8kzaj5MtLddvkjCw
 p4BTDzH280AwtnwiJ7q0WvBR5TxdGc3fqqsLJ+yMX2aGdhM1iMOjjENoSc3+v+Cy
 pEK1tju/lDzcMZrl5A08Za30L5boHp11n5SLia7tctKiFs4biL3WwNu527Y/Klq6
 04DtkcXdF/tW702388sKL1UnnXhw4KA8k7Us4HrQL+zXDZ+/UJ+AzY6s8ls2ytbN
 0NESe/ntsKjoijrRphAp+RgC8fhC7iT0GMJP8OHyHUqEOU2Wrbw=
 =ZTRy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Use correct order when encoding NFSv4 RENAME change_info

 - Fix a potential oops during NFSD shutdown

* tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: fix possible oops when nfsd/pool_stats is closed.
  nfsd: fix change_info in NFSv4 RENAME replies
2023-09-15 16:48:44 -07:00
Linus Torvalds
e42bebf6db First set of EFI fixes for v6.6:
- Missing x86 patch for the runtime cleanup that was merged in -rc1
 - Kconfig tweak for kexec on x86 so EFI support does not get disabled
   inadvertently
 - Use the right EFI memory type for the unaccepted memory table so
   kexec/kdump exposes it to the crash kernel as well
 - Work around EFI implementations which do not implement
   QueryVariableInfo, which is now called by statfs() on efivarfs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZQGDxwAKCRAwbglWLn0t
 XApcAP4+Fv6orQy4h/nmkDhWJa8vg36gWu1CDmy/abo0v3ODZQD/duk7Ejqw5vAm
 kvFmxLheNVs/RmgZtmB7CugAqibTkAE=
 =EdJu
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - Missing x86 patch for the runtime cleanup that was merged in -rc1

 - Kconfig tweak for kexec on x86 so EFI support does not get disabled
   inadvertently

 - Use the right EFI memory type for the unaccepted memory table so
   kexec/kdump exposes it to the crash kernel as well

 - Work around EFI implementations which do not implement
   QueryVariableInfo, which is now called by statfs() on efivarfs

* tag 'efi-fixes-for-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efivarfs: fix statfs() on efivarfs
  efi/unaccepted: Use ACPI reclaim memory for unaccepted memory table
  efi/x86: Ensure that EFI_RUNTIME_MAP is enabled for kexec
  efi/x86: Move EFI runtime call setup/teardown helpers out of line
2023-09-15 12:42:48 -07:00
Olga Kornievskaia
4506f23e11 NFSv4.1: fix zero value filehandle in post open getattr
Currently, if the OPEN compound experiencing an error and needs to
get the file attributes separately, it will send a stand alone
GETATTR but it would use the filehandle from the results of
the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh
is zero value. That generate a GETATTR that's sent with a zero
value filehandle, and results in the server returning an error.

Instead, for the CLAIM_FH OPEN, take the filehandle that was used
in the PUTFH of the OPEN compound.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-15 10:50:47 -04:00
Steve French
2c75426c1f smb3: fix some minor typos and repeated words
Minor cleanup pointed out by checkpatch (repeated words, missing blank
lines) in smb2pdu.c and old header location referred to in transport.c

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-15 01:37:33 -05:00
Steve French
ebc3d4e44a smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP
checkpatch flagged a few places with:
     WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP
Also fixed minor typo

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-15 01:10:40 -05:00
Filipe Manana
8e7f82deb0 btrfs: fix race between reading a directory and adding entries to it
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
are not holding the directory's inode locked, and this can result in later
attempting to add two entries to the directory with the same index number,
resulting in a transaction abort, with -EEXIST (-17), when inserting the
second delayed dir index. This results in a trace like the following:

  Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
  Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
  Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
  Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
  Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
  Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
  Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
  Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
  Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
  Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
  Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
  Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
  Sep 11 22:34:59 myhostname kernel: FS:  00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
  Sep 11 22:34:59 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
  Sep 11 22:34:59 myhostname kernel: Call Trace:
  Sep 11 22:34:59 myhostname kernel:  <TASK>
  Sep 11 22:34:59 myhostname kernel:  ? die+0x36/0x90
  Sep 11 22:34:59 myhostname kernel:  ? do_trap+0xda/0x100
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? do_error_trap+0x6a/0x90
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? exc_invalid_op+0x50/0x70
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? asm_exc_invalid_op+0x1a/0x20
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  btrfs_insert_dir_item+0x200/0x280
  Sep 11 22:34:59 myhostname kernel:  btrfs_add_link+0xab/0x4f0
  Sep 11 22:34:59 myhostname kernel:  ? ktime_get_real_ts64+0x47/0xe0
  Sep 11 22:34:59 myhostname kernel:  btrfs_create_new_inode+0x7cd/0xa80
  Sep 11 22:34:59 myhostname kernel:  btrfs_symlink+0x190/0x4d0
  Sep 11 22:34:59 myhostname kernel:  ? schedule+0x5e/0xd0
  Sep 11 22:34:59 myhostname kernel:  ? __d_lookup+0x7e/0xc0
  Sep 11 22:34:59 myhostname kernel:  vfs_symlink+0x148/0x1e0
  Sep 11 22:34:59 myhostname kernel:  do_symlinkat+0x130/0x140
  Sep 11 22:34:59 myhostname kernel:  __x64_sys_symlinkat+0x3d/0x50
  Sep 11 22:34:59 myhostname kernel:  do_syscall_64+0x5d/0x90
  Sep 11 22:34:59 myhostname kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
  Sep 11 22:34:59 myhostname kernel:  ? do_syscall_64+0x6c/0x90
  Sep 11 22:34:59 myhostname kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc

The race leading to the problem happens like this:

1) Directory inode X is loaded into memory, its ->index_cnt field is
   initialized to (u64)-1 (at btrfs_alloc_inode());

2) Task A is adding a new file to directory X, holding its vfs inode lock,
   and calls btrfs_set_inode_index() to get an index number for the entry.

   Because the inode's index_cnt field is set to (u64)-1 it calls
   btrfs_inode_delayed_dir_index_count() which fails because no dir index
   entries were added yet to the delayed inode and then it calls
   btrfs_set_inode_index_count(). This functions finds the last dir index
   key and then sets index_cnt to that index value + 1. It found that the
   last index key has an offset of 100. However before it assigns a value
   of 101 to index_cnt...

3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
   lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
   and sees index_cnt still with a value of (u64)-1. Because of that it
   calls btrfs_inode_delayed_dir_index_count() which fails since no dir
   index entries were added to the delayed inode yet, and then it also
   calls btrfs_set_inode_index_count(). This also finds that the last
   index key has an offset of 100, and before it assigns the value 101
   to the index_cnt field of inode X...

4) Task A assigns a value of 101 to index_cnt. And then the code flow
   goes to btrfs_set_inode_index() where it increments index_cnt from
   101 to 102. Task A then creates a delayed dir index entry with a
   sequence number of 101 and adds it to the delayed inode;

5) Task B assigns 101 to the index_cnt field of inode X;

6) At some later point when someone tries to add a new entry to the
   directory, btrfs_set_inode_index() will return 101 again and shortly
   after an attempt to add another delayed dir index key with index
   number 101 will fail with -EEXIST resulting in a transaction abort.

Fix this by locking the inode at btrfs_get_dir_last_index(), which is only
only used when opening a directory or attempting to lseek on it.

Reported-by: ken <ken@bllue.org>
Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
e60aa5da14 btrfs: refresh dir last index during a rewinddir(3) call
When opening a directory we find what's the index of its last entry and
then store it in the directory's file handle private data (struct
btrfs_file_private::last_index), so that in the case new directory entries
are added to a directory after an opendir(3) call we don't end up in an
infinite loop (see commit 9b378f6ad48c ("btrfs: fix infinite directory
reads")) when calling readdir(3).

However once rewinddir(3) is called, POSIX states [1] that any new
directory entries added after the previous opendir(3) call, must be
returned by subsequent calls to readdir(3):

  "The rewinddir() function shall reset the position of the directory
   stream to which dirp refers to the beginning of the directory.
   It shall also cause the directory stream to refer to the current
   state of the corresponding directory, as a call to opendir() would
   have done."

We currently don't refresh the last_index field of the struct
btrfs_file_private associated to the directory, so after a rewinddir(3)
we are not returning any new entries added after the opendir(3) call.

Fix this by finding the current last index of the directory when llseek
is called against the directory.

This can be reproduced by the following C program provided by Ian Johnson:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     rewinddir(dir);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

Reported-by: Ian Johnson <ian@ianjohnson.dev>
Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
357950361c btrfs: set last dir index to the current last index when opening dir
When opening a directory for reading it, we set the last index where we
stop iteration to the value in struct btrfs_inode::index_cnt. That value
does not match the index of the most recently added directory entry but
it's instead the index number that will be assigned the next directory
entry.

This means that if after the call to opendir(3) new directory entries are
added, a readdir(3) call will return the first new directory entry. This
is fine because POSIX says the following [1]:

  "If a file is removed from or added to the directory after the most
   recent call to opendir() or rewinddir(), whether a subsequent call to
   readdir() returns an entry for that file is unspecified."

For example for the test script from commit 9b378f6ad48c ("btrfs: fix
infinite directory reads"), where we have 2000 files in a directory, ext4
doesn't return any new directory entry after opendir(3), while xfs returns
the first 13 new directory entries added after the opendir(3) call.

If we move to a shorter example with an empty directory when opendir(3) is
called, and 2 files added to the directory after the opendir(3) call, then
readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
files (but in a different order). A test program for this, reported by
Ian Johnson, is the following:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

To make this less odd, change the behaviour to never return new entries
that were added after the opendir(3) call. This is done by setting the
last_index field of the struct btrfs_file_private attached to the
directory's file handle with a value matching btrfs_inode::index_cnt
minus 1, since that value always matches the index of the next new
directory entry and not the index of the most recently added entry.

[1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html

Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Shida Zhang
7fda67e8c3 ext4: fix rec_len verify error
With the configuration PAGE_SIZE 64k and filesystem blocksize 64k,
a problem occurred when more than 13 million files were directly created
under a directory:

EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt?  Run e2fsck -D.
EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt?  Run e2fsck -D.
EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum

When enough files are created, the fake_dirent->reclen will be 0xffff.
it doesn't equal to the blocksize 65536, i.e. 0x10000.

But it is not the same condition when blocksize equals to 4k.
when enough files are created, the fake_dirent->reclen will be 0x1000.
it equals to the blocksize 4k, i.e. 0x1000.

The problem seems to be related to the limitation of the 16-bit field
when the blocksize is set to 64k.
To address this, helpers like ext4_rec_len_{from,to}_disk has already
been introduced to complete the conversion between the encoded and the
plain form of rec_len.

So fix this one by using the helper, and all the other in this file too.

Cc: stable@kernel.org
Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:07:07 -04:00
Jan Kara
5229a658f6 ext4: do not let fstrim block system suspend
Len Brown has reported that system suspend sometimes fail due to
inability to freeze a task working in ext4_trim_fs() for one minute.
Trimming a large filesystem on a disk that slowly processes discard
requests can indeed take a long time. Since discard is just an advisory
call, it is perfectly fine to interrupt it at any time and the return
number of discarded blocks until that moment. Do that when we detect the
task is being frozen.

Cc: stable@kernel.org
Reported-by: Len Brown <lenb@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=216322
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:59 -04:00
Jan Kara
45e4ab320c ext4: move setting of trimmed bit into ext4_try_to_trim_range()
Currently we set the group's trimmed bit in ext4_trim_all_free() based
on return value of ext4_try_to_trim_range(). However when we will want
to abort trimming because of suspend attempt, we want to return success
from ext4_try_to_trim_range() but not set the trimmed bit. Instead
implementing awkward propagation of this information, just move setting
of trimmed bit into ext4_try_to_trim_range() when the whole group is
trimmed.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:47 -04:00
Li Zetao
1bb0763f1e jbd2: Fix memory leak in journal_init_common()
There is a memory leak reported by kmemleak:

  unreferenced object 0xff11000105903b80 (size 64):
    comm "mount", pid 3382, jiffies 4295032021 (age 27.826s)
    hex dump (first 32 bytes):
      04 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
      ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<ffffffffae86ac40>] __kmalloc_node+0x50/0x160
      [<ffffffffaf2486d8>] crypto_alloc_tfmmem.isra.0+0x38/0x110
      [<ffffffffaf2498e5>] crypto_create_tfm_node+0x85/0x2f0
      [<ffffffffaf24a92c>] crypto_alloc_tfm_node+0xfc/0x210
      [<ffffffffaedde777>] journal_init_common+0x727/0x1ad0
      [<ffffffffaede1715>] jbd2_journal_init_inode+0x2b5/0x500
      [<ffffffffaed786b5>] ext4_load_and_init_journal+0x255/0x2440
      [<ffffffffaed8b423>] ext4_fill_super+0x8823/0xa330
      ...

The root cause was traced to an error handing path in journal_init_common()
when malloc memory failed in register_shrinker(). The checksum driver is
used to reference to checksum algorithm via cryptoapi and the user should
release the memory when the driver is no longer needed or the journal
initialization failed.

Fix it by calling crypto_free_shash() on the "err_cleanup" error handing
path in journal_init_common().

Fixes: c30713084ba5 ("jbd2: move load_superblock() into journal_init_common()")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230911025138.983101-1-lizetao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:22 -04:00
Linus Torvalds
99214f6778 Tracing fixes for 6.6:
- Add missing LOCKDOWN checks for eventfs callers
   When LOCKDOWN is active for tracing, it causes inconsistent state
   when some functions succeed and others fail.
 
 - Use dput() to free the top level eventfs descriptor
   There was a race between accesses and freeing it.
 
 - Fix a long standing bug that eventfs exposed due to changing timings
   by dynamically creating files. That is, If a event file is opened
   for an instance, there's nothing preventing the instance from being
   removed which will make accessing the files cause use-after-free bugs.
 
 - Fix a ring buffer race that happens when iterating over the ring
   buffer while writers are active. Check to make sure not to read
   the event meta data if it's beyond the end of the ring buffer sub buffer.
 
 - Fix the print trigger that disappeared because the test to create it
   was looking for the event dir field being filled, but now it has the
   "ef" field filled for the eventfs structure.
 
 - Remove the unused "dir" field from the event structure.
 
 - Fix the order of the trace_dynamic_info as it had it backwards for the
   offset and len fields for which one was for which endianess.
 
 - Fix NULL pointer dereference with eventfs_remove_rec()
   If an allocation fails in one of the eventfs_add_*() functions,
   the caller of it in event_subsystem_dir() or event_create_dir()
   assigns the result to the structure. But it's assigning the ERR_PTR
   and not NULL. This was passed to eventfs_remove_rec() which expects
   either a good pointer or a NULL, not ERR_PTR. The fix is to not
   assign the ERR_PTR to the structure, but to keep it NULL on error.
 
 - Fix list_for_each_rcu() to use list_for_each_srcu() in
   dcache_dir_open_wrapper(). One iteration of the code used RCU
   but because it had to call sleepable code, it had to be changed
   to use SRCU, but one of the iterations was missed.
 
 - Fix synthetic event print function to use "as_u64" instead of
   passing in a pointer to the union. To fix big/little endian issues,
   the u64 that represented several types was turned into a union to
   define the types properly.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQCvoBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtgrAP9MiYiCMU+90oJ+61DFchbs3y7BNidP
 s3lLRDUMJ935NQD/SSAm54PqWb+YXMpD7m9+3781l6xqwfabBMXNaEl+FwA=
 =tlZu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Add missing LOCKDOWN checks for eventfs callers

   When LOCKDOWN is active for tracing, it causes inconsistent state
   when some functions succeed and others fail.

 - Use dput() to free the top level eventfs descriptor

   There was a race between accesses and freeing it.

 - Fix a long standing bug that eventfs exposed due to changing timings
   by dynamically creating files. That is, If a event file is opened for
   an instance, there's nothing preventing the instance from being
   removed which will make accessing the files cause use-after-free
   bugs.

 - Fix a ring buffer race that happens when iterating over the ring
   buffer while writers are active. Check to make sure not to read the
   event meta data if it's beyond the end of the ring buffer sub buffer.

 - Fix the print trigger that disappeared because the test to create it
   was looking for the event dir field being filled, but now it has the
   "ef" field filled for the eventfs structure.

 - Remove the unused "dir" field from the event structure.

 - Fix the order of the trace_dynamic_info as it had it backwards for
   the offset and len fields for which one was for which endianess.

 - Fix NULL pointer dereference with eventfs_remove_rec()

   If an allocation fails in one of the eventfs_add_*() functions, the
   caller of it in event_subsystem_dir() or event_create_dir() assigns
   the result to the structure. But it's assigning the ERR_PTR and not
   NULL. This was passed to eventfs_remove_rec() which expects either a
   good pointer or a NULL, not ERR_PTR. The fix is to not assign the
   ERR_PTR to the structure, but to keep it NULL on error.

 - Fix list_for_each_rcu() to use list_for_each_srcu() in
   dcache_dir_open_wrapper(). One iteration of the code used RCU but
   because it had to call sleepable code, it had to be changed to use
   SRCU, but one of the iterations was missed.

 - Fix synthetic event print function to use "as_u64" instead of passing
   in a pointer to the union. To fix big/little endian issues, the u64
   that represented several types was turned into a union to define the
   types properly.

* tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()
  tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper()
  tracing/synthetic: Print out u64 values properly
  tracing/synthetic: Fix order of struct trace_dynamic_info
  selftests/ftrace: Fix dependencies for some of the synthetic event tests
  tracing: Remove unused trace_event_file dir field
  tracing: Use the new eventfs descriptor for print trigger
  ring-buffer: Do not attempt to read past "commit"
  tracefs/eventfs: Free top level files on removal
  ring-buffer: Avoid softlockup in ring_buffer_resize()
  tracing: Have event inject files inc the trace array ref count
  tracing: Have option files inc the trace array ref count
  tracing: Have current_trace inc the trace array ref count
  tracing: Have tracing_max_latency inc the trace array ref count
  tracing: Increase trace array ref count on enable and filter files
  tracefs/eventfs: Use dput to free the toplevel events directory
  tracefs/eventfs: Add missing lockdown checks
  tracefs: Add missing lockdown check to tracefs_create_dir()
2023-09-13 11:30:11 -07:00
Josef Bacik
b595d25996 btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup.  This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.

The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate.  It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap.  We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.

The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.

We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect.  This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.

Fix this by not clearing uptodate when we have a write error.  The only
thing we should be doing in this case is setting the mapping error and
carrying on.  This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.

With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.

Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:41:07 +02:00
Bernd Schubert
9af86694fd btrfs: file_remove_privs needs an exclusive lock in direct io write
This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.

Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:41:03 +02:00
Matthew Wilcox (Oracle)
06ed09351b btrfs: convert btrfs_read_merkle_tree_page() to use a folio
Remove a number of hidden calls to compound_head() by using a folio
throughout.  Also follow core kernel coding style by adding the folio to
the page cache immediately after allocation instead of doing the read
first, then adding it to the page cache.  This ordering makes subsequent
readers block waiting for the first reader instead of duplicating the
work only to throw it away when they find out they lost the race.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:40:54 +02:00
Olga Kornievskaia
806a3bc421 NFSv4.1: fix pnfs MDS=DS session trunking
Currently, when GETDEVICEINFO returns multiple locations where each
is a different IP but the server's identity is same as MDS, then
nfs4_set_ds_client() finds the existing nfs_client structure which
has the MDS's max_connect value (and if it's 1), then the 1st IP
on the DS's list will get dropped due to MDS trunking rules. Other
IPs would be added as they fall under the pnfs trunking rules.

For the list of IPs the 1st goes thru calling nfs4_set_ds_client()
which will eventually call nfs4_add_trunk() and call into
rpc_clnt_test_and_add_xprt() which has the check for MDS trunking.
The other IPs (after the 1st one), would call rpc_clnt_add_xprt()
which doesn't go thru that check.

nfs4_add_trunk() is called when MDS trunking is happening and it
needs to enforce the usage of max_connect mount option of the
1st mount. However, this shouldn't be applied to pnfs flow.

Instead, this patch proposed to treat MDS=DS as DS trunking and
make sure that MDS's max_connect limit does not apply to the
1st IP returned in the GETDEVICEINFO list. It does so by
marking the newly created client with a new flag NFS_CS_PNFS
which then used to pass max_connect value to use into the
rpc_clnt_test_and_add_xprt() instead of the existing rpc
client's max_connect value set by the MDS connection.

For example, mount was done without max_connect value set
so MDS's rpc client has cl_max_connect=1. Upon calling into
rpc_clnt_test_and_add_xprt() and using rpc client's value,
the caller passes in max_connect value which is previously
been set in the pnfs path (as a part of handling
GETDEVICEINFO list of IPs) in nfs4_set_ds_client().

However, when NFS_CS_PNFS flag is not set and we know we
are doing MDS trunking, comparing a new IP of the same
server, we then set the max_connect value to the
existing MDS's value and pass that into
rpc_clnt_test_and_add_xprt().

Fixes: dc48e0abee24 ("SUNRPC enforce creation of no more than max_connect xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust
dd7d7ee3ba NFS/pNFS: Report EINVAL errors from connect() to the server
With IPv6, connect() can occasionally return EINVAL if a route is
unavailable. If this happens during I/O to a data server, we want to
report it using LAYOUTERROR as an inability to connect.

Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust
b11243f720 NFS: More fixes for nfs_direct_write_reschedule_io()
Ensure that all requests are put back onto the commit list so that they
can be rescheduled.

Fixes: 4daaeba93822 ("NFS: Fix nfs_direct_write_reschedule_io()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust
b193a78ddb NFS: Use the correct commit info in nfs_join_page_group()
Ensure that nfs_clear_request_commit() updates the correct counters when
it removes them from the commit list.

Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust
8982f7aff3 NFS: More O_DIRECT accounting fixes for error paths
If we hit a fatal error when retransmitting, we do need to record the
removal of the request from the count of written bytes.

Fixes: 031d73ed768a ("NFS: Fix O_DIRECT accounting of number of bytes read/written")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:10 -04:00
Trond Myklebust
7c6339322c NFS: Fix O_DIRECT locking issues
The dreq fields are protected by the dreq->lock.

Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:50:57 -04:00
Namjae Jeon
59d8d24f46 ksmbd: fix passing freed memory 'aux_payload_buf'
The patch e2b76ab8b5c9: "ksmbd: add support for read compound" leads
to the following Smatch static checker warning:

  fs/smb/server/smb2pdu.c:6329 smb2_read()
        warn: passing freed memory 'aux_payload_buf'

It doesn't matter that we're passing a freed variable because nbytes is
zero. This patch set "aux_payload_buf = NULL" to make smatch silence.

Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-13 10:21:05 -05:00
Namjae Jeon
e4e14095cc ksmbd: remove unneeded mark_inode_dirty in set_info_sec()
mark_inode_dirty will be called in notify_change().
This patch remove unneeded mark_inode_dirty in set_info_sec().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-13 10:21:05 -05:00
Wang Jianchao
8b010acb31 xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
In our production environment, we find that mounting a 500M /boot
which is umount cleanly needs ~6s. One cause is that ffs() is
used by xlog_write_log_records() to decide the buffer size. It
can cause a lot of small IO easily when xlog_clear_stale_blocks()
needs to wrap around the end of log area and log head block is
not power of two. Things are similar in xlog_find_verify_cycle().

The code is able to handed bigger buffer very well, we can use
roundup_pow_of_two() to replace ffs() directly to avoid small
and sychronous IOs.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Wang Jianchao <wangjc136@midea.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-09-13 10:38:20 +05:30
Chandan Babu R
1155b12edb xfs: fix out of bounds memory access in scrub
This is a quick fix for a few internal syzbot reports concerning an
 invalid memory access in the scrub code.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOgAKCRBKO3ySh0YR
 pkKbAQCKg0+VAqr2UuKT7PygRSUaLNybnMBHetDZyd1maEl7OQD7BGuM9AxwXWFp
 hL0Jq/HN5yeArrueGKMd0K3u1HRjJQE=
 =XwHc
 -----END PGP SIGNATURE-----

Merge tag 'fix-scrub-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix out of bounds memory access in scrub

This is a quick fix for a few internal syzbot reports concerning an
invalid memory access in the scrub code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-scrub-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: only call xchk_stats_merge after validating scrub inputs
2023-09-13 10:35:49 +05:30
Chandan Babu R
6ebb6500e5 xfs: disallow LARP on old fses
Before enabling logged xattrs, make sure the filesystem is new enough
 that it actually supports log incompat features.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pqc1AQD8hXUpatOY50TdRDI6qpKBWOEti7r+sXyq9bWM4QZFyAD/Zjx3aZ+R2u2g
 lsb1xLjekrh2DzToOFnvs4gd/nZd7Qw=
 =BxHQ
 -----END PGP SIGNATURE-----

Merge tag 'fix-larp-requirements-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: disallow LARP on old fses

Before enabling logged xattrs, make sure the filesystem is new enough
that it actually supports log incompat features.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-larp-requirements-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: require a relatively recent V5 filesystem for LARP mode
2023-09-13 10:33:27 +05:30
Chandan Babu R
abf7c8194f xfs: reload entire iunlink lists
This is the second part of correcting XFS to reload the incore unlinked
 inode list from the ondisk contents.  Whereas part one tackled failures
 from regular filesystem calls, this part takes on the problem of needing
 to reload the entire incore unlinked inode list on account of somebody
 loading an inode that's in the /middle/ of an unlinked list.  This
 happens during quotacheck, bulkstat, or even opening a file by handle.
 
 In this case we don't know the length of the list that we're reloading,
 so we don't want to create a new unbounded memory load while holding
 resources locked.  Instead, we'll target UNTRUSTED iget calls to reload
 the entire bucket.
 
 Note that this changes the definition of the incore unlinked inode list
 slightly -- i_prev_unlinked == 0 now means "not on the incore list".
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 ptU6AP48lONiOPzWvF1mXTnDosAtIDsfliMY+qVgVxrghqBFmwEAitHlOadpWonu
 yoQ3cnSqzfA4rKT5MQZCm2iIHH/LMgU=
 =Kp5V
 -----END PGP SIGNATURE-----

Merge tag 'fix-iunlink-list-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: reload entire iunlink lists

This is the second part of correcting XFS to reload the incore unlinked
inode list from the ondisk contents.  Whereas part one tackled failures
from regular filesystem calls, this part takes on the problem of needing
to reload the entire incore unlinked inode list on account of somebody
loading an inode that's in the /middle/ of an unlinked list.  This
happens during quotacheck, bulkstat, or even opening a file by handle.

In this case we don't know the length of the list that we're reloading,
so we don't want to create a new unbounded memory load while holding
resources locked.  Instead, we'll target UNTRUSTED iget calls to reload
the entire bucket.

Note that this changes the definition of the incore unlinked inode list
slightly -- i_prev_unlinked == 0 now means "not on the incore list".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-iunlink-list-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: make inode unlinked bucket recovery work with quotacheck
  xfs: reload entire unlinked bucket lists
  xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
2023-09-13 10:30:12 +05:30
Chandan Babu R
fffcdcc31f xfs: reload the last iunlink item
It turns out that there are some serious bugs in how xfs handles the
 unlinked inode lists.  Way back before 4.14, there was a bug where a ro
 mount of a dirty filesystem would recover the log bug neglect to purge
 the unlinked list.  This leads to clean unmounted filesystems with
 unlinked inodes.  Starting around 5.15, we also converted the codebase
 to maintain a doubly-linked incore unlinked list.  However, we never
 provided the ability to load the incore list from disk.  If someone
 tries to allocate an O_TMPFILE file on a clean fs with a pre-existing
 unlinked list or even deletes a file, the code will fail and the fs
 shuts down.
 
 This first part of the correction effort adds the ability to load the
 first inode in the bucket when unlinking a file; and to load the next
 inode in the list when inactivating (freeing) an inode.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 plJvAQC0s843w2nvluXlIE8P9nBqk2ht6zwNOJpiZbWnf0zeLAD/a6v0HVVLbGN5
 qHVd/abQ5QIW55Ybm3Qko6PKvV4Nlgo=
 =WcRN
 -----END PGP SIGNATURE-----

Merge tag 'fix-iunlink-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: reload the last iunlink item

It turns out that there are some serious bugs in how xfs handles the
unlinked inode lists.  Way back before 4.14, there was a bug where a ro
mount of a dirty filesystem would recover the log bug neglect to purge
the unlinked list.  This leads to clean unmounted filesystems with
unlinked inodes.  Starting around 5.15, we also converted the codebase
to maintain a doubly-linked incore unlinked list.  However, we never
provided the ability to load the incore list from disk.  If someone
tries to allocate an O_TMPFILE file on a clean fs with a pre-existing
unlinked list or even deletes a file, the code will fail and the fs
shuts down.

This first part of the correction effort adds the ability to load the
first inode in the bucket when unlinking a file; and to load the next
inode in the list when inactivating (freeing) an inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-iunlink-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: load uncached unlinked inodes into memory on demand
2023-09-13 10:25:00 +05:30
Chandan Babu R
b6c2b6378d xfs: fix EFI recovery livelocks
This series fixes a customer-reported transaction reservation bug
 introduced ten years ago that could result in livelocks during log
 recovery.  Log intent item recovery single-steps each step of a deferred
 op chain, which means that each step only needs to allocate one
 transaction's worth of space in the log, not an entire chain all at
 once.  This single-stepping is critical to unpinning the log tail since
 there's nobody else to do it for us.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pt1uAQCkc4vnjA7C1eUsCwqnzK/A9fstwTnmx7qlGGfFM7wwowD7BqQX2AAeYUvu
 iT4UzvG9kao+jNNr0zx+ddYOOTJcrgI=
 =H9nC
 -----END PGP SIGNATURE-----

Merge tag 'fix-efi-recovery-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix EFI recovery livelocks

This series fixes a customer-reported transaction reservation bug
introduced ten years ago that could result in livelocks during log
recovery.  Log intent item recovery single-steps each step of a deferred
op chain, which means that each step only needs to allocate one
transaction's worth of space in the log, not an entire chain all at
once.  This single-stepping is critical to unpinning the log tail since
there's nobody else to do it for us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-efi-recovery-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: reserve less log space when recovering log intent items
2023-09-13 10:22:26 +05:30
Chandan Babu R
f41d7d70b0 xfs: fix ro mounting with unknown rocompat features [v2]
Dave pointed out some failures in xfs/270 when he upgraded Debian
 unstable and util-linux started using the new mount apis.  Upon further
 inquiry I noticed that XFS is quite a hot mess when it encounters a
 filesystem with unrecognized rocompat bits set in the superblock.
 
 Whereas we used to allow readonly mounts under these conditions, a
 change to the sb write verifier several years ago resulted in the
 filesystem going down immediately because the post-mount log cleaning
 writes the superblock, which trips the sb write verifier on the
 unrecognized rocompat bit.  I made the observation that the ROCOMPAT
 features RMAPBT and REFLINK both protect new log intent item types,
 which means that we actually cannot support recovering the log if we
 don't recognize all the rocompat bits.
 
 Therefore -- fix inode inactivation to work when we're recovering the
 log, disallow recovery when there's unrecognized rocompat bits, and
 don't clean the log if doing so would trip the rocompat checks.
 
 v2: change direction of series to allow log recovery on ro mounts
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pkFRAP0f7+do6A3cs5GuMSCRdH3DImjX1ts9nHJAgxKadTod8gEApeDb290wI+ek
 NTetY6RKfexMZLEgXI8YtAlhsR8nVwI=
 =LARv
 -----END PGP SIGNATURE-----

Merge tag 'fix-ro-mounts-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix ro mounting with unknown rocompat features

Dave pointed out some failures in xfs/270 when he upgraded Debian
unstable and util-linux started using the new mount apis.  Upon further
inquiry I noticed that XFS is quite a hot mess when it encounters a
filesystem with unrecognized rocompat bits set in the superblock.

Whereas we used to allow readonly mounts under these conditions, a
change to the sb write verifier several years ago resulted in the
filesystem going down immediately because the post-mount log cleaning
writes the superblock, which trips the sb write verifier on the
unrecognized rocompat bit.  I made the observation that the ROCOMPAT
features RMAPBT and REFLINK both protect new log intent item types,
which means that we actually cannot support recovering the log if we
don't recognize all the rocompat bits.

Therefore -- fix inode inactivation to work when we're recovering the
log, disallow recovery when there's unrecognized rocompat bits, and
don't clean the log if doing so would trip the rocompat checks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-ro-mounts-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: fix log recovery when unknown rocompat bits are set
  xfs: allow inode inactivation during a ro mount log recovery
2023-09-13 10:14:02 +05:30
Chandan Babu R
0a229c935a xfs: fix cpu hotplug mess [v2]
Ritesh and Eric separately reported crashes in XFS's hook function for
 CPU hot remove if the remove event races with a filesystem being
 mounted.  I also noticed via generic/650 that once in a while the log
 will shut down over an apparent overrun of a transaction reservation;
 this turned out to be due to CIL percpu list aggregation failing to pick
 up the percpu list items from a dying CPU.
 
 Either way, the solution here is to eliminate the need for a CPU dying
 hook by using a private cpumask to track which CPUs have added to their
 percpu lists directly, and iterating with that mask.  This fixes the log
 problems and (I think) solves a theoretical UAF bug in the inodegc code
 too.
 
 v2: fix a few put_cpu uses, add necessary memory barriers, and use
     atomic cpumask operations
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 plauAQCV0RymbwD/ONbvpor3yK4R3YO1pa923KtoiQ9IAV5uswD/YBWvyI76BhNs
 B8hwbEDm3X2ZjQaikxI+Xx2cMaAhkgY=
 =EJ0b
 -----END PGP SIGNATURE-----

Merge tag 'fix-percpu-lists-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix cpu hotplug mess

Ritesh and Eric separately reported crashes in XFS's hook function for
CPU hot remove if the remove event races with a filesystem being
mounted.  I also noticed via generic/650 that once in a while the log
will shut down over an apparent overrun of a transaction reservation;
this turned out to be due to CIL percpu list aggregation failing to pick
up the percpu list items from a dying CPU.

Either way, the solution here is to eliminate the need for a CPU dying
hook by using a private cpumask to track which CPUs have added to their
percpu lists directly, and iterating with that mask.  This fixes the log
problems and (I think) solves a theoretical UAF bug in the inodegc code
too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-percpu-lists-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: remove CPU hotplug infrastructure
  xfs: remove the all-mounts list
  xfs: use per-mount cpumask to track nonempty percpu inodegc lists
  xfs: fix per-cpu CIL structure aggregation racing with dying cpus
2023-09-13 10:11:44 +05:30
Chandan Babu R
da6f8410e7 xfs: fix fsmap cursor handling [v2]
This patchset addresses an integer overflow bug that Dave Chinner found
 in how fsmap handles figuring out where in the record set we left off
 when userspace calls back after the first call filled up all the
 designated record space.
 
 v2: add RVB tags
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChMwAKCRBKO3ySh0YR
 prqBAP9Zp2WxwQuNQLqCfXBRLZiJRiW8JFcTNJOjdqIicsOPYgEAxs1GHJU4ozrO
 bKyolvNJIjSow7LWYP1GmfCRa9FqwQ4=
 =3uSx
 -----END PGP SIGNATURE-----

Merge tag 'fix-fsmap-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix fsmap cursor handling

This patchset addresses an integer overflow bug that Dave Chinner found
in how fsmap handles figuring out where in the record set we left off
when userspace calls back after the first call filled up all the
designated record space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-fsmap-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: fix an agbno overflow in __xfs_getfsmap_datadev
2023-09-13 10:03:38 +05:30
Steve French
05d0f8f55a smb3: move server check earlier when setting channel sequence number
Smatch warning pointed out by Dan Carpenter:

    fs/smb/client/smb2pdu.c:105 smb2_hdr_assemble()
    warn: variable dereferenced before check 'server' (see line 95)

Fixes: 09ee7a3bf866 ("[SMB3] send channel sequence number in SMB3 requests after reconnects")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-12 14:21:32 -05:00
Linus Torvalds
3669558bdf for-6.6-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmT/hwAACgkQxWXV+ddt
 WDsn7hAAngwEMKEAH9Jvu/BtHgRYcAdsGh5Mxw34aQf1+DAaH03GGsZjN6hfHYo4
 FMsnnvoZD5VPfuaFaQVd+mS9mRzikm503W7KfZFAPAQTOjz50RZbohLnZWa3eFbI
 46OcpoHusxwoYosEmIAt+dcw/gDlT9fpj+W11dKYtwOEjCqGA/OeKoVenfk38hVJ
 r+XhLwZFf4dPIqE3Ht26UtJk87Xs2X0/LQxOX3vM1MZ+l38N4dyo7TQnwfTHlQNw
 AK9sK6vp3rpRR96rvTV1dWr9lnmE7wky+Vh36DN/jxpzbW7Wx8IVoobBpcsO4Tyk
 Vw/rdjB7g7LfBmjLFhWvvQ73jv0WjIUUzXH17RuxOeyAQJ9tXFztVMh+QoVVC/Ka
 NxwA5uqyJKR7DIA+kLL06abUnASUVgP6Krdv9Fk7rYCKWluWk1k9ls9XaFFhytvg
 eeno/UB0px1rwps5P5zfaSXLIXEl53Luy5rFhTMCCNQfXyo+Qe6PJyTafR3E0uP8
 aXJV1lPG+o7qi9Vwg+20yy//1sE5gR0dLrcTaup3/20RK6eljZ/bNSkl3GJR9mlS
 YF+J/Ccia06y8Qo0xaeCofxkoI3J/PK6KPOTt8yZDgYoetYgHhrfBRO0I7ZU4Edq
 10512hAeskzPt6+5348+/jOEENASffXKP3FJSdDEzWd33vtlaHE=
 =mHTa
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - several fixes for handling directory item (inserting, removing,
   iteration, error handling)

 - fix transaction commit stalls when auto relocation is running and
   blocks other tasks that want to commit

 - fix a build error when DEBUG is enabled

 - fix lockdep warning in inode number lookup ioctl

 - fix race when finishing block group creation

 - remove link to obsolete wiki in several files

* tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org
  btrfs: assert delayed node locked when removing delayed item
  btrfs: remove BUG() after failure to insert delayed dir index item
  btrfs: improve error message after failure to add delayed dir index item
  btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
  btrfs: check for BTRFS_FS_ERROR in pending ordered assert
  btrfs: fix lockdep splat and potential deadlock after failure running delayed items
  btrfs: do not block starts waiting on previous transaction commit
  btrfs: release path before inode lookup during the ino lookup ioctl
  btrfs: fix race between finishing block group creation and its item update
2023-09-12 11:28:00 -07:00
Darrick J. Wong
e031928200 xfs: only call xchk_stats_merge after validating scrub inputs
Harshit Mogalapalli slogged through several reports from our internal
syzbot instance and observed that they all had a common stack trace:

BUG: KASAN: user-memory-access in instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
BUG: KASAN: user-memory-access in atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
BUG: KASAN: user-memory-access in queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
BUG: KASAN: user-memory-access in do_raw_spin_lock include/linux/spinlock.h:187 [inline]
BUG: KASAN: user-memory-access in __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
BUG: KASAN: user-memory-access in _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
Write of size 4 at addr 0000001dd87ee280 by task syz-executor365/1543

CPU: 2 PID: 1543 Comm: syz-executor365 Not tainted 6.5.0-syzk #1
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x83/0xb0 lib/dump_stack.c:106
 print_report+0x3f8/0x620 mm/kasan/report.c:478
 kasan_report+0xb0/0xe0 mm/kasan/report.c:588
 check_region_inline mm/kasan/generic.c:181 [inline]
 kasan_check_range+0x139/0x1e0 mm/kasan/generic.c:187
 instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
 do_raw_spin_lock include/linux/spinlock.h:187 [inline]
 __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
 _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:351 [inline]
 xchk_stats_merge_one.isra.1+0x39/0x650 fs/xfs/scrub/stats.c:191
 xchk_stats_merge+0x5f/0xe0 fs/xfs/scrub/stats.c:225
 xfs_scrub_metadata+0x252/0x14e0 fs/xfs/scrub/scrub.c:599
 xfs_ioc_scrub_metadata+0xc8/0x160 fs/xfs/xfs_ioctl.c:1646
 xfs_file_ioctl+0x3fd/0x1870 fs/xfs/xfs_ioctl.c:1955
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:871 [inline]
 __se_sys_ioctl fs/ioctl.c:857 [inline]
 __x64_sys_ioctl+0x199/0x220 fs/ioctl.c:857
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3e/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7ff155af753d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc006e2568 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff155af753d
RDX: 00000000200000c0 RSI: 00000000c040583c RDI: 0000000000000003
RBP: 00000000ffffffff R08: 00000000004010c0 R09: 00000000004010c0
R10: 00000000004010c0 R11: 0000000000000246 R12: 0000000000400cb0
R13: 00007ffc006e2670 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

The root cause here is that xchk_stats_merge_one walks off the end of
the xchk_scrub_stats.cs_stats array because it has been fed a garbage
value in sm->sm_type.  That occurs because I put the xchk_stats_merge
in the wrong place -- it should have been after the last xchk_teardown
call on our way out of xfs_scrub_metadata because we only call the
teardown function if we called the setup function, and we don't call the
setup functions if the inputs are obviously garbage.

Thanks to Harshit for triaging the bug reports and bringing this to my
attention.

Fixes: d7a74cad8f45 ("xfs: track usage statistics of online fsck")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:08 -07:00
Darrick J. Wong
34389616a9 xfs: require a relatively recent V5 filesystem for LARP mode
While reviewing the FIEXCHANGE code in XFS, I realized that the function
that enables logged xattrs doesn't actually check that the superblock
has a LOG_INCOMPAT feature bit field.  Add a check to refuse the
operation if we don't have a V5 filesystem...

...but on second though, let's require either reflink or rmap so that we
only have to deal with LARP mode on relatively /modern/ kernel.  4.14 is
about as far back as I feel like going.

Seeing as LARP is a debugging-only option anyway, this isn't likely to
affect any real users.

Fixes: d9c61ccb3b09 ("xfs: move xfs_attr_use_log_assist out of xfs_log.c")
Really-Fixes: f3f36c893f26 ("xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
2023-09-12 10:31:08 -07:00
Darrick J. Wong
49813a21ed xfs: make inode unlinked bucket recovery work with quotacheck
Teach quotacheck to reload the unlinked inode lists when walking the
inode table.  This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
68b957f64f xfs: load uncached unlinked inodes into memory on demand
shrikanth hegde reports that filesystems fail shortly after mount with
the following failure:

	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]

This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:

	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }

From diagnostic data collected by the bug reporters, it would appear
that we cleanly mounted a filesystem that contained unlinked inodes.
Unlinked inodes are only processed as a final step of log recovery,
which means that clean mounts do not process the unlinked list at all.

Prior to the introduction of the incore unlinked lists, this wasn't a
problem because the unlink code would (very expensively) traverse the
entire ondisk metadata iunlink chain to keep things up to date.
However, the incore unlinked list code complains when it realizes that
it is out of sync with the ondisk metadata and shuts down the fs, which
is bad.

Ritesh proposed to solve this problem by unconditionally parsing the
unlinked lists at mount time, but this imposes a mount time cost for
every filesystem to catch something that should be very infrequent.
Instead, let's target the places where we can encounter a next_unlinked
pointer that refers to an inode that is not in cache, and load it into
cache.

Note: This patch does not address the problem of iget loading an inode
from the middle of the iunlink list and needing to set i_prev_unlinked
correctly.

Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
3c919b0910 xfs: reserve less log space when recovering log intent items
Wengang Wang reports that a customer's system was running a number of
truncate operations on a filesystem with a very small log.  Contention
on the reserve heads lead to other threads stalling on smaller updates
(e.g.  mtime updates) long enough to result in the node being rebooted
on account of the lack of responsivenes.  The node failed to recover
because log recovery of an EFI became stuck waiting for a grant of
reserve space.  From Wengang's report:

"For the file deletion, log bytes are reserved basing on
xfs_mount->tr_itruncate which is:

    tr_logres = 175488,
    tr_logcount = 2,
    tr_logflags = XFS_TRANS_PERM_LOG_RES,

"You see it's a permanent log reservation with two log operations (two
transactions in rolling mode).  After calculation (xlog_calc_unit_res()
adds space for various log headers), the final log space needed per
transaction changes from  175488 to 180208 bytes.  So the total log
space needed is 360416 bytes (180208 * 2).  [That quantity] of log space
(360416 bytes) needs to be reserved for both run time inode removing
(xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())."

In other words, runtime pre-reserves 360K of space in anticipation of
running a chain of two transactions in which each transaction gets a
180K reservation.

Now that we've allocated the transaction, we delete the bmap mapping,
log an EFI to free the space, and roll the transaction as part of
finishing the deferops chain.  Rolling creates a new xfs_trans which
shares its ticket with the old transaction.  Next, xfs_trans_roll calls
__xfs_trans_commit with regrant == true, which calls xlog_cil_commit
with the same regrant parameter.

xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
subtracts t_curr_res from the reservation and write heads.

If the filesystem is fresh and the first transaction only used (say)
20K, then t_curr_res will be 160K, and we give that much reservation
back to the reservation head.  Or if the file is really fragmented and
the first transaction actually uses 170K, then t_curr_res will be 10K,
and that's what we give back to the reservation.

Having done that, we're now headed into the second transaction with an
EFI and 180K of reservation.  Other threads apparently consumed all the
reservation for smaller transactions, such as timestamp updates.

Now let's say the first transaction gets written to disk and we crash
without ever completing the second transaction.  Now we remount the fs,
log recovery finds the unfinished EFI, and calls xfs_efi_recover to
finish the EFI.  However, xfs_efi_recover starts a new tr_itruncate
tranasction, which asks for 360K log reservation.  This is a lot more
than the 180K that we had reserved at the time of the crash.  If the
first EFI to be recovered is also pinning the tail of the log, we will
be unable to free any space in the log, and recovery livelocks.

Wengang confirmed this:

"Now we have the second transaction which has 180208 log bytes reserved
too. The second transaction is supposed to process intents including
extent freeing.  With my hacking patch, I blocked the extent freeing 5
hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved.

"With my test case, other transactions (update timestamps) then happen.
As my hacking patch pins the journal tail, those timestamp-updating
transactions finally use up (almost) all the left available log space
(in memory in on disk).  And finally the on disk (and in memory)
available log space goes down near to 180208 bytes.  Those 180208 bytes
are reserved by [the] second (extent-free) transaction [in the chain]."

Wengang and I noticed that EFI recovery starts a transaction, completes
one step of the chain, and commits the transaction without completing
any other steps of the chain.  Those subsequent steps are completed by
xlog_finish_defer_ops, which allocates yet another transaction to
finish the rest of the chain.  That transaction gets the same tr_logres
as the head transaction, but with tr_logcount = 1 to force regranting
with every roll to avoid livelocks.

In other words, we already figured this out in commit 929b92f64048d
("xfs: xfs_defer_capture should absorb remaining transaction
reservation"), but should have applied that logic to each intent item's
recovery function.  For Wengang's case, the xfs_trans_alloc call in the
EFI recovery function should only be asking for a single transaction's
worth of log reservation -- 180K, not 360K.

Quoting Wengang again:

"With log recovery, during EFI recovery, we use tr_itruncate again to
reserve two transactions that needs 360416 log bytes.  Reserving 360416
bytes fails [stalls] because we now only have about 180208 available.

"Actually during the EFI recover, we only need one transaction to free
the extents just like the 2nd transaction at RUNTIME.  So it only needs
to reserve 180208 rather than 360416 bytes.  We have (a bit) more than
180208 available log bytes on disk, so [if we decrease the reservation
to 180K] the reservation goes and the recovery [finishes].  That is to
say: we can fix the log recover part to fix the issue. We can introduce
a new xfs_trans_res xfs_mount->tr_ext_free

{
  tr_logres = 175488,
  tr_logcount = 0,
  tr_logflags = 0,
}

"and use tr_ext_free instead of tr_itruncate in EFI recover."

However, I don't think it quite makes sense to create an entirely new
transaction reservation type to handle single-stepping during log
recovery.  Instead, we should copy the transaction reservation
information in the xfs_mount, change tr_logcount to 1, and pass that
into xfs_trans_alloc.  We know this won't risk changing the min log size
computation since we always ask for a fraction of the reservation for
all known transaction types.

This looks like it's been lurking in the codebase since commit
3d3c8b5222b92, which changed the xfs_trans_reserve call in
xlog_recover_process_efi to use the tr_logcount in tr_itruncate.
That changed the EFI recovery transaction from making a
non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log
space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth.

Fixes: 3d3c8b5222b92 ("xfs: refactor xfs_trans_reserve() interface")
Complements: 929b92f64048d ("xfs: xfs_defer_capture should absorb remaining transaction reservation")
Suggested-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Srikanth C S <srikanth.c.s@oracle.com>
[djwong: apply the same transformation to all log intent recovery]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
74ad4693b6 xfs: fix log recovery when unknown rocompat bits are set
Log recovery has always run on read only mounts, even where the primary
superblock advertises unknown rocompat bits.  Due to a misunderstanding
between Eric and Darrick back in 2018, we accidentally changed the
superblock write verifier to shutdown the fs over that exact scenario.
As a result, the log cleaning that occurs at the end of the mounting
process fails if there are unknown rocompat bits set.

As we now allow writing of the superblock if there are unknown rocompat
bits set on a RO mount, we no longer want to turn off RO state to allow
log recovery to succeed on a RO mount.  Hence we also remove all the
(now unnecessary) RO state toggling from the log recovery path.

Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier"
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
83771c50e4 xfs: reload entire unlinked bucket lists
The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality.  It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list.  For example,
if at mount time the ondisk iunlink bucket looks like this:

AGI -> 7 -> 22 -> 3 -> NULL

None of these three inodes are cached in memory.  Now let's say that
someone tries to open inode 3 by handle.  We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
76e589013f xfs: allow inode inactivation during a ro mount log recovery
In the next patch, we're going to prohibit log recovery if the primary
superblock contains an unrecognized rocompat feature bit even on
readonly mounts.  This requires removing all the code in the log
mounting process that temporarily disables the readonly state.

Unfortunately, inode inactivation disables itself on readonly mounts.
Clearing the iunlinked lists after log recovery needs inactivation to
run to free the unreferenced inodes, which (AFAICT) is the only reason
why log mounting plays games with the readonly state in the first place.

Therefore, change the inactivation predicates to allow inactivation
during log recovery of a readonly mount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
f12b96683d xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Christoph Hellwig
4aa8cdd5e5 iomap: handle error conditions more gracefully in iomap_to_bh
iomap_to_bh currently BUG()s when the passed in block number is not
in the iomap.  For file systems that have proper synchronization this
should never happen and so far hasn't in mainline, but for block devices
size changes aren't fully synchronized against ongoing I/O.  Instead
of BUG()ing in this case, return -EIO to the caller, which already has
proper error handling.  While we're at it, also return -EIO for an
unknown iomap state instead of returning garbage.

Fixes: 487c607df790 ("block: use iomap for writes to block devices")
Reported-by: syzbot+4a08ffdf3667b36650a1@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2023-09-12 10:05:48 -07:00
Linus Torvalds
afe03f088f overlayfs fixes for 6.6-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmT+s8AACgkQEVvVyTe/
 1Wr3GA/+IiTHYJSqHOjKvAWs8pVX3rcmJxoodPidOFAncPKzw5GlD7Lo0C8Qgmie
 CEzg8kSfMPg0n7VpywOCiA1n37gBJbmHdQ0108OxJA65IRna4K6qcyBQvfOGXq9u
 Qx360cTCJFGBITkkRmg8RZQKF+Dj3nd0vHn3feGkPftL113fhZTZ1uWFxyXIGsuu
 eki8wW3WgZFvS71Pp9nZEVFd/HPJukd03LKHQlSnQQBnCJJjZhvM6O2Y4o4lFznj
 aWTIHQdowz00Mj1kFEM46cGFDg3SwFtdOizpRrWxL0oOVnElIo8mRAxtf3gz4081
 fvvhvYG5QEgSUrq7VfuGxaxlh2tyHgc8gh7lNXC0JCGSjc2lHGosvniFcRo6ecgU
 7UwT+rX0odWzbSDH6TrMHPZsBSS/siKWreii63HUlMOo0mSorTzA20EK7f9qoXTA
 dgvMD7cbFiEOjwlS9SbIjMMuvYM5VykCyWuniBqicA5UzUj2/K5DG6apkMK/MGbn
 DU/r0HYROqdggk920i/Yyv4GoS6uERfELpkoJr9q7Lx1+wAkRGbNUpUe408Wp411
 I66Ynie48oBmlDfU3LiyW9b3OPbFMPKE3WTPIngJurWRoHFXunxdkArfQi+rAUpx
 cmC5CovFAaVSL3HyhWAXh5lMbe1KjUasQM8ywTyhki2wZzrAI7U=
 =F/OX
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs fixes from Amir Goldstein:
 "Two fixes for pretty old regressions"

* tag 'ovl-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: fix incorrect fdput() on aio completion
  ovl: fix failed copyup of fileattr on a symlink
2023-09-12 09:00:25 -07:00