92118 Commits

Author SHA1 Message Date
Baokun Li
522018a0de cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
We got the following issue in our fault injection stress test:

==================================================================
BUG: KASAN: slab-use-after-free in fscache_withdraw_volume+0x2e1/0x370
Read of size 4 at addr ffff88810680be08 by task ondemand-04-dae/5798

CPU: 0 PID: 5798 Comm: ondemand-04-dae Not tainted 6.8.0-dirty #565
Call Trace:
 kasan_check_range+0xf6/0x1b0
 fscache_withdraw_volume+0x2e1/0x370
 cachefiles_withdraw_volume+0x31/0x50
 cachefiles_withdraw_cache+0x3ad/0x900
 cachefiles_put_unbind_pincount+0x1f6/0x250
 cachefiles_daemon_release+0x13b/0x290
 __fput+0x204/0xa00
 task_work_run+0x139/0x230

Allocated by task 5820:
 __kmalloc+0x1df/0x4b0
 fscache_alloc_volume+0x70/0x600
 __fscache_acquire_volume+0x1c/0x610
 erofs_fscache_register_volume+0x96/0x1a0
 erofs_fscache_register_fs+0x49a/0x690
 erofs_fc_fill_super+0x6c0/0xcc0
 vfs_get_super+0xa9/0x140
 vfs_get_tree+0x8e/0x300
 do_new_mount+0x28c/0x580
 [...]

Freed by task 5820:
 kfree+0xf1/0x2c0
 fscache_put_volume.part.0+0x5cb/0x9e0
 erofs_fscache_unregister_fs+0x157/0x1b0
 erofs_kill_sb+0xd9/0x1c0
 deactivate_locked_super+0xa3/0x100
 vfs_get_super+0x105/0x140
 vfs_get_tree+0x8e/0x300
 do_new_mount+0x28c/0x580
 [...]
==================================================================

Following is the process that triggers the issue:

        mount failed         |         daemon exit
------------------------------------------------------------
 deactivate_locked_super        cachefiles_daemon_release
  erofs_kill_sb
   erofs_fscache_unregister_fs
    fscache_relinquish_volume
     __fscache_relinquish_volume
      fscache_put_volume(fscache_volume, fscache_volume_put_relinquish)
       zero = __refcount_dec_and_test(&fscache_volume->ref, &ref);
                                 cachefiles_put_unbind_pincount
                                  cachefiles_daemon_unbind
                                   cachefiles_withdraw_cache
                                    cachefiles_withdraw_volumes
                                     list_del_init(&volume->cache_link)
       fscache_free_volume(fscache_volume)
        cache->ops->free_volume
         cachefiles_free_volume
          list_del_init(&cachefiles_volume->cache_link);
        kfree(fscache_volume)
                                     cachefiles_withdraw_volume
                                      fscache_withdraw_volume
                                       fscache_volume->n_accesses
                                       // fscache_volume UAF !!!

The fscache_volume in cache->volumes must not have been freed yet, but its
reference count may be 0. So use the new fscache_try_get_volume() helper
function try to get its reference count.

If the reference count of fscache_volume is 0, fscache_put_volume() is
freeing it, so wait for it to be removed from cache->volumes.

If its reference count is not 0, call cachefiles_withdraw_volume() with
reference count protection to avoid the above issue.

Fixes: fe2140e2f57f ("cachefiles: Implement volume support")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240628062930.2467993-3-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03 10:36:14 +02:00
Baokun Li
85b08b31a2 netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
Export fscache_put_volume() and add fscache_try_get_volume()
helper function to allow cachefiles to get/put fscache_volume
via linux/fscache-cache.h.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240628062930.2467993-2-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03 10:36:14 +02:00
David Howells
08f70c0a93 cifs: Fix read-performance regression by dropping readahead expansion
cifs_expand_read() is causing a performance regression of around 30% by
causing extra pagecache to be allocated for an inode in the readahead path
before we begin actually dispatching RPC requests, thereby delaying the
actual I/O.  The expansion is sized according to the rsize parameter, which
seems to be 4MiB on my test system; this is a big step up from the first
requests made by the fio test program.

Simple repro (look at read bandwidth number):
     fio --name=writetest --filename=/xfstest.test/foo --time_based --runtime=60 --size=16M --numjobs=1 --rw=read

Fix this by removing cifs_expand_readahead().  Readahead expansion is
mostly useful for when we're using the local cache if the local cache has a
block size greater than PAGE_SIZE, so we can dispense with it when not
caching.

Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-02 21:23:41 -05:00
Linus Torvalds
dbd8132ace vfs-6.10-rc7.fixes.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZoRT7gAKCRCRxhvAZXjc
 opNXAP9dXdBWK0LqpLlN5Y01UQ8Kd7AqFCAEvFL5SCX3U3dJ8QD+IZbzIM2qhPJJ
 f2gVyw7drWTfqJvWhzFch616QyGVNwA=
 =daVI
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Improve handling of deep ancestor chains in is_subdir()

   - Release locks cleanly when fctnl_setlk() races with close().

     When setting a file lock fails the VFS tries to cleanup the already
     created lock. The helper used for this calls back into the LSM
     layer which may cause it to fail, leaving the stale lock accessible
     via /proc/locks.

  AFS:

   - Fix a comma/semicolon typo"

* tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Convert comma to semicolon
  fs: better handle deep ancestor chains in is_subdir()
  filelock: Remove locks reliably when fcntl/close race is detected
2024-07-02 13:43:02 -07:00
Chen Ni
655593a40e
afs: Convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn/
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 21:23:00 +02:00
Christian Brauner
391b59b045
fs: better handle deep ancestor chains in is_subdir()
Jan reported that 'cd ..' may take a long time in deep directory
hierarchies under a bind-mount. If concurrent renames happen it is
possible to livelock in is_subdir() because it will keep retrying.

Change is_subdir() from simply retrying over and over to retry once and
then acquire the rename lock to handle deep ancestor chains better. The
list of alternatives to this approach were less then pleasant. Change
the scope of rcu lock to cover the whole walk while at it.

A big thanks to Jan and Linus. Both Jan and Linus had proposed
effectively the same thing just that one version ended up being slightly
more elegant.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 21:18:32 +02:00
Linus Torvalds
734610514c Changes since last update:
- Fix possible global buffer memory leak when unloading EROFS module;
 
  - Fix FS_IOC_GETFSUUID ioctl by using super_set_uuid();
 
  - Reset m_llen to 0 so then it can retry if metadata is invalid.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmaEAJwRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qoaSg//cHQSVWZ5joXjTEsSk5a1cROsEOz6I8PG
 lY0jrYNxk/xrAapdsr3SzJ+5evz2f3bsKNAy6qHk1LUQwZOMoOWxVZfMmUjirGHy
 GncsEbD1XPgW7W/GdbYdg1/fxvBd7z88G5JNv62F07e8MEWWme1YLgV2AykTgWux
 FuzUPdunJj10NA235530GWcwsTf2GC774HXIiJdbh+l+iTfC4ESpJJf6+e+aQKaA
 GCRAxTV8TQtY98+HKIKFNpXvR31iApRQ4crqkXIzmTiT9Num87vOzzzLo9dYTprX
 /rG652hWjYMzOPCDyR4z/W8LM9oUdVjQdb3VITBfQbzz5z0EjJfNlAug8y5PAXKO
 sJPTCEd3ayN9Hbn6XFXUc1MzNLMu7+wPop/Oui2dpszMpcLYffJ2cG6MRvdYByTD
 cU7aqyKotcwFJYCDInSwFBIYb0sJ4tdz/NpeFmRPwRhHBjKmqn/wM8Rc8pewx7/3
 TmnyeTzrD0SPIQi2RbHUl2JTq3zwJ9Ur8+FQKYY/ZqeeJcSkutjoQGybtxN4PkYZ
 Gn/qLZHJ3SWY1Najhf85OmzyDWV4lxDoGzz+83E5Z8GkPT/Pa7DXdYql2rQJfGlk
 sJ/Bx/+OK0dN1vylGkEiKnAiF/HSZlSQKX6v/swN15bEmpOHWSMn0NmW5qaf5I3Q
 0Ix/PbO1OH0=
 =it8r
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "The most important one fixes possible infinite loops reported by a
  smartphone vendor OPPO recently due to some unexpected zero-sized
  compressed pcluster out of interrupted I/Os, storage failures, etc.

  Another patch fixes global buffer memory leak on unloading, and the
  remaining one switches to use super_set_uuid() to keep with the other
  filesystems.

  Summary:

   - Fix possible global buffer memory leak when unloading EROFS module

   - Fix FS_IOC_GETFSUUID ioctl by using super_set_uuid()

   - Reset m_llen to 0 so then it can retry if metadata is invalid"

* tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: ensure m_llen is reset to 0 if metadata is invalid
  erofs: convert to use super_set_uuid to support for FS_IOC_GETFSUUID
  erofs: fix possible memory leak in z_erofs_gbuf_exit()
2024-07-02 11:59:34 -07:00
Jann Horn
3cad1bc010
filelock: Remove locks reliably when fcntl/close race is detected
When fcntl_setlk() races with close(), it removes the created lock with
do_lock_file_wait().
However, LSMs can allow the first do_lock_file_wait() that created the lock
while denying the second do_lock_file_wait() that tries to remove the lock.
In theory (but AFAIK not in practice), posix_lock_file() could also fail to
remove a lock due to GFP_KERNEL allocation failure (when splitting a range
in the middle).

After the bug has been triggered, use-after-free reads will occur in
lock_get_status() when userspace reads /proc/locks. This can likely be used
to read arbitrary kernel memory, but can't corrupt kernel memory.
This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
enforcing mode and only works from some security contexts.

Fix it by calling locks_remove_posix() instead, which is designed to
reliably get rid of POSIX locks associated with the given file and
files_struct and is also used by filp_flush().

Fixes: c293621bbf67 ("[PATCH] stale POSIX lock handling")
Cc: stable@kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 20:48:14 +02:00
Filipe Manana
9da45c88e1 btrfs: fix uninitialized return value in the ref-verify tool
In the ref-verify tool, when processing the inline references of an extent
item, we may end up returning with uninitialized return value, because:

1) The 'ret' variable is not initialized if there are no inline extent
   references ('ptr' == 'end' before the while loop starts);

2) If we find an extent owner inline reference we don't initialize 'ret'.

So fix these cases by initializing 'ret' to 0 when declaring the variable
and set it to -EINVAL if we find an extent owner inline references and
simple quotas are not enabled (as well as print an error message).

Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02 19:14:57 +02:00
Qu Wenruo
724d8042ce btrfs: always do the basic checks for btrfs_qgroup_inherit structure
[BUG]
Syzbot reports the following regression detected by KASAN:

  BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
  Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171

  CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
   print_address_description mm/kasan/report.c:377 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:488
   kasan_report+0x143/0x180 mm/kasan/report.c:601
   btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
   create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854
   create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922
   btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382
   create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875
   btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029
   btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075
   __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340
   btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422
   btrfs_ioctl+0x99e/0xc60
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:907 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fcbf1992509
  RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509
  RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003
  RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660
  R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e
   </TASK>

And it also pinned it down to commit b5357cb268c4 ("btrfs: qgroup: do not
check qgroup inherit if qgroup is disabled").

[CAUSE]
That offending commit skips the whole qgroup inherit check if qgroup is
not enabled.

But that also skips the very basic checks like
num_ref_copies/num_excl_copies and the structure size checks.

Meaning if a qgroup enable/disable race is happening at the background,
and we pass a btrfs_qgroup_inherit structure when the qgroup is
disabled, the check would be completely skipped.

Then at the time of transaction commitment, qgroup is re-enabled and
btrfs_qgroup_inherit() is going to use the incorrect structure and
causing the above KASAN error.

[FIX]
Make btrfs_qgroup_check_inherit() only skip the source qgroup checks.
So that even if invalid btrfs_qgroup_inherit structure is passed in, we
can still reject invalid ones no matter if qgroup is enabled or not.

Furthermore we do already have an extra safety inside
btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources,
so even if we only skip the qgroup source check we're still safe.

Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com
Fixes: b5357cb268c4 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02 19:14:36 +02:00
Naohiro Aota
64d2c847ba btrfs: zoned: fix calc_available_free_space() for zoned mode
calc_available_free_space() returns the total size of metadata (or
system) block groups, which can be allocated from unallocated disk
space. The logic is wrong on zoned mode in two places.

First, the calculation of data_chunk_size is wrong. We always allocate
one zone as one chunk, and no partial allocation of a zone. So, we
should use zone_size (= data_sinfo->chunk_size) as it is.

Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning non-zone size
aligned bytes will result in less pressure on the async metadata reclaim
process.

This is serious for the nearly full state with a large zone size device.
Allowing over-commit too much will result in less async reclaim work and
end up in ENOSPC. We can align down to the zone size to avoid that.

Fixes: cb6cbab79055 ("btrfs: adjust overcommit logic when very close to full")
CC: stable@vger.kernel.org # 6.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02 19:13:11 +02:00
Kent Overstreet
a2d23f3d91 bcachefs: io clock: run timer fns under clock lock
We don't have a way to flush a timer that's executing the callback, and
this is simple and limited enough in scope that we can just use the lock
instead.

Needed for the next patch that adds direct wakeups from the allocator to
copygc, where we're now more frequently calling io_timer_del() on an
expiring timer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-01 22:56:28 -04:00
Linus Torvalds
cfbc0ffea8 for-6.10-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaC45kACgkQxWXV+ddt
 WDt6yRAAkn3n/nkapAvQbtOEIAV9GOc+DYecQXLM+E6m85vsvOBO6OeO/QDfIGvI
 ALNE4EEKTkmqk6AOLNX9rQUvo8aOaDXj/9bNZYuVSCzG2hfwLijv6DlRShxuEIE8
 kzOisuIZ4w7aL7G7OsUa50j/BLdZ47+iVF/79N+odhdaCDhhK/CcfLbemiLUS9AF
 OYkYDyl2WCo9HLduSlVHDWXNUKs7I6/S29UWQpkTKTkmLlMk7rdkgbpjgpKjFxsd
 /CuVW5NEGs+4dkV2OdOJ9t+f4qGJ56YuJanvrV3R1bGh+sgDzrcA0kP8318nFzgG
 KBYTXjAZoe5RAi4IYfMhSrEExo2JJYFeei0B7Dv3M4IvxLOF7NdMvppBFdODF0A4
 20gZ8EgNtZ0sMEafK2WAZSI23sjX0TH+/P3FFKQszxX0fRH3NV6al2drhuBTeIKr
 UzxDqeDuqpDHSZskHIMR9nMEymcov5JTg5v2ds64WiT8kr3L23u3j1KO+8UrZ8j4
 eB5vYEXE9VJDNPoJQ/qhbqDduzIJ1s+RU8lhEVIUJRIThRk/oHc7YcXgXKqvS6RM
 ilJeAzjEhp2t3CQDfx+FJ9jjgi5jhOoCkbITg6iiiExIt++XOXJtL0y2BH514LA5
 5jQXoXl/YnukfCaKLNRsxYXCw7JAQxWQiVq3s9g1tG5679zrfNA=
 =RkBU
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fixup for a recent fix that prevents an infinite loop during block
  group reclaim.

  Unfortunately it introduced an unsafe way of updating block group list
  and could race with relocation. This could be hit on fast devices when
  relocation/balance does not have enough space"

* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix adding block group to a reclaim list and the unused list during reclaim
2024-07-01 12:48:28 -07:00
Linus Torvalds
9b458a2600 vfs-6.10-rc7.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZoJSmAAKCRCRxhvAZXjc
 ot3tAQCUjJh7jZvmmkUV0pF51JI1jEumk8d8vPORGsm1A6oMawEA+tyiWYkcIU3t
 JUFGZSDce5MuJEI/frDPb98CW2dLkQA=
 =fVtx
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "Misc:

   - Don't misleadingly warn during filesystem thaw operations.

     It's possible that a block device which was frozen before it was
     mounted can cause a failing thaw operation if someone concurrently
     tried to mount it while that thaw operation was issued and the
     device had already been temporarily claimed for the mount (The
     mount will of course be aborted because the device is frozen).

  netfs:

   - Fix io_uring based write-through. Make sure that the total request
     length is correctly set.

   - Fix partial writes to folio tail.

   - Remove some xarray helpers that were intended for bounce buffers
     which got defered to a later patch series.

   - Make netfs_page_mkwrite() whether folio->mapping is vallid after
     acquiring the folio lock.

   - Make netfs_page_mkrite() flush conflicting data instead of waiting.

  fsnotify:

   - Ensure that fsnotify creation events are generated before fsnotify
     open events when a file is created via ->atomic_open(). The
     ordering was broken before.

   - Ensure that no fsnotify events are generated for O_PATH file
     descriptors. While no fsnotify open events were generated, fsnotify
     close events were. Make it consistent and don't produce any"

* tag 'vfs-6.10-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix netfs_page_mkwrite() to flush conflicting data, not wait
  netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid
  netfs: Delete some xarray-wangling functions that aren't used
  netfs: Fix early issue of write op on partial write to folio tail
  netfs: Fix io_uring based write-through
  vfs: generate FS_CREATE before FS_OPEN when ->atomic_open used.
  fsnotify: Do not generate events for O_PATH file descriptors
  fs: don't misleadingly warn during thaw operations
2024-07-01 09:22:08 -07:00
Naohiro Aota
48f091fd50 btrfs: fix adding block group to a reclaim list and the unused list during reclaim
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.

Fix it by taking fs_info->unused_bgs_lock.

  [177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
  [177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
  [177.529][T2585409] ------------[ cut here ]------------
  [177.537][T2585409] kernel BUG at lib/list_debug.c:65!
  [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
  [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
  [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
  [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
  [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
  [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
  [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
  [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
  [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
  [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
  [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
  [177.687][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
  [177.700][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
  [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
  [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
  [177.742][T2585409] PKRU: 55555554
  [177.748][T2585409] Call Trace:
  [177.753][T2585409]  <TASK>
  [177.759][T2585409]  ? __die_body.cold+0x19/0x27
  [177.766][T2585409]  ? die+0x2e/0x50
  [177.772][T2585409]  ? do_trap+0x1ea/0x2d0
  [177.779][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
  [177.788][T2585409]  ? do_error_trap+0xa3/0x160
  [177.795][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
  [177.805][T2585409]  ? handle_invalid_op+0x2c/0x40
  [177.812][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
  [177.820][T2585409]  ? exc_invalid_op+0x2d/0x40
  [177.827][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
  [177.834][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
  [177.843][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]

There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.

The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.

That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.

The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-01 17:33:15 +02:00
Gao Xiang
9b32b063be erofs: ensure m_llen is reset to 0 if metadata is invalid
Sometimes, the on-disk metadata might be invalid due to user
interrupts, storage failures, or other unknown causes.

In that case, z_erofs_map_blocks_iter() may still return a valid
m_llen while other fields remain invalid (e.g., m_plen can be 0).

Due to the return value of z_erofs_scan_folio() in some path will
be ignored on purpose, the following z_erofs_scan_folio() could
then use the invalid value by accident.

Let's reset m_llen to 0 to prevent this.

Link: https://lore.kernel.org/r/20240629185743.2819229-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-06-30 10:54:28 +08:00
Kent Overstreet
b5cbb42dc5 bcachefs: Repair fragmentation_lru in alloc_write_key()
fragmentation_lru derives from dirty_sectors, and wasn't being checked.

Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:37:13 -04:00
Kent Overstreet
d39881d2da bcachefs: add check for missing fragmentation in check_alloc_to_lru_ref()
We need to make sure we're not missing any fragmenation entries in the
LRU BTREE after repairing ALLOC BTREE

Also, use the new bch2_btree_write_buffer_maybe_flush() helper; this was
only working without it before since bucket invalidation (usually)
wasn't happening while fsck was running.

Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:35:18 -04:00
Kent Overstreet
92e1c29ae8 bcachefs: bch2_btree_write_buffer_maybe_flush()
Add a new helper for checking references to write buffer btrees, where
we need a flush before we definitively know we have an inconsistency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:34:52 -04:00
Kent Overstreet
ef05bdf5d6 bcachefs: Add missing printbuf_tabstops_reset() calls
Fixes warnings from bch2_print_allocator_stuck()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:14:18 -04:00
Linus Torvalds
27b31deb90 Bug fixes for 6.10-rc6:
* Always free only post-EOF delayed allocations for files with the
     XFS_DIFLAG_PREALLOC or APPEND flags set.
   * Do not align cow fork delalloc to cowextsz hint when running low on space.
   * Allow zero-size symlinks and directories as long as the link count is
     zero.
   * Change XFS_IOC_EXCHANGE_RANGE to be a _IOW only ioctl. This was ioctl was
     introduced during v6.10 developement cycle.
   * xfs_init_new_inode() now creates an attribute fork on a newly created
     inode even if ATTR feature flag is not enabled.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZnvYdwAKCRAH7y4RirJu
 9DRmAP9VwmSgBrVGZ459K6LluP12FoIpzUljEYSiQiyjhxuQJgD/fou/8G+/TTQH
 3TtdmC8Xo7SWRMq9+wPpH5OywbsvZQM=
 =fV8d
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Always free only post-EOF delayed allocations for files with the
   XFS_DIFLAG_PREALLOC or APPEND flags set.

 - Do not align cow fork delalloc to cowextsz hint when running low on
   space.

 - Allow zero-size symlinks and directories as long as the link count is
   zero.

 - Change XFS_IOC_EXCHANGE_RANGE to be a _IOW only ioctl. This was ioctl
   was introduced during v6.10 developement cycle.

 - xfs_init_new_inode() now creates an attribute fork on a newly created
   inode even if ATTR feature flag is not enabled.

* tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
  xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
  xfs: allow unlinked symlinks and dirs with zero size
  xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  xfs: fix freeing speculative preallocations for preallocated files
2024-06-29 09:21:40 -07:00
Kent Overstreet
67c564111f bcachefs: Fix loop restart in bch2_btree_transactions_read()
Accidental infinite loop; also fix btree_deadlock_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 21:08:48 -04:00
Kent Overstreet
1539bdf516 bcachefs: Fix bch2_read_retry_nodecode()
BCH_READ_NODECODE mode - used by the move paths - really wants to use
only the original rbio, but the retry path really wants to clone - oof.

Make sure to copy the crc of the pointer we read from back to the
original rbio, or we'll see spurious checksum errors later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 20:47:04 -04:00
Kent Overstreet
44ec599035 bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.

Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:47:31 -04:00
Kent Overstreet
a0bd30e4ea bcachefs: Fix shift greater than integer size
Reported-by: syzbot+e5292b50f1957164a4b6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:42:22 -04:00
Kent Overstreet
600b8be5e7 bcachefs: Change bch2_fs_journal_stop() BUG_ON() to warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:16:41 -04:00
Kent Overstreet
84db600016 bcachefs: Delete old faulty bch2_trans_unlock() call
the unlock is now in read_extent, this fixes an assertion pop in
read_from_stale_dirty_pointer()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 13:30:13 -04:00
Linus Torvalds
6c0483dbfe nfsd-6.10 fixes:
- Due to a late review, revert and re-fix a recent crasher fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZ+zksACgkQM2qzM29m
 f5ey0Q/+O8Z//rVg8L8V1EI44gZ1LRTiJEeDFoJU1Hktn0dOwTvGQJNzNbfpTBbT
 zTvvDBd9+cbHQ7Jl2r73+wkqgZbGXsRhMRUCmlA5ur0O7ujkYDo3rpOgQhP4V3LV
 1G6o2uOJT9d9UykUF74Demlg97kNrc8CY1QNujXjKrj8jYpQY4KcRCDb91KdYgc9
 GSque9zMD2jTJkTkSVRUyfWGEDvYtDxs7/4rdErjixaY30dgkPHae6euRzoQd2sC
 v3aoKHSt5Y/pvOX+3/c2QzDqOiKPOnm3dLazB+rK6lYrnt4iUeP3CiPcFmqNJZk2
 kZ3hpCPN8Le3oQFlMe450EljB8wLj2DqLix+Y/iOGyA732GLxNW6Hu8PkwRVFFwx
 cGGp6KryrcwO/BNHgxceErfhBG4IPKSgnY6C8Io+mPYKwSRAOKA4su02RoJDWJkA
 PTf67JgZiKv+bn3EGojeTo9YAvb04EYstttK2yBFUR9xylyew9vaZ9X8mYkqfyHN
 pZEWf/SI8SExZteo3Ymx9HLcX5FYsBjgvigpow33kUCn2hMhuhfBXEtmLNsxOSYC
 aEICcAYqnBovFJ+3vbz11IDS04ERIFdprlLQuExtJ21pujIlIM9pOxrIGvgu5pPl
 ++rsDP31Kh6imuZbSaJ5TFTdtDqcpah587m+mQtlaXAjOQL1qfw=
 =gwat
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Due to a late review, revert and re-fix a recent crasher fix

* tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "nfsd: fix oops when reading pool_stats before server is started"
  nfsd: initialise nfsd_info.mutex early.
2024-06-28 09:32:33 -07:00
Linus Torvalds
cd63a278ac bcachefs fixes for 6.10-rc6
simple stuff:
 - null ptr/err ptr deref fixes
 - fix for getting wedged on shutdown after journal error
 
 - fix missing recalc_capacity() call, capacity now changes correctly
   after a device goes read only
 
   however: our capacity calculation still doesn't take into account when
   we have mixed ro/rw devices and the ro devices have data on them,
   that's going to be a more involved fix to separate accounting for
   "capacity used on ro devices" and "capacity used on rw devices"
 
 - boring syzbot stuff
 
 slightly more involved:
 - discard, invalidate workers are now per device
   this has the effect of simplifying how we take device refs in these
   paths, and the device ref cleanup fixes a longstanding race between
   the device removal path and the discard path
 
 - fixes for how the debugfs code takes refs on btree_trans objects
   we have debugfs code that prints in use btree_trans objects. It uses
   closure_get() on trans->ref, which is mainly for the cycle detector,
   but the debugfs code was using it on a closure that may have hit 0,
   which is not allowed; for performance reasons we cannot avoid having
   not-in-use transactions on the global list.
 
   introduce some new primitives to fix this and make the synchronization
   here a whole lot saner
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZ+ye4ACgkQE6szbY3K
 bnb4shAAqkKdgB2abIaD1t8+KjUiXwt7seRY4EmzwrEaWniW5bDUYMBvV+tew93j
 uvGmSKMs4ML/r24hcg0zGPJ9GoWrFb3MWhPYizzRS8QspsUjsECJuehNPCe3RPaf
 QBgQtKahTge1e41y1frzkiGKqaOGOTtUVLOfPIebe+oJAhRCYRnrGY2dkZTms7Ue
 aXNtBmnlX3Fkmlm0GiKYrTHpAZz3d0kzdX11Pc2vTXvqo/znuJTTVGnjJkdrHzyv
 6cz6YnMKFdxLVbYO1KlB/3Hu9y9qt815g1rjvaqym8pDk9ltsGHNM3LcCCCyp7Of
 btnbLQ6TdfggK5Kf2hNYuJRY2pnjNyfcNxupQF3RNaw/D/4G5EU16zfFElORC6Mw
 eGwXLvDIGqOSSIvevoRZrgJKAvVptXNg9EtCI5Z5ujQ4ExW8ti1lPHp/r5SVOhyz
 x0Am14H2ERuz7Vt5jUas3k74+tAck6JWc5OemMQawA5waeH1inMT7QZuBt+Bmrhx
 Av0zbhaq4aTsHXmm+Xi6ofj3UBaOQ2rNzT7Au0kxdvJgDPe/USjw4tejV5DmjmHA
 SyRsTG7Zn5xJBi7jc47fcwUgUzlxlffVQGFCVjRUU1vF6u/Ldn7K0zfYbkwSCiKp
 iWSEyg3j5z5N69Vrgdadma4xTDjL/C5+XsMWh8G8ohf+crhUeSo=
 =svIi
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Simple stuff:

   - NULL ptr/err ptr deref fixes

   - fix for getting wedged on shutdown after journal error

   - fix missing recalc_capacity() call, capacity now changes correctly
     after a device goes read only

     however: our capacity calculation still doesn't take into account
     when we have mixed ro/rw devices and the ro devices have data on
     them, that's going to be a more involved fix to separate accounting
     for "capacity used on ro devices" and "capacity used on rw devices"

   - boring syzbot stuff

  Slightly more involved:

   - discard, invalidate workers are now per device

     this has the effect of simplifying how we take device refs in these
     paths, and the device ref cleanup fixes a longstanding race between
     the device removal path and the discard path

   - fixes for how the debugfs code takes refs on btree_trans objects we
     have debugfs code that prints in use btree_trans objects.

     It uses closure_get() on trans->ref, which is mainly for the cycle
     detector, but the debugfs code was using it on a closure that may
     have hit 0, which is not allowed; for performance reasons we cannot
     avoid having not-in-use transactions on the global list.

     Introduce some new primitives to fix this and make the
     synchronization here a whole lot saner"

* tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: Fix kmalloc bug in __snapshot_t_mut
  bcachefs: Discard, invalidate workers are now per device
  bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
  bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
  bcachefs: Add missing bch2_journal_do_writes() call
  bcachefs: Fix null ptr deref in journal_pins_to_text()
  bcachefs: Add missing recalc_capacity() call
  bcachefs: Fix btree_trans list ordering
  bcachefs: Fix race between trans_put() and btree_transactions_read()
  closures: closure_get_not_zero(), closure_return_sync()
  bcachefs: Make btree_deadlock_to_text() clearer
  bcachefs: fix seqmutex_relock()
  bcachefs: Fix freeing of error pointers
2024-06-28 09:25:21 -07:00
Kent Overstreet
759b2e800f bcachefs: Switch online_reserved shutdown assert to WARN()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 11:06:31 -04:00
Linus Torvalds
adfbe3640b asm-generic fixes for 6.10
These are some bugfixes for system call ABI issues I found while
 working on a cleanup series. None of these are urgent since these
 bugs have gone unnoticed for many years, but I think we probably
 want to backport them all to stable kernels, so it makes sense
 to have the fixes included as early as possible.
 
 One more fix addresses a compile-time warning in kallsyms that was
 uncovered by a patch I did to enable additional warnings in 6.10. I had
 mistakenly thought that this fix was already merged through the module
 tree, but as Geert pointed out it was still missing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmZ9iRQACgkQYKtH/8kJ
 UicHIxAA0ej8dMJ3znHovc/CQYkZMpb88bxLlqLotOYuOItEzvR6wd7vnu4cPeZf
 nHguBiP9RAnzCZhL3F7AS3p8NNJ+P1OZo+sj6tZOANO955mzj1VQ5p2fbSRw+WI3
 4Oc1HKvP6UMhHGjU3wHY0+Odd5bpoepN9/fnoiQcHPzq0LbUFM8e4D9KGr51I7fV
 r7tuDMy9xykEfs6umuDu9wOXih3JkpV9eSmefmjvzgxG3hKLdsvTbWVsVmnKXhZm
 xdFiTROOmiNvttfkQh0ruBd0drBl8aVhzCKPqIe0vQqS9rBmcf9WTkcJzpihq/fI
 BA3QjVQFvmHeXs+viaLZf4r/y0qabaTPRBMQxZyEFE0QgtwfxT4/ZnNEbH2s3pIC
 Pcm0JltLlHLbZs7V63drL6txCoFVndiPXdEBTBsqBwnuDHXCj/tvDcO3tuVTfYoz
 9G8TTOsYNEDLYmn8AmzzhJOh75gp6O6A2ui3TtcD9KFNaoTQqqzPJWp8IoxBfxcb
 3+rzRWQvXAhfSRBIaejv1quo2ZxoZk3KO3i+ysRITTUF1MLz7b0/Yy/8r74CqmOu
 8Iw2Q0BaFPtj1x+VjneQnL++iYWYPEh+ZBEg7AD/z6QHwMLz33SyHlD+/RgRkthV
 J/L9xUBs5HagWJxRYkVc+l0LOVclTqVJieKD2AWONZ5OFRB+CCI=
 =ieQy
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic fixes from Arnd Bergmann:
 "These are some bugfixes for system call ABI issues I found while
  working on a cleanup series. None of these are urgent since these bugs
  have gone unnoticed for many years, but I think we probably want to
  backport them all to stable kernels, so it makes sense to have the
  fixes included as early as possible.

  One more fix addresses a compile-time warning in kallsyms that was
  uncovered by a patch I did to enable additional warnings in 6.10. I
  had mistakenly thought that this fix was already merged through the
  module tree, but as Geert pointed out it was still missing"

* tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  kallsyms: rework symbol lookup return codes
  linux/syscalls.h: add missing __user annotations
  syscalls: mmap(): use unsigned offset type consistently
  s390: remove native mmap2() syscall
  hexagon: fix fadvise64_64 calling conventions
  csky, hexagon: fix broken sys_sync_file_range
  sh: rework sync_file_range ABI
  powerpc: restore some missing spu syscalls
  parisc: use generic sys_fanotify_mark implementation
  parisc: use correct compat recv/recvfrom syscalls
  sparc: fix compat recv/recvfrom syscalls
  sparc: fix old compat_sys_select()
  syscalls: fix compat_sys_io_pgetevents_time64 usage
  ftruncate: pass a signed offset
2024-06-27 10:53:52 -07:00
Linus Torvalds
66e55ff12e for-6.10-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZ9gV0ACgkQxWXV+ddt
 WDsK2hAAmbOArbK5QHprawdOqqvJL46yoGCMba798EjYo53+hO1F8/lb531+zCUI
 GDhdIC2mHdRIkARJ8Cde5POUjID1Kv3Rlc0rdHy3nOw38WZmA/+HdkcKzQhsDFSR
 /FX9RKSWiu0xl6JdCLh4KkIWE+2m1v1kybhvRHCKb+70iBua1e+OSoM33BeiIhrP
 yoFwMwIbzG2CoZOHoobDxUjs9ZMUHm4wH0csJYG9R59Vv7uLBOgpWuQB46iqpoj4
 EYR8Sg8PscI7YXa0y8VTP3pdrNMW48IC6jerIAKHUeWRWRoTCh9He+3/E4xjNHxz
 3Pm+Aat5QYdsqmE68IbeN5c7QB1YAdUCgoJJJwAFjwe9WtTn8RS9RiMjWIr+VHYm
 E3REibQI151p0yHwAl8xPHDiTecmlNisof0eg6gzHdvODm/NYFuFapD+aDxWribX
 G63dOa8Fy0h4pwDoF73Rd2YYbtO6tDVSNVIG3bWpPep3r+SI/oC4JMHbmn1aqmqF
 c0/ZYVbsx/Hm066l4LCrpgi7kJ8en2zQ8MmcZHHt+2gXe1AyAON7kvRHaEizUCaA
 fnLVhQvaOofC4g7DJc1JkwyLc9VF5hMTYUhldoJvqj1wm2qsT/siJgKVvllRGoMs
 FU2qlWYaN/fPRylyEySyZPq/sWKHOAZZSOhyWM8SB2nYoBXNkO0=
 =qpEp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix quota root leak after quota disable failure

 - fix condition when checking if a zone can be added as free

 - allocate inode in NOFS context during logging or tree-log replay

 - handle raid-stripe-tree lookup correctly during scrub

* tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: qgroup: fix quota root leak after quota disable failure
  btrfs: scrub: handle RST lookup error correctly
  btrfs: zoned: fix initial free space detection
  btrfs: use NOFS context when getting inodes during logging and log replay
2024-06-27 10:26:16 -07:00
David Howells
9d66154f73
netfs: Fix netfs_page_mkwrite() to flush conflicting data, not wait
Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not
filemap_fdatawait_range() to flush conflicting data.

Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/614300.1719228243@warthog.procyon.org.uk
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26 14:19:08 +02:00
David Howells
a81c98bfa4
netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has
taken the folio lock (as filemap_page_mkwrite() does).  Without this,
generic/247 occasionally oopses with something like the following:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page

    RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0
    ...
    Call Trace:
     <TASK>
     ? __die_body+0x1a/0x60
     ? page_fault_oops+0x6e/0xa0
     ? exc_page_fault+0xc2/0xe0
     ? asm_exc_page_fault+0x22/0x30
     ? trace_event_raw_event_netfs_folio+0x61/0xc0
     trace_netfs_folio+0x39/0x40
     netfs_page_mkwrite+0x14c/0x1d0
     do_page_mkwrite+0x50/0x90
     do_pte_missing+0x184/0x200
     __handle_mm_fault+0x42d/0x500
     handle_mm_fault+0x121/0x1f0
     do_user_addr_fault+0x23e/0x3c0
     exc_page_fault+0xc2/0xe0
     asm_exc_page_fault+0x22/0x30

This is due to the invalidate_inode_pages2_range() issued at the end of the
DIO write interfering with the mmap'd writes.

Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26 14:18:17 +02:00
David Howells
84dfbc9cad
netfs: Delete some xarray-wangling functions that aren't used
Delete some xarray-based buffer wangling functions that are intended for
use with bounce buffering, but aren't used because bounce-buffering got
deferred to a later patch series.  Now, however, the intention is to use
something other than an xarray to do this.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26 14:16:49 +02:00
David Howells
6470e0bc6f
netfs: Fix early issue of write op on partial write to folio tail
During the writeback procedure, at the end of netfs_write_folio(), pending
write operations are flushed if the amount of write-streaming data stored
in a page is less than the size of the folio because if we haven't modified
a folio to the end, it cannot be contiguous with the following folio...
except if the dirty region of the folio is right at the end of the folio
space.

Fix the test to take the offset into the folio into account as well, such
that if the dirty region runs right up to the end of the folio, we leave
the flushing for later.

Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space)
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26 14:16:04 +02:00
David Howells
d98b7d7dda
netfs: Fix io_uring based write-through
[This was included in v2 of 9b038d004ce95551cb35381c49fe896c5bc11ffe, but
v1 got pushed instead]

Fix netfs_unbuffered_write_iter_locked() to set the total request length in
the netfs_io_request struct rather than leaving it as zero.

Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <stfrench@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: Christian Brauner <brauner@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-2-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26 14:15:26 +02:00
Huang Xiaojia
cc69a681b2 erofs: convert to use super_set_uuid to support for FS_IOC_GETFSUUID
FS_IOC_GETFSUUID ioctl exposes the uuid of a filesystem. To support
the ioctl, init sb->s_uuid with super_set_uuid().

Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240624063704.2476070-1-huangxiaojia2@huawei.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-06-26 17:02:28 +08:00
Sandeep Dhavale
9d01f6f6d8 erofs: fix possible memory leak in z_erofs_gbuf_exit()
Because we incorrectly reused of variable `i` in `z_erofs_gbuf_exit()`
for inner loop, we may exit early from outer loop resulting in memory
leak. Fix this by using separate variable for iterating through inner loop.

Fixes: f36f3010f676 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable")
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240624220206.3373197-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-06-26 17:02:15 +08:00
Darrick J. Wong
673cd885bb xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
xfs_init_new_inode ignores the init_xattrs parameter for filesystems
that do not have ATTR enabled.  As a result, the first init_xattrs file
to be created by the kernel will not have an attr fork created to store
acls.  Storing that first acl will add ATTR to the superblock flags, so
subsequent files will be created with attr forks.  The overhead of this
is so small that chances are that nobody has noticed this behavior.

However, this is disastrous on a filesystem with parent pointers because
it requires that a new linkable file /must/ have a pre-existing attr
fork, and the parent pointers code uses init_xattrs to create that fork.
The preproduction version of mkfs.xfs used to set this, but the V5 sb
verifier only requires ATTR2, not ATTR.  There is no guard for
filesystems with (PARENT && !ATTR).

It turns out that I misunderstood the two flags -- ATTR means that we at
some point created an attr fork to store xattrs in a file; ATTR2
apparently means only that inodes have dynamic fork offsets or that the
filesystem was mounted with the "attr2" option.

Fixes: 2442ee15bb1e ("xfs: eager inode attr fork init needs attr feature awareness")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
dc5e1cbae2 xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl.  Change this before 6.10 final
releases.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
1ec9307fc0 xfs: allow unlinked symlinks and dirs with zero size
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd876 changed the inode verifier to
prohibit zero-length symlinks and directories.  If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.

Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.

Fixes: 3c6f46eacd876 ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:25 +05:30
Darrick J. Wong
288e1f693f xfs: restrict when we try to align cow fork delalloc to cowextsz hints
xfs/205 produces the following failure when always_cow is enabled:

  --- a/tests/xfs/205.out	2024-02-28 16:20:24.437887970 -0800
  +++ b/tests/xfs/205.out.bad	2024-06-03 21:13:40.584000000 -0700
  @@ -1,4 +1,5 @@
   QA output created by 205
   *** one file
  +   !!! disk full (expected)
   *** one file, a few bytes at a time
   *** done

This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint.  Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.

We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation.  There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.

There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space.  Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.

Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry.  This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.

I probably should have caught this six years ago when 6ca30729c206d was
being reviewed, but oh well.  Update the comments to reflect what the
code does now.

Fixes: 6ca30729c206d ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:24 +05:30
Christoph Hellwig
610b29161b xfs: fix freeing speculative preallocations for preallocated files
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks.  This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.

Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-26 14:29:17 +05:30
Pei Li
64cd7de998 bcachefs: Fix kmalloc bug in __snapshot_t_mut
When allocating too huge a snapshot table, we should fail gracefully
in __snapshot_t_mut() instead of fail in kmalloc().

Reported-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=770e99b65e26fa023ab1
Tested-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 20:51:14 -04:00
Kent Overstreet
64ee1431cc bcachefs: Discard, invalidate workers are now per device
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.

Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 18:47:55 -04:00
Pei Li
472237b69d bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
This series fix the shift-out-of-bounds issue in
bch2_blacklist_entries_gc().

Instead of passing 0 to eytzinger0_first() when iterating the entries,
we explicitly check 0 and initialize i to be 0.

syzbot has tested the proposed patch and the reproducer did not trigger
any issue:

Reported-and-tested-by: syzbot+835d255ad6bc7f29ee12@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=835d255ad6bc7f29ee12
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 17:53:31 -04:00
Pei Li
211c581de2 bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
Acquire fsck_error_counts_lock before accessing the critical section
protected by this lock.

syzbot has tested the proposed patch and the reproducer did not trigger
any issue.

Reported-by: syzbot+a2bc0e838efd7663f4d9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a2bc0e838efd7663f4d9
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 17:51:26 -04:00
NeilBrown
e0011bca60 nfsd: initialise nfsd_info.mutex early.
nfsd_info.mutex can be dereferenced by svc_pool_stats_start()
immediately after the new netns is created.  Currently this can
trigger an oops.

Move the initialisation earlier before it can possibly be dereferenced.

Fixes: 7b207ccd9833 ("svc: don't hold reference for poolstats, only mutex.")
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/all/c2e9f6de-1ec4-4d3a-b18d-d5a6ec0814a0@linux.ibm.com/
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-06-25 10:15:39 -04:00
Christoph Hellwig
54e7d59841 nfs: drop the incorrect assertion in nfs_swap_rw()
Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages then unplug them all together. 
That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
actually equals the size of iov_iter_npages(iter, INT_MAX).

Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.

[v-songbaohua@oppo.com: figure out the cause and correct the commit message]
Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com
Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/
Reviewed-by: Martin Wege <martin.l.wege@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <trondmy@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:11 -07:00