89300 Commits

Author SHA1 Message Date
Kent Overstreet
5b6271b509 bcachefs: Cleanup bch2_dirent_lookup_trans()
Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __
from the name - this is a normal interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
23f2522315 bcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()
Minor renaming for clarity, bit of refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
6b83aee8a4 bcachefs: Workqueues should be WQ_HIGHPRI
Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
3f305e0498 bcachefs: Improve bch2_dirent_to_text()
For DT_SUBVOL, we now print both parent and child subvol IDs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
7b05ecbafc bcachefs: fixup for building in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
e6fab655e6 bcachefs: Avoid taking journal lock unnecessarily
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.

This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
bdec47f57f bcachefs: Journal writes should be REQ_SYNC|REQ_META
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
a4e9233911 bcachefs: Avoid setting j->write_work unnecessarily
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
656f05d8bd bcachefs: Split out journal workqueue
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
4f70176cb9 bcachefs: Kill unnecessary wakeups in journal reclaim
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Guoyu Ou
0be5b38bce bcachefs: skip invisible entries in empty subvolume checking
When we are checking whether a subvolume is empty in the specified snapshot,
entries that do not belong to this subvolume should be skipped.

This fixes the following case:

    $ bcachefs subvolume create ./sub
    $ cd sub
    $ bcachefs subvolume create ./sub2
    $ bcachefs subvolume snapshot . ./snap
    $ ls -a snap
    . ..
    $ rmdir snap
    rmdir: failed to remove 'snap': Directory not empty

As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore subvols
in the subvol we are checking, because inode.bi_subvol is only set on
subvolume roots, and we can't go through every inode in the subvolume and
change bi_subvol when taking a snapshot. It makes the check less strict, but
that's ok, the rest of fsck will still catch it.

Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:07 -04:00
Kent Overstreet
067f244c9e bcachefs: fix split brain message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:30:56 -04:00
Kent Overstreet
fadc6067f2 bcachefs: Set path->uptodate when no node at level
We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:30:56 -04:00
Kent Overstreet
94817db956 bcachefs: Correctly validate k->u64s in btree node read path
validate_bset_keys() never properly validated k->u64s; it checked if it
was 0, but not if it was smaller than keys for the given packed format;
this fixes that small oversight.

This patch was backported, so it's adding quite a few error enums so
that they don't get renumbered and we don't have confusing gaps.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:21:04 -04:00
Kent Overstreet
b3eba6a4a7 bcachefs: Fix degraded mode fsck
We don't know where the superblock and journal lives on offline devices;
that means if a device is offline fsck can't check those buckets.

Previously, fsck would incorrectly clear bucket data types for those
buckets on offline devices; now we just use the previous state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:18:45 -04:00
Kent Overstreet
ba89083e9f bcachefs: Fix journal replay with unreadable btree roots
When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.

This adds bch2_btree_increase_depth(), so that journal replay can handle
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:18:13 -04:00
Kent Overstreet
52f3a72fa7 bcachefs: fix check_inode_deleted_list()
check_inode_deleted_list() returns true if the inode is on the deleted
list; check_inode() was checking the return code incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:17:00 -04:00
Kent Overstreet
2f300f09c7 bcachefs: no_splitbrain_check option
This adds an option to disable kicking out devices when splitbrain is
detected - it seems there's some issues with splitbrain detection and
we're kicking out devices erronously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:12:54 -04:00
Kent Overstreet
88005d5dfb bcachefs: extent_entry_next_safe()
We need to be able to iterate over extent ptrs that may be corrupted in
order to print them - this fixes a bug where we'd pop an assert in
bch2_bkey_durability_safe().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:12:13 -04:00
Kent Overstreet
6fa30fe7f7 bcachefs: journal_seq_blacklist_add() now handles entries being added out of order
bch2_journal_seq_blacklist_add() was bugged when the new entry
overlapped with multiple existing entries, and it also assumed new
entries are being added in increasing order.

This is true on any sane filesystem, but when trying to recover from
very badly mangled filesystems we might end up with the journal sequence
number rewinding vs. what the blacklist list knows about - easiest to
just handle that here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:09:59 -04:00
Li Zetao
f8cdf65b51 bcachefs: Fix null-ptr-deref in bch2_fs_alloc()
There is a null-ptr-deref issue reported by kasan:

  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  Call Trace:
    <TASK>
    bch2_fs_alloc+0x1092/0x2170 [bcachefs]
    bch2_fs_open+0x683/0xe10 [bcachefs]
    ...

When initializing the name of bch_fs, it needs to dynamically alloc memory
to meet the length of the name. However, when name allocation failed, it
will cause a null-ptr-deref access exception in subsequent string copy.

Fix this issue by checking if name allocation is successful.

Fixes: 401ec4db6308 ("bcachefs: Printbuf rework")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:09:59 -04:00
Linus Torvalds
e231dbd452 bcachefs fixes for 6.8-rc6
Some more mostly boring fixes, but some not
 
 User reported ones:
  - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty performance
    bug; user reported an unter initially taking 2 seconds and then ~2
    minutes
 
  - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
    from the trickier fix to kill __GFP_NOFAIL in readahead, where we
    can't return errors (and have to silently truncate the read
    ourselves).
 
    bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
    filesystems because our folio state is just barely too big, 2MB
    hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
 
    additionally, the flags argument was just buggy, we weren't supplying
    GFP_KERNEL previously (!).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXbqqMACgkQE6szbY3K
 bnYjnhAApY0vT6eVIYrZ7JGR6tw++xw02xRkcNW4zFE8INAvxQor5TXMEKkJs9Ui
 owh8WZjydXe0FJPE+pROcHMfxkkup4yP2SafgzR8DGERBwZbV9x7hvUbdG90EngY
 V/MevV+vr6UaV7133sY70K8BqUA/yAlCmmtOQVFgGRprEtEPS4Ur3vYR5+IzA0N7
 OhNXu6LxzkYbrNp9qroCN2UEVgRDJ/Mtda6uHfIUrqOQMUhiq2og9kvzJXzIrW9l
 URxm4eFQtJe0Yz09Ppypve+FutJIbtuDEYbcMJNT9Ig7BosD5vDjy9nhp8A5Q1Uk
 oDWBbCJhDdSYSVC/EQY8bv0AaCkyCa7vshSoKq0fDCFJ8k+nQ1YMF5wNhfgJhtU9
 Tl2Qytphp9/dxkvpIsR/5iNhLply9xTka1Wkp3G+3QJk0c17Dftpvz0/WhKI0P2B
 d6y4mz/hfCtWoSQOJbJl3fM/ZVpjH54VHDmb7sGyb5f+bTUkX6OUoJ4os8MNKGcS
 GdpEoWt/IAQj69c7w8aama5TXJ4kYe0XtXwbHTRE4j1PIQJA5SPvVt+32spRtb6i
 1gIa94uWKYMuG2U0XGxookHfZZZaMQkl79oXJOYRiC589YVyZC1Lp5iqr027jHEQ
 1HacrWPekPfmrhchyIzpH1mHOgaS+FKoD7eKrkvj0QSxpwfwpbI=
 =KNWR
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Some more mostly boring fixes, but some not

  User reported ones:

   - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
     performance bug; user reported an untar initially taking two
     seconds and then ~2 minutes

   - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
     from the trickier fix to kill __GFP_NOFAIL in readahead, where we
     can't return errors (and have to silently truncate the read
     ourselves).

     bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
     filesystems because our folio state is just barely too big, 2MB
     hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.

     additionally, the flags argument was just buggy, we weren't
     supplying GFP_KERNEL previously (!)"

* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: fix bch2_save_backtrace()
  bcachefs: Fix check_snapshot() memcpy
  bcachefs: Fix bch2_journal_flush_device_pins()
  bcachefs: fix iov_iter count underflow on sub-block dio read
  bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
  bcachefs: Kill __GFP_NOFAIL in buffered read path
  bcachefs: fix backpointer_to_text() when dev does not exist
2024-02-25 15:31:57 -08:00
Kent Overstreet
5197728f81 bcachefs: fix bch2_save_backtrace()
Missed a call in the previous fix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25 15:45:36 -05:00
Linus Torvalds
4ca0d9894f Change since last update:
- Fix page refcount leak when looking up specific inodes
    introduced by metabuf reworking.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmXbQekRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qpkyg//cjKnuzI2gHL3ff1o0jiTMD11Y/EmwzQS
 C+JgE8WDMjUlVQpbwYnFWo/6se8qvk/fwTXPcRc45piF8YNZVchwsagxO626Kab7
 0xTX7ZUgjuth6ahrkAJZkJNMgO4mYf928uYB6EVWaTb4iF0iT6glEOsSVAR/cssL
 wYKbtgv3OP9t6fQcN/XL31hRkqr2MPk0y5Q27KyT4zo4lrx3xyah7Ndo3aEK/RcM
 +6FUwqRiDsgDF/Ga65ylDvEp9eA03OFNHBn4DrORe3B9KV75NkmSJf/8QVEceNV/
 9D072Hvt7/iyOq53AxWH3Jvp7aro/i0rvAHPbXZX4RVyqcJxaLYCQyrBFvQL0/Ie
 B+793Iua8zbkQCbZ85LpTGrxAb5WydlSJp10AuHTr2MO+wS8bBqf96Jp9x1MuS9D
 vqq7jbjwuZfnFUjpzu49GF6htG+WRVgY5TDzU6IAr3izXuqVxmz+zw5zExbGMmKm
 2S5lb+q68DOBP4YaelAQHh97k0pYDW3eQ6GhDD2FA4P/DOr8p3vszZFBeaqaL70v
 WS8z1wzOgEaleTRN4iRFlMCTvRLjOtDEaUNquohcHqJLe7w/DF7gzCU+0//8dVjD
 cZieFX8uZDtzzcjrlYWeTo8oHd0q2WYixbCN8P92/YXUFIVpfdl85ugyr9KvkZxd
 jKzpAy2LbSw=
 =/Bfj
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fix from Gao Xiang:

 - Fix page refcount leak when looking up specific inodes
   introduced by metabuf reworking

* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix refcount on the metabuf used for inode lookup
2024-02-25 09:53:13 -08:00
Linus Torvalds
66a97c2ec9 We still have some races in filesystem methods when exposed to RCU
pathwalk.  This series is a result of code audit (the second round
 of it) and it should deal with most of that stuff.  Exceptions: ntfs3
 ->d_hash()/->d_compare() and ceph_d_revalidate().  Up to maintainers (a
 note for NTFS folks - when documentation says that a method may not block,
 it *does* imply that blocking allocations are to be avoided.  Really).
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdroDAAKCRBZ7Krx/gZQ
 60dKAQCzp6rYr3ye4nylho9Rzu8LEpH04TuNf3C6JuyUaNHxHwEAvNLatZsyFnmV
 Ksp2Rg/IlKPNtQgYJ8xPxv9DFmNe8gI=
 =47Un
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull RCU pathwalk fixes from Al Viro:
 "We still have some races in filesystem methods when exposed to RCU
  pathwalk. This series is a result of code audit (the second round of
  it) and it should deal with most of that stuff.

  Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
  Up to maintainers (a note for NTFS folks - when documentation says
  that a method may not block, it *does* imply that blocking allocations
  are to be avoided. Really)"

[ More explanations for people who aren't familiar with the vagaries of
  RCU path walking: most of it is hidden from filesystems, but if a
  filesystem actively participates in the low-level path walking it
  needs to make sure the fields involved in that walk are RCU-safe.

  That "actively participate in low-level path walking" includes things
  like having its own ->d_hash()/->d_compare() routines, or by having
  its own directory permission function that doesn't just use the common
  helpers.  Having a ->d_revalidate() function will also have this issue.

  Note that instead of making everything RCU safe you can also choose to
  abort the RCU pathwalk if your operation cannot be done safely under
  RCU, but that obviously comes with a performance penalty. One common
  pattern is to allow the simple cases under RCU, and abort only if you
  need to do something more complicated.

  So not everything needs to be RCU-safe, and things like the inode etc
  that the VFS itself maintains obviously already are. But these fixes
  tend to be about properly RCU-delaying things like ->s_fs_info that
  are maintained by the filesystem and that got potentially released too
  early.   - Linus ]

* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4_get_link(): fix breakage in RCU mode
  cifs_get_link(): bail out in unsafe case
  fuse: fix UAF in rcu pathwalks
  procfs: make freeing proc_fs_info rcu-delayed
  procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
  nfs: fix UAF on pathwalk running into umount
  nfs: make nfs_set_verifier() safe for use in RCU pathwalk
  afs: fix __afs_break_callback() / afs_drop_open_mmap() race
  hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
  exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
  affs: free affs_sb_info with kfree_rcu()
  rcu pathwalk: prevent bogus hard errors from may_lookup()
  fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
2024-02-25 09:29:05 -08:00
Linus Torvalds
9b24349279 A couple of fixes - revert of regression from this cycle
and a fix for erofs failure exit breakage (had been there since
 way back).
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdrkZAAKCRBZ7Krx/gZQ
 67D8AP0eM68yZvbThA/Hb5iElDh3Aogt1AW/QAu9/alkDVHr+wD+PKqhamC8WXGk
 b1QZ5AOHQFwzkzdF4738fdbujquBWQE=
 =Ra0D
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "A couple of fixes - revert of regression from this cycle and a fix for
  erofs failure exit breakage (had been there since way back)"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  erofs: fix handling kern_mount() failure
  Revert "get rid of DCACHE_GENOCIDE"
2024-02-25 09:17:15 -08:00
Al Viro
9fa8e282c2 ext4_get_link(): fix breakage in RCU mode
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
0511fdb4a3 cifs_get_link(): bail out in unsafe case
->d_revalidate() bails out there, anyway.  It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup.  Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
053fc4f755 fuse: fix UAF in rcu pathwalks
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.

Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
e31f0a57ae procfs: make freeing proc_fs_info rcu-delayed
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
47458802f6 procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
c1b967d03c nfs: fix UAF on pathwalk running into umount
NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
	server->flags
	server->caps
	*(server->io_stats)
and, worst of all, call
	server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()).  We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
10a973fc4f nfs: make nfs_set_verifier() safe for use in RCU pathwalk
nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry.  For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked().  It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.

That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time.  If that happens, back-to-back attempts to kill dentry and
its parent are quite normal.  Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
275655d320 afs: fix __afs_break_callback() / afs_drop_open_mmap() race
In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
do queue_work(&vnode->cb_work).  In afs_drop_open_mmap() we decrement
->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.

The trouble is, there's nothing to prevent __afs_break_callback() from
seeing ->cb_nr_mmap before the decrement and do queue_work() after both
the decrement and flush_work().  If that happens, we might be in trouble -
vnode might get freed before the queued work runs.

__afs_break_callback() is always done under ->cb_lock, so let's make
sure that ->cb_nr_mmap can change from non-zero to zero while holding
->cb_lock (the spinlock component of it - it's a seqlock and we don't
need to mess with the counter).

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
af072cf683 hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
->d_hash() and ->d_compare() use those, so we need to delay freeing
them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
a13d1a4de3 exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
that is in process of getting shut down.

Besides, having nls and upcase table cleanup moved from ->put_super() towards
the place where sbi is freed makes for simpler failure exits.

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
529f89a9e4 affs: free affs_sb_info with kfree_rcu()
one of the flags in it is used by ->d_hash()/->d_compare()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
cdb67fdeed rcu pathwalk: prevent bogus hard errors from may_lookup()
If lazy call of ->permission() returns a hard error, check that
try_to_unlazy() succeeds before returning it.  That both makes
life easier for ->permission() instances and closes the race
in ENOTDIR handling - it is possible that positive d_can_lookup()
seen in link_path_walk() applies to the state *after* unlink() +
mkdir(), while nd->inode matches the state prior to that.

Normally seeing e.g. EACCES from permission check in rcu pathwalk
means that with some timings non-rcu pathwalk would've run into
the same; however, running into a non-executable regular file
in the middle of a pathname would not get to permission check -
it would fail with ENOTDIR instead.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
583340de1d fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
Avoids fun races in RCU pathwalk...  Same goes for freeing LSM shite
hanging off super_block's arse.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Kent Overstreet
c4333eb541 bcachefs: Fix check_snapshot() memcpy
check_snapshot() copies the bch_snapshot to a temporary to easily handle
older versions that don't have all the fields of the current version,
but it lacked a min() to correctly handle keys newer and larger than the
current version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:47:47 -05:00
Kent Overstreet
097471f9e4 bcachefs: Fix bch2_journal_flush_device_pins()
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:46:48 -05:00
Brian Foster
b58b1b883b bcachefs: fix iov_iter count underflow on sub-block dio read
bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large
enough. This is triggered by fstest generic/361 with a 4k block
size, which subsequently leads to a crash. To avoid this crash,
check that the shorten length doesn't exceed the overall length of
the iter.

Fixes:
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:45:24 -05:00
Kent Overstreet
204f45140f bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.

Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).

But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:46 -05:00
Kent Overstreet
04fee68dd9 bcachefs: Kill __GFP_NOFAIL in buffered read path
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the
easy one in read_single_folio() (where wa can return an error) was
missed - oops.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:42 -05:00
Kent Overstreet
1f626223a0 bcachefs: fix backpointer_to_text() when dev does not exist
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:37 -05:00
Linus Torvalds
1c892cdd8f vfs-6.8-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZddOmwAKCRCRxhvAZXjc
 oq1lAQDus0SGgwuwArdHtbbVj+gTs4s5XKvuGI6mqRiLvgvTzwD/TTNnOqJjWacS
 on7XxDHgnjbMR2r90W/MuyPPjtAPkgA=
 =i2E/
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix a memory leak in cachefiles

 - Restrict aio cancellations to I/O submitted through the aio
   interfaces as this is otherwise causing issues for I/O submitted
   via io_uring

 - Increase buffer for afs volume status to avoid overflow

 - Fix a missing zero-length check in unbuffered writes in the
   netfs library. If generic_write_checks() returns zero make
   netfs_unbuffered_write_iter() return right away

 - Prevent a leak in i_dio_count caused by netfs_begin_read() operating
   past i_size. It will return early and leave i_dio_count incremented

 - Account for ipv4 addresses as well as ipv6 addresses when processing
   incoming callbacks in afs

* tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
  afs: Increase buffer size in afs_update_volume_status()
  afs: Fix ignored callbacks over ipv4
  cachefiles: fix memory leak in cachefiles_add_cache()
  netfs: Fix missing zero-length check in unbuffered write
  netfs: Fix i_dio_count leak on DIO read past i_size
2024-02-22 10:06:29 -08:00
Sandeep Dhavale
56ee7db311 erofs: fix refcount on the metabuf used for inode lookup
In erofs_find_target_block() when erofs_dirnamecmp() returns 0,
we do not assign the target metabuf. This causes the caller
erofs_namei()'s erofs_put_metabuf() at the end to be not effective
leaving the refcount on the page.
As the page from metabuf (buf->page) is never put, such page cannot be
migrated or reclaimed. Fix it now by putting the metabuf from
previous loop and assigning the current metabuf to target before
returning so caller erofs_namei() can do the final put as it was
intended.

Fixes: 500edd095648 ("erofs: use meta buffers for inode lookup")
Cc: <stable@vger.kernel.org> # 5.18+
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240221210348.3667795-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-02-22 15:54:21 +08:00
Linus Torvalds
8da8d88455 for-6.8-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXV2z8ACgkQxWXV+ddt
 WDsudxAAoKcp1DbuOtaOzG/XVnIKt36drK4cwyZnGo9PZ9vlgT6k+T0efto4DkOF
 fNWy2d/9iGy9RHy4oxZL6ceb3rcWW0NbhiKHeTPNqL4ZCPa7t6bxMWXSYBh6pYgZ
 6EUS6H9Don05F7rQs8rERc+VIW6u1HFTLn4wS1cmlTyTQzZwlk9B2V6KtDtHBi0k
 B4CCxY6jX2bl7BncXdYteb13Xjg7+JnWvfSKb7ouSVnL8VEcGG13QkPFNV2Xsoi2
 uDDsw+QKBEcPNgBIubBUwbLS5V5vYa1H1meUFJkciaeblHVlVIMN3h7+Y8VNKMTC
 qpxEo3Hx6oqmw9LdEIU7WsvFs0JJum2fKOjOx3vr1d3AiyFG6W6lrm1fKwnp3dt7
 dHdAYuo8+Q4rirGlMDcEoYqpy7AcV8QqtSYajdrdpB1dqHcHhukSNqJ0dx5lYElU
 HtnMXD9vLc4uJDcl9Z1aTWEmB+7nj5HwukSnTqQhgwpCZM7mrz6pe1DuD5iKinG6
 Yth9QFhgcbcnchPA+3SOtC6uuje9chHo6L6eTVacnKAKyKgLY8qTTsh3zYQNVhMX
 M2aWcAkizq20vKe7JFxs7M/tClyuswTjOP6RYzeayY21Rn8gGwe7uhXhv5MKpEh5
 TjXjiixsrxxsyyaED7Kl69i54BvmM/TI35p7Jbx6Ln7PPD8lf8M=
 =pXyL
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - Fix a deadlock in fiemap.

   There was a big lock around the whole operation that can interfere
   with a page fault and mkwrite.

   Reducing the lock scope can also speed up fiemap

 - Fix range condition for extent defragmentation which could lead to
   worse layout in some cases

* tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix deadlock with fiemap and extent locking
  btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
2024-02-21 08:45:07 -08:00
Bart Van Assche
b820de741a
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
following kernel warning appears:

WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
Call trace:
 kiocb_set_cancel_fn+0x9c/0xa8
 ffs_epfile_read_iter+0x144/0x1d0
 io_read+0x19c/0x498
 io_issue_sqe+0x118/0x27c
 io_submit_sqes+0x25c/0x5fc
 __arm64_sys_io_uring_enter+0x104/0xab0
 invoke_syscall+0x58/0x11c
 el0_svc_common+0xb4/0xf4
 do_el0_svc+0x2c/0xb0
 el0_svc+0x2c/0xa4
 el0t_64_sync_handler+0x68/0xb4
 el0t_64_sync+0x1a4/0x1a8

Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
submitted by libaio.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-21 16:31:49 +01:00
Daniil Dulov
6ea38e2aeb
afs: Increase buffer size in afs_update_volume_status()
The max length of volume->vid value is 20 characters.
So increase idbuf[] size up to 24 to avoid overflow.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

[DH: Actually, it's 20 + NUL, so increase it to 24 and use snprintf()]

Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240211150442.3416-1-d.dulov@aladdin.ru/ # v1
Link: https://lore.kernel.org/r/20240212083347.10742-1-d.dulov@aladdin.ru/ # v2
Link: https://lore.kernel.org/r/20240219143906.138346-3-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-20 09:51:21 +01:00