Commit Graph

83975 Commits

Author SHA1 Message Date
Christian Brauner
56d2e2cfa2 ceph: allow idmapped mounts
Now that we converted cephfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
8a051b40ab ceph: allow idmapped atomic_open inode op
Enable ceph_atomic_open() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to 5fadbd9929 ("ceph: rely on vfs for
  setgid stripping") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
2cce72dda2 ceph: allow idmapped set_acl inode op
Enable ceph_set_acl() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
a04aff2588 ceph: allow idmapped setattr inode op
Enable __ceph_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to b27c82e129 ("attr: port attribute
  changes to new types") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn
79c66a0c8c ceph: pass idmap to __ceph_setattr
Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
8995375fae ceph: allow idmapped permission inode op
Enable ceph_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
0513043ec4 ceph: allow idmapped getattr inode op
Enable ceph_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Christian Brauner
09838f1bfd ceph: pass an idmapping to mknod/symlink/mkdir
Enable mknod/symlink/mkdir iops to handle idmapped mounts.
This is just a matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:34 +01:00
Alexander Mikhalitsyn
673478b6e5 ceph: add enable_unsafe_idmap module parameter
This parameter is used to decide if we allow
to perform IO on idmapped mount in case when MDS lacks
support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature.

In this case we can't properly handle MDS permission
checks and if UID/GID-based restrictions are enabled
on the MDS side then IO requests which go through an
idmapped mount may fail with -EACCESS/-EPERM.
Fortunately, for most of users it's not a case and
everything should work fine. But we put work "unsafe"
in the module parameter name to warn users about
possible problems with this feature and encourage
update of cephfs MDS.

Suggested-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner
5ccd8530dd ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.

In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmaping. If this is a request message that was triggered from an inode
operation that doens't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.

This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].

[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/

Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner
9c2df2271c ceph: stash idmapping in mdsc request
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.

If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.

In order to support idmapped mounts for cephfs we stash the idmapping
whenever they are relevant for the operation for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.

In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Alexander Mikhalitsyn
1b90344614 fs: export mnt_idmap_get/mnt_idmap_put
These helpers are required to support idmapped mounts in CephFS.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
522dc5108f libceph, ceph: move mdsmap.h to fs/ceph
The mdsmap.h is only used by CephFS, so move it to fs/ceph.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
38d46409c4 ceph: print cluster fsid and client global_id in all debug logs
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
5995d90d2d ceph: rename _to_client() to _to_fs_client()
We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li
197b7d792d ceph: pass the mdsc to several helpers
We will use the 'mdsc' to get the global_id in the following commits.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Linus Torvalds
d1b0949f23 assorted fixes all over the place
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZTxSwQAKCRBZ7Krx/gZQ
 6zadAP9o/724KPDCY3ybgwKyEQ1UNjHTriFRBeoF3o2q0WgidwEA+/xS0Xk3i25w
 xnSZO/8My1edE1IcK/JDwewH/J+4Kw0=
 =N/Lv
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc filesystem fixes from Al Viro:
 "Assorted fixes all over the place: literally nothing in common, could
  have been three separate pull requests.

  All are simple regression fixes, but not for anything from this cycle"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
  io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
  sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27 16:44:58 -10:00
Al Viro
dc32464a5f ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
Use of dget() after we'd dropped ->d_lock is too late - dentry might
be gone by that point.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-27 20:14:38 -04:00
Linus Torvalds
d88520ad73 fix for lock_rename() misuse in nfsd
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZS4N8wAKCRBZ7Krx/gZQ
 65q9AQDhucfo26czFALs6aOceZ1K+FUu3OzgU0gbQaCCLhuubwD/Uu3GXL2KrVaj
 uMk7Wv6a68/j1VXwtNMpSb0MV09j/wM=
 =xKoB
 -----END PGP SIGNATURE-----

Merge tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull nfsd fix from Al Viro:
 "Catch from lock_rename() audit; nfsd_rename() checked that both
  directories belonged to the same filesystem, but only after having
  done lock_rename().

  Trivial fix, tested and acked by nfs folks"

* tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: lock_rename() needs both directories to live on the same fs
2023-10-23 20:40:04 -10:00
Linus Torvalds
e017769f4c for-6.6-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU2lLEACgkQxWXV+ddt
 WDvCThAApe+zMNdEhQ/cgrvfzP/X91Q53PXQsdVsrujPyUV8eEV4oJzEwVbJhRdw
 3ukIQtvyAMNiWhEBhOQRwxjuUoTCApGAeEEEl1cWWEqQ7G2/2LS4+bcWzgQ3Vu32
 dzYL37ddsfe4n7OgfnymtMrnv7kge0XbAlY3GbavaDccZDQDqcD5wSAOyOhfIsH7
 kcu4sA5Fi44wVSfAJX1Dms+wXfsmQu/sd3c9Gcyce9Hpy1cEW3vWbApLBE4K0aKX
 /JHTdmkAJ20a4APQsfGH+UymyuZgr8d2eGmL9rVYKhT/c+Dow0lNAWYkvGf/MawM
 CX3GdP6f6ZOR/anCPZ8nqZCE5AoFykGazvpCCSrvCOpU7o7GqxbAQkWWFcMp1FHW
 9TFrj81WK18DeCfCNw7lR3sdMy/2o2nnSUAw3DFY4n/3Lek7FUmrBTHvXlWDot7T
 TM9CzYGF840QhL5s5SMYS09YmeI0I34L7HJAi/+qli48SooGuL9RZ29TmzHIX69Y
 2bgpS64j06p/AGEnfHAcT1LbpiFCPmO5cpXKv/t40GL5QO5d4WV698ysDGoPYUPO
 8CPL85Y8cao56KGJLyOroGz0P1bo+RdNe5bN6xJJoTRn1Y9oUA+bQSnN8x9iuunF
 9QZrAIHzNyDcRGzoqgDW+3bivOvIus/Dto/u1P3ap68kP2HTVsY=
 =gOyi
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix for a problem with snapshot of a newly created subvolume
  that can lead to inconsistent data under some circumstances. Kernel
  6.5 added a performance optimization to skip transaction commit for
  subvolume creation but this could end up with newer data on disk but
  not linked to other structures.

  The fix itself is an added condition, the rest of the patch is a
  parameter added to several functions"

* tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix unwritten extent buffer after snapshotting a new subvolume
2023-10-23 07:59:13 -10:00
Filipe Manana
eb96e22193 btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
   address 36007936, for the leaf/root of a new subvolume that has an ID
   of 291. We mark the extent buffer as dirty, and at this point the
   subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
   create_subvol(). We used to, but that was recently removed in
   commit 1b53e51a4a ("btrfs: don't commit transaction for every subvol
   create");

3) The transaction used to create the subvolume has an ID of 33, so the
   extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
   files created and its tree height changes from 0 to 1, so we end up with
   a new root at level 1 and the extent buffer 36007936 is now a leaf of
   that new root node, which is extent buffer 36048896.

   The commit root remains as 36007936, since we are still at transaction
   33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
   ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
   we end up at transaction.c:create_pending_snapshot(), in the critical
   section of a transaction commit.

   There we COW the root of subvolume 291, which is extent buffer 36048896.
   The COW operation returns extent buffer 36048896, since there's no need
   to COW because the extent buffer was created in this transaction and it
   was not written yet.

   The we call btrfs_copy_root() against the root node 36048896. During
   this operation we allocate a new extent buffer to turn into the root
   node of the snapshot, copy the contents of the root node 36048896 into
   this snapshot root extent buffer, set the owner to 292 (the ID of the
   snapshot), etc, and then we call btrfs_inc_ref(). This will create a
   delayed reference for each leaf pointed by the root node with a
   reference root of 292 - this includes a reference for the leaf
   36007936.

   After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

   Then we call btrfs_insert_dir_item(), to create the directory entry in
   in the tree of subvolume 291 that points to the snapshot. This ends up
   needing to modify leaf 36007936 to insert the respective directory
   items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
   we need to COW the leaf. We end up at btrfs_force_cow_block() and then
   at update_ref_for_cow().

   At update_ref_for_cow() we call btrfs_block_can_be_shared() which
   returns false, despite the fact the leaf 36007936 is shared - the
   subvolume's root and the snapshot's root point to that leaf. The
   reason that it incorrectly returns false is because the commit root
   of the subvolume is extent buffer 36007936 - it was the initial root
   of the subvolume when we created it. So btrfs_block_can_be_shared()
   which has the following logic:

   int btrfs_block_can_be_shared(struct btrfs_root *root,
                                 struct extent_buffer *buf)
   {
       if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
           buf != root->node && buf != root->commit_root &&
           (btrfs_header_generation(buf) <=
            btrfs_root_last_snapshot(&root->root_item) ||
            btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
               return 1;

       return 0;
   }

   Returns false (0) since 'buf' (extent buffer 36007936) matches the
   root's commit root.

   As a result, at update_ref_for_cow(), we don't check for the number
   of references for extent buffer 36007936, we just assume it's not
   shared and therefore that it has only 1 reference, so we set the local
   variable 'refs' to 1.

   Later on, in the final if-else statement at update_ref_for_cow():

   static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *root,
                                          struct extent_buffer *buf,
                                          struct extent_buffer *cow,
                                          int *last_ref)
   {
      (...)
      if (refs > 1) {
          (...)
      } else {
          (...)
          btrfs_clear_buffer_dirty(trans, buf);
          *last_ref = 1;
      }
   }

   So we mark the extent buffer 36007936 as not dirty, and as a result
   we don't write it to disk later in the transaction commit, despite the
   fact that the snapshot's root points to it.

   Attempting to access the leaf or dumping the tree for example shows
   that the extent buffer was not written:

   $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
   btrfs-progs v6.2.2
   file tree key (292 ROOT_ITEM 33)
   node 36110336 level 1 items 2 free space 119 generation 33 owner 292
   node 36110336 flags 0x1(WRITTEN) backref revision 1
   checksum stored a8103e3e
   checksum calced a8103e3e
   fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
           key (256 INODE_ITEM 0) block 36007936 gen 33
           key (257 EXTENT_DATA 0) block 36052992 gen 33
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   total bytes 107374182400
   bytes used 38572032
   uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

   The respective on disk region is full of zeroes as the device was
   trimmed at mkfs time.

   Obviously 'btrfs check' also detects and complains about this:

   $ btrfs check /dev/sdb
   Opening filesystem to check...
   Checking filesystem on /dev/sdb
   UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   generation: 33 (33)
   [1/7] checking root items
   [2/7] checking extents
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   owner ref check failed [36007936 4096]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   The following tree block(s) is corrupted in tree 292:
        tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
   root 292 root dir 256 not found
   ERROR: errors found in fs roots
   found 38572032 bytes used, error(s) found
   total csum bytes: 16048
   total tree bytes: 1265664
   total fs tree bytes: 1118208
   total extent tree bytes: 65536
   btree space waste bytes: 562598
   file data blocks allocated: 65978368
    referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

   $ cat test.sh
   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   # Use a filesystem with a 64K node size so that we have the same node
   # size on every machine regardless of its page size (on x86_64 default
   # node size is 16K due to the 4K page size, while on PPC it's 64K by
   # default). This way we can make sure we are able to create a btree for
   # the subvolume with a height of 2.
   mkfs.btrfs -f -n 64K $DEV
   mount $DEV $MNT

   btrfs subvolume create $MNT/subvol

   # Create a few empty files on the subvolume, this bumps its btree
   # height to 2 (root node at level 1 and 2 leaves).
   for ((i = 1; i <= 300; i++)); do
       echo -n > $MNT/subvol/file_$i
   done

   btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

   umount $DEV

   btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

   $ ./test.sh
   Create subvolume '/mnt/sdi/subvol'
   Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
   Opening filesystem to check...
   Checking filesystem on /dev/sdi
   UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
   [1/7] checking root items
   [2/7] checking extents
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   owner ref check failed [30539776 65536]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
   Wrong generation of child node/leaf, wanted: 5, have: 7
   root 257 root dir 256 not found
   ERROR: errors found in fs roots
   found 917504 bytes used, error(s) found
   total csum bytes: 0
   total tree bytes: 851968
   total fs tree bytes: 393216
   total extent tree bytes: 65536
   btree space waste bytes: 736550
   file data blocks allocated: 0
    referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-23 17:17:30 +02:00
Linus Torvalds
5722119f67 Bug fixes for 6.6-rc6:
* Fix a bug where a writev consisting of a bunch of sub-fsblock writes
   where the last buffer address is invalid could lead to an infinite
   loop.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZTFcYAAKCRBKO3ySh0YR
 pg6NAP4zKkrjdzOKyhu+NJDu4lnB4+8eq0PPnW5r5fw4155wTwEA1PKWp5vtKYRk
 A3QLraLNS0HxpsguHpfD8KFrPN6LPA8=
 =93ew
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fix from Darrick Wong:

 - Fix a bug where a writev consisting of a bunch of sub-fsblock writes
   where the last buffer address is invalid could lead to an infinite
   loop

* tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix short copy in iomap_write_iter()
2023-10-21 09:43:09 -07:00
Linus Torvalds
f74e3ea3ba A few more NFS Client Bugfixes for Linux 6.6-rc
Stable Fix:
   * Fix a pNFS hang in nfs4_evict_inode()
 
 Bugfixes:
   * Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
   * Fix a potential oops in nfs_inode_remove_request()
   * Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
   * Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmUy4cAACgkQ18tUv7Cl
 QOstABAAtK5pXSbzXXQY1IIVIxgIT9tGRMPwd+IVUwL0qRIhTqGkiax6UuZVH9id
 NxrTQFrrxC4np2X4atYdC4LggOqx6wcdNLQ9oJ56zULHUtycgjqCSjFrKbA3nFa6
 tn4EccPvKzjsmsbJW1rCq97XIRqZY1XotjDA3MR72YSnsOncdoLr6ng6virjOxWe
 t8K8WS4tScD0kwvclQLpsEDSFEEdjTYORdS3RVXVz4crM45gikmTI7bSyZ4vc0lT
 TafhFYMjcHUNyKF8MP0DguYSGTtHxCHuM/0yYZw0ZiqLdZeT7Pk/GVecV6qmMD9W
 REkidcFLV53RR4qjUvL0HuM71/cX+yfwGaj6e8nZg5b7s1pJB46IG58ZmQaPwyzP
 n9/zxZHIRrNSkj6e+cwFpgpz7v7wjTOsXVUqBLE9q0Rsm71L8TuzW0G/lghY8tE+
 TZ4rMVlZnypFybELmEgoRXRZa0G41SeQmKYNMv+wJIwpNyxbKkI3FgpAx0A4RuyP
 prYK4S55GxqKiOrP22USHRpvkhxdzeIkmnPGuy/A04VYaVJwANjUI0w0SAgg1/Oj
 iNLjvO80wIL7d3fRLpjcjiQ2DtVt63WcnMgR0dV138959vh9CKAF7u7xVdW4Zsk7
 cO5/WAFR5cUh2veswsi69+oivXweP57pi0XWmiJamKKckS/PPuQ=
 =Z3lY
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Stable Fix:
   - Fix a pNFS hang in nfs4_evict_inode()

  Fixes:
   - Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
   - Fix a potential oops in nfs_inode_remove_request()
   - Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
   - Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases"

* tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server
  pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats
  pNFS: Fix a hang in nfs4_evict_inode()
  NFS: Fix potential oops in nfs_inode_remove_request()
  nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op
2023-10-20 14:04:53 -07:00
Linus Torvalds
0e97fd2910 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmUyw0wACgkQnJ2qBz9k
 QNknFgf/TMbqrnLyro0JUY6w4b9mgXTGFqqnaSuopOGK31vlpfiNR6mYw+vGv82E
 FcN4CSQFBIK8v7DU4pgCzbD1OxGIdJuWz7tjnI4ntr/jMM+3pqqKYhu+VkKInrBB
 HEIMe/WjM0/LbX/wid6xdT3Bcz6lbAySXJFVtxU45umkuv8RGODCAr6Gf1jX3q7m
 LsIv8ESCmau5hyesp1Te4N8bv7dK8x3FPpaX12BB8DkuRlaqmzwHXc0ExpMRhII8
 LBllG2rUIu2GNx8AqWULw9LyBsNaZSeAF2iUl5taXaDXw8Js8eQzH/Y+wS5KaJNa
 M7kszLlAByav/MSuUWWJHOqwgMhhDQ==
 =bOXL
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fanotify fix from Jan Kara:
 "Disable superblock / mount marks for filesystems that can encode file
  handles but not open them (currently only overlayfs).

  It is not clear the functionality is useful in any way so let's better
  disable it before someone comes up with some creative misuse"

* tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: limit reporting of event with non-decodeable file handles
2023-10-20 14:00:05 -07:00
Jan Stancek
3ac974796e iomap: fix short copy in iomap_write_iter()
Starting with commit 5d8edfb900 ("iomap: Copy larger chunks from
userspace"), iomap_write_iter() can get into endless loop. This can
be reproduced with LTP writev07 which uses partially valid iovecs:
        struct iovec wr_iovec[] = {
                { buffer, 64 },
                { bad_addr, 64 },
                { buffer + 64, 64 },
                { buffer + 64 * 2, 64 },
        };

commit bc1bb416bb ("generic_perform_write()/iomap_write_actor():
saner logics for short copy") previously introduced the logic, which
made short copy retry in next iteration with amount of "bytes" it
managed to copy:

                if (unlikely(status == 0)) {
                        /*
                         * A short copy made iomap_write_end() reject the
                         * thing entirely.  Might be memory poisoning
                         * halfway through, might be a race with munmap,
                         * might be severe memory pressure.
                         */
                        if (copied)
                                bytes = copied;

However, since 5d8edfb900 "bytes" is no longer carried into next
iteration, because it is now always initialized at the beginning of
the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with
same value as previous iteration, making the loop retry same copy
over and over, which leads to writev07 testcase hanging.

Make next iteration retry with amount of bytes we managed to copy.

Fixes: 5d8edfb900 ("iomap: Copy larger chunks from userspace")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-19 09:41:36 -07:00
Linus Torvalds
ea1cc20cd4 v6.6-rc7.vfs.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
 opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
 51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
 =Dqrj
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fix from Christian Brauner:
 "An openat() call from io_uring triggering an audit call can apparently
  cause the refcount of struct filename to be incremented from multiple
  threads concurrently during async execution, triggering a refcount
  underflow and hitting a BUG_ON(). That bug has been lurking around
  since at least v5.16 apparently.

  Switch to an atomic counter to fix that. The underflow check is
  downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
  remove that check altogether tbh"

* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  audit,io_uring: io_uring openat triggers audit reference count underflow
2023-10-19 09:37:41 -07:00
Linus Torvalds
f69d00d12f driver ntfs3 for linux 6.6
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmUwyyAACgkQqbAzH4Mk
 B7aQsg/9FohvuET2YOqYlmtgbq1YjFlWkVxoNB2Be4gxKEqRcFN4V/O8PrOD0ZBX
 KH8ASWGp9EKQlVKucAPUhSaZMv4u2HaKTVyRnH5KpUmNG7rC7EzRgYDY4xA5RfjL
 aIxgVOhaSibFBGfys2KoS8MHFgZ9mcCWZZ1154ueSkZfX9ghytZcql5i4Ahi2OGQ
 u+aa/tkhoIS9ZKWp/uIz4R+fDClrG8UoiWmJ1LoOHQ5YM3enboOld6Zz1CjbC9yN
 ldInX7klL9aHRf3SoAeRVCyL23e4GB5R4+QmBhZQemhwy0jF0/7E0kCmCK0wYFWn
 3t1M/+i0S7o7rI+utTA2mTuZzxnNPMXJsn2TnpH9vv1o633WOlL0mrvsM7Z/NJDz
 VShPinQ9eJGvZQ9kYuMfZODtK0BP7OV8mRi7aXinzHucG+ISB2fBGxsnA0lX9EfT
 fHix1oNI33lfazVzaStJ3hs/n9KXwP/9P49rBcLh75O2t3TBKc2T84qpywNx+W6S
 9GDMV5V2FNWf/hMCL8pcTtab3MvpbeU5DqKnxW1D1vu2vdGiHVum/Aq26aE5W7nl
 EcE5pjKppdNZgk3Qwj190vym7d8lbK4Esua6ieNM/qvWb7hAGoZO7Tjcypvj+YMY
 W4aiJTW/jULPT2tkUvLvHg/r5DFHi30v6eCeTCrPfivxFRIAzBQ=
 =ScrJ
 -----END PGP SIGNATURE-----

Merge tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 fixes from Konstantin Komarov:

 - memory leak

 - some logic errors, NULL dereferences

 - some code was refactored

 - more sanity checks

* tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3:
  fs/ntfs3: Avoid possible memory leak
  fs/ntfs3: Fix directory element type detection
  fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e()
  fs/ntfs3: Fix OOB read in ntfs_init_from_boot
  fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea()
  fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame()
  fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr()
  fs/ntfs3: Do not allow to change label if volume is read-only
  fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo
  fs/ntfs3: Refactoring and comments
  fs/ntfs3: Fix alternative boot searching
  fs/ntfs3: Allow repeated call to ntfs3_put_sbi
  fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime
  fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super
  fs/ntfs3: fix deadlock in mark_as_free_ex
  fs/ntfs3: Add more attributes checks in mi_enum_attr()
  fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN)
  fs/ntfs3: Write immediately updated ntfs state
  fs/ntfs3: Add ckeck in ni_update_parent()
2023-10-19 09:10:18 -07:00
Linus Torvalds
7cf4bea77a for-6.6-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUuntEACgkQxWXV+ddt
 WDssdQ/9Fo6tN+MCH5ISAcvsW6WvBUWT62MrDnzawh96QhUf3NYf9sjME7QqwHHv
 w60SDiqRlAd5UzxdIPC4Qa/6GVZZh2yFLzew3l8Fh6anxhjO5argdsfx1Wv4ADk/
 FHI8zs6EZiTlk0JmEnHNclliZfaDutBRQiPL+HZx4+FCrJweS5U/4Jpg7vdfp/tp
 eWdJ51pDM8iyqGTsP7a7/VaL5wLoJhbdD9wYgupZUhvY6g2tCZ71/hNiWdbKtCK8
 EyQxXiAlc+k1UflOx6Xip1HLIh6HmKwxntXxRy+yj4IvJ3PhI+KS5Nqdl35TszN9
 6y9MRo3oCU+2y89Yay4HZZb6DLxcAi6VwpyswnntodFQ+ICXEw7ZaNi3rSO+FCO8
 KxfhLniMD5gflRP4gy+o9iZxgVQ75nmiPgBt53r+sAKZ7lv86x84DJ/ZUqL8EV0e
 OJhxdzhoT0Ks8OstIuE87fgzUCjqMcgAavxcn1psKBC6/JY9v6OneA8qauSswkKs
 P+diJIqZHHOBQVKFedqdIrDU6AstivSBq0ToPBslbBlcy97EO4IRoiMIw+QgHPYn
 CHsPHtooBmxPyw+4HTFuzY1NIrSeUFYxTDAs9p5kMPmltkVAlLPcrpGZVya9tjds
 l/YuwY2f0C9Q1pjcAc9FcN8Y5kLRCYNEWMl0M1VpC22KgjRN6r0=
 =GrNu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "Fix a bug in chunk size decision that could lead to suboptimal
  placement and filling patterns"

* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix stripe length calculation for non-zoned data chunk allocation
2023-10-19 08:56:01 -07:00
Amir Goldstein
97ac489775 fanotify: limit reporting of event with non-decodeable file handles
Commit a95aef69a7 ("fanotify: support reporting non-decodeable file
handles") merged in v6.5-rc1, added the ability to use an fanotify group
with FAN_REPORT_FID mode to watch filesystems that do not support nfs
export, but do know how to encode non-decodeable file handles, with the
newly introduced AT_HANDLE_FID flag.

At the time that this commit was merged, there were no filesystems
in-tree with those traits.

Commit 16aac5ad1f ("ovl: support encoding non-decodable file handles"),
merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify
watching of overlayfs with FAN_REPORT_FID mode.

In retrospect, allowing an fanotify filesystem/mount mark on such
filesystem in FAN_REPORT_FID mode will result in getting events with
file handles, without the ability to resolve the filesystem objects from
those file handles (i.e. no open_by_handle_at() support).

For v6.6, the safer option would be to allow this mode for inode marks
only, where the caller has the opportunity to use name_to_handle_at() at
the time of setting the mark. In the future we can revise this decision.

Fixes: a95aef69a7 ("fanotify: support reporting non-decodeable file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
2023-10-19 16:19:20 +02:00
Olga Kornievskaia
379e4adfdd NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server
This patches fixes commit 51d674a5e4 "NFSv4.1: use
EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that
commit was to mark EXCHANGE_ID to the DS with the appropriate
flag.

However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS
and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only
remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that
to the MDS.

Instead, just mark the pnfs path exclusively.

Fixes: 51d674a5e4 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18 15:16:53 -04:00
Trond Myklebust
e1c6cfbb3b pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats
Ensure that we check the layout pointer and validity after dereferencing
it in ff_layout_mirror_prepare_stats.

Fixes: 08e2e5bc6c ("pNFS/flexfiles: Clean up layoutstats")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18 14:42:09 -04:00
Trond Myklebust
f63955721a pNFS: Fix a hang in nfs4_evict_inode()
We are not allowed to call pnfs_mark_matching_lsegs_return() without
also holding a reference to the layout header, since doing so could lead
to the reference count going to zero when we call
pnfs_layout_remove_lseg(). This again can lead to a hang when we get to
nfs4_evict_inode() and are unable to clear the layout pointer.

pnfs_layout_return_unused_byserver() is guilty of this behaviour, and
has been seen to trigger the refcount warning prior to a hang.

Fixes: b6d49ecd10 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-18 14:41:52 -04:00
Al Viro
1aee9158bc nfsd: lock_rename() needs both directories to live on the same fs
... checking that after lock_rename() is too late.  Incidentally,
NFSv2 had no nfserr_xdev...

Fixes: aa387d6ce1 "nfsd: fix EXDEV checking in rename"
Cc: stable@vger.kernel.org # v3.9+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-17 00:24:35 -04:00
Zygo Blaxell
8a540e990d btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit f6fca3917b "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size.  Commit 5da431b71d "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.

After commit f6fca3917b and 5da431b71d, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:

        1.  btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
        space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
        unmodified.  Before f6fca3917b, ctl->max_stripe_len value was
        1 GiB for non-zoned data chunks and not configurable.

        2.  btrfs_create_chunk calls gather_device_info which consumes
        and produces more fields of chunk_ctl.

        3.  gather_device_info multiplies ctl->max_stripe_len by
        ctl->dev_stripes (which is 1 in all cases except dup)
        and calls find_free_dev_extent with that number as num_bytes.

        4.  find_free_dev_extent locates the first dev_extent hole on
        a device which is at least as large as num_bytes.  With default
        max_chunk_size from f6fca3917b, it finds the first hole which is
        longer than 10 GiB, or the largest hole if that hole is shorter
        than 10 GiB.  This is different from the pre-f6fca3917b4d
        behavior, where num_bytes is 1 GiB, and find_free_dev_extent
        may choose a different hole.

        5.  gather_device_info repeats step 4 with all devices to find
        the first or largest dev_extent hole that can be allocated on
        each device.

        6.  gather_device_info sorts the device list by the hole size
        on each device, using total unallocated space on each device to
        break ties, then returns to btrfs_create_chunk with the list.

        7.  btrfs_create_chunk calls decide_stripe_size_regular.

        8.  decide_stripe_size_regular finds the largest stripe_len that
        fits across the first nr_devs device dev_extent holes that were
        found by gather_device_info (and satisfies other constraints
        on stripe_len that are not relevant here).

        9.  decide_stripe_size_regular caps the length of the stripe it
        computed at 1 GiB.  This cap appeared in 5da431b71d to correct
        one of the other regressions introduced in f6fca3917b.

        10.  btrfs_create_chunk creates a new chunk with the above
        computed size and number of devices.

At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.

Consider a filesystem using raid1 profile with 3 devices.  After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:

        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
        Device 2:  [__________] (10 GiB contig unallocated)
        Device 3:  [__________] (10 GiB contig unallocated)

Before f6fca3917b, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):

        [after 0 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [__________] (10 GiB)
        Device 3:  [__________] (10 GiB)

        [after 1 chunk allocation]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [__________] (10 GiB)

        [after 2 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [after 3 chunk allocations]
        Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
        Device 2:  [12] [12] [________] (8 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [...]

        [after 12 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 13 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 14 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)

        [after 15 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)

This allocates all of the space with no waste.  The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space.  This keeps
usable space on each device equal so they can all be filled completely.

After f6fca3917b, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:

        [after 1 chunk allocation]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [_________] (9 GiB)
        Device 3:  [23] [_________] (9 GiB)

        [after 2 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [________] (8 GiB)
        Device 3:  [23] [23] [________] (8 GiB)

        [after 3 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [_______] (7 GiB)
        Device 3:  [23] [23] [23] [_______] (7 GiB)

        [...]

        [after 9 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 10 chunk allocations]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 11 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)

No further allocations are possible, with 8 GiB wasted (4 GiB of data
space).  The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.

Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.

Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.

To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).

While researching the background of the earlier commits, I found that an
identical fix was already proposed at:

  https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/

The previous review missed one detail:  ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect.  ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.

Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-15 19:00:59 +02:00
Linus Torvalds
19fd4a91dd overlayfs fixes for 6.6-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmUrsD8ACgkQEVvVyTe/
 1WpCmg//XfJm6TdeFFuoEoYezytHXnEHnIM+kHMOqmtY1AoaW4884dvevKOCBsUF
 BXXfnIsJc1GmKBDCOxaWFXTRN9iLHTLww8dZzJFX1vDYsqo96t7TbbssyOuOy9vp
 K4uYX/pZzuggyBCUZvz/UdjZPEAwPVmsU/uXe/gkHQLS+wpOH0e7hOB7p91I0iZx
 0JdleyoDyw2AtqWLscGrTYqozW+lNMl0smADWaGfLcn29rRkR10mhq5heYa+4b2C
 TiG29rpIDgXyhz9w+Tq4f3oxJDrE1qlQGX7G8tJjISS9UDHxrZqFun0+nr15Y8Ge
 3YstSaMlarAx+UGtuEX80QcYHlYDAWnCKwkRAD8wBtKLeO0pG6BSmCCVi34CAoN+
 NBy5KQeoDV96vqoIc8lDmXWOgOkzOogJzzspOWT3H7jmonTjkYL/rYGuV1V6lpuk
 Ngopc7HxjiZeGb9Zchr1KOT6xJzjYm74/Ph8I9ECPAamgO6UxlOcsBF2uc7OZIlL
 Jzv5Hl+ITiXzxa/KQonx2nEtdO8INf5wxHjy4nlnnY0dcibGCA1wi/3/8/vqJBIf
 tuRoQJ/eHruTf23dwPRFwo4JoOViSv3Up9GFyEodllBA6DSAHK96low8eeEZAw9e
 MeyWy45dewgH1jvx3bb8Sd29CVxUXvMEl33hYngu09/cv7XulMw=
 =9LTH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs fixes from Amir Goldstein:

 - Various fixes for regressions due to conversion to new mount
   api in v6.5

 - Disable a new mount option syntax (append lowerdir) that was
   added in v6.5 because we plan to add a different lowerdir
   append syntax in v6.7

* tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: temporarily disable appending lowedirs
  ovl: fix regression in showing lowerdir mount option
  ovl: fix regression in parsing of mount options with escaped comma
  fs: factor out vfs_parse_monolithic_sep() helper
2023-10-15 08:55:51 -07:00
Linus Torvalds
9a3dad63ed Four SMB3 server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPnIACgkQiiy9cAdy
 T1FTewv/ZgF5CX5M8EpLvUENcdzJQIh2JolP2PeSbqVjf/pI/LKrkGP6fHEo8+S2
 Mnntxik3aNZ3hQ755SpGw+mT4hgr2umsDDrZxLVjvDDiLxqW9zlr55JdBS5xJvxN
 enKJ8wDbP+Usn4Gb0TfY9xrPWgHyYSn9+dYoPYuh1Z1zqmhFfIRpGSBHbwUM6Ssa
 vONpZUdnzdRnuHKyU3+xgU4Pr6KZLnriM/iGgjncCHJRCem0f27W50xsXkkpoVIg
 GIhNRe//YL1dlqbpQpXY+4+KI/3d2JRZeVpnzCJ7ucuwyjq5KNJSnTI7Jsgqnf3V
 ADe9m/HnknOG3lkQrzojxTNGLmXqlvsxUUNVtjGRccAHaOJDCsKA5Wv5L2jJahdP
 ynuXz5iwtQqaHPkfIL5D48RYiMTVemmsHdP+cdnleXkcU8GpN7cd2Gnt591Wo8by
 lcRS01pMlRfSh6SyKcDghEHbb2BDKyRSbIlvsy+85CBdKAqOpJXl/96p67UPcQjI
 /K9g3cY6
 =X4B/
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - Fix for possible double free in RPC read

 - Add additional check to clarify smb2_open path and quiet Coverity

 - Fix incorrect error rsp in a compounding path

 - Fix to properly fail open of file with pending delete on close

* tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix potential double free on smb2_read_pipe() error path
  ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()
  ksmbd: fix wrong error response status by using set_smb2_rsp_status()
  ksmbd: not allow to open file if delelete on close bit is set
2023-10-14 19:50:39 -07:00
Linus Torvalds
bf2069d151 two SMB3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPK0ACgkQiiy9cAdy
 T1FULQv+IFTddGvRcreJFFeq5pndSGNuN4gs2dUkc+wQaS80wH1TI/tmmOB8DNx7
 amXhPv4bo9inbus0ZAZLlOY6Vr7bAhQTGjXLJEVwes+yYxpxbJVmNt9aPCA6yiaP
 OAC5jKUEIdIgQkcwpSaUYkiCayB0i38R300HSFLFk2XqPOt5QRCIxrBm5Ssq7yN2
 OckglUl0p+FY54kI4kL6sfXi7tRV6rqjzofIxJJuiUqU3IUJk+G1b/pNouWLhEpi
 wQPmpdcRvV70QGinzdsuVWNQV8HEZ+zvQQkdiIqKPqEl1hMzndFYMfSNnvDcDN+W
 xMmcnIW0ei/3qjYDZ3wjnMEH/978nBrQEtUL/dpAXOmVr+U72NKQE22IlWEICSo8
 GZ7JJxcB20WHNHzSFH1+Y+7kgNTgwLd59uK27xqm1/8+qcvhf/MMnjA/ym9F4pyq
 4IEHPy7keQ5apSdBhtItJxg5etjpjujQomk1yRxrIU1JAnF0LpL/BFX85aT19QxX
 hbbqWeyd
 =ZlF9
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix caching race with open_cached_dir and laundromat cleanup of
   cached dirs (addresses a problem spotted with xfstest run with
   directory leases enabled)

 - reduce excessive resource usage of laundromat threads

* tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: prevent new fids from being removed by laundromat
  smb: client: make laundromat a delayed worker
2023-10-14 19:44:15 -07:00
Amir Goldstein
beae836e9c ovl: temporarily disable appending lowedirs
Kernel v6.5 converted overlayfs to new mount api.
As an added bonus, it also added a feature to allow appending lowerdirs
using lowerdir=:/lower2,lowerdir=::/data3 syntax.

This new syntax has raised some concerns regarding escaping of colons.
We decided to try and disable this syntax, which hasn't been in the wild
for so long and introduce it again in 6.7 using explicit mount options
lowerdir+=/lower2,datadir+=/data3.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegsr3A4YgF2YBevWa6n3=AcP7hNndG6EPMu3ncvV-AM71A@mail.gmail.com/
Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-14 22:42:45 +03:00
Linus Torvalds
70f8c6f8f8 Bug fixes for 6.6-rc6:
* Fix calculation of offset of AG's last block and its length.
 
 * Update incore AG block count when shrinking an AG.
 
 * Process free extents to busy list in FIFO order.
 
 * Make XFS report its i_version as the STATX_CHANGE_COOKIE.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZSjJjQAKCRAH7y4RirJu
 9HK3AQDLHHw5nlrVShiOtps543OOJ1uBA2PO/lxvvM3X6GPbbwEA8uYLqNPThpF2
 +a5h8y9LV6uTEDsdpSoEPBmCgXKhYQg=
 =RmDo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix calculation of offset of AG's last block and its length

 - Update incore AG block count when shrinking an AG

 - Process free extents to busy list in FIFO order

 - Make XFS report its i_version as the STATX_CHANGE_COOKIE

* tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE
  xfs: Remove duplicate include
  xfs: correct calculation for agend and blockcount
  xfs: process free extents to busy list in FIFO order
  xfs: adjust the incore perag block_count when shrinking
2023-10-14 09:09:20 -07:00
Amir Goldstein
32db510708 ovl: fix regression in showing lowerdir mount option
Before commit b36a5780cb ("ovl: modify layer parameter parsing"),
spaces and commas in lowerdir mount option value used to be escaped using
seq_show_option().

In current upstream, when lowerdir value has a space, it is not escaped
in /proc/mounts, e.g.:

  none /mnt overlay rw,relatime,lowerdir=l l,upperdir=u,workdir=w 0 0

which results in broken output of the mount utility:

  none on /mnt type overlay (rw,relatime,lowerdir=l)

Store the original lowerdir mount options before unescaping and show
them using the same escaping used for seq_show_option() in addition to
escaping the colon separator character.

Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-14 09:24:11 +03:00
Linus Torvalds
a1ef447dee Fixes for an overreaching WARN_ON, two error paths and a switch to
kernel_connect() which recently grown protection against someone using
 BPF to rewrite the address.  All but one marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmUpYoQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwvmCACK13dkAaupcHyteYPloBgtJLNixR3X
 6++nHCOXGtE7cK6n1snobFQgp/d5BqSKAeymyLqDjJOJVqG/5n8FR1gwcuY/Ogdj
 Aju2Mkt7R/R/V6kvmCbqGSwiOvxZP1gBFkKcluRNQkFNP3boKw4vmJGq29Rabbyr
 66NfZSETsR/H4JhAWvUyVmffrvxIx11THnvmrAnprGVKoK72HwOQFx0H4KLD2Hio
 aN/yiiR15yDS6hL8dfilt+QGO6o+ZWgvRl1GOzqjwISgfeUUYVknMJXrEf+20kHs
 kX3tWaLWVZsU1FyVFRO5HPYLAREjALXstFO4K1OHPERM7EipWNinIHoS
 =wyYP
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Fixes for an overreaching WARN_ON, two error paths and a switch to
  kernel_connect() which recently grown protection against someone using
  BPF to rewrite the address.

  All but one marked for stable"

* tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client:
  ceph: fix type promotion bug on 32bit systems
  libceph: use kernel_connect()
  ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()
  ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13 11:27:31 -07:00
Dan Clash
03adc61eda
audit,io_uring: io_uring openat triggers audit reference count underflow
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.

A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.

These first part of the system call performs two increments that do not race.

do_syscall_64()
  __do_sys_io_uring_enter()
    io_submit_sqes()
      io_openat_prep()
        __io_openat_prep()
          getname()
            getname_flags()       /* update 1 (increment) */
              __audit_getname()   /* update 2 (increment) */

The openat op is queued to an io_uring worker thread which starts the
opportunity for a race.  The system call exit performs one decrement.

do_syscall_64()
  syscall_exit_to_user_mode()
    syscall_exit_to_user_mode_prepare()
      __audit_syscall_exit()
        audit_reset_context()
           putname()              /* update 3 (decrement) */

The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.

io_wqe_worker()
  io_worker_handle_work()
    io_wq_submit_work()
      io_issue_sqe()
        io_openat()
          io_openat2()
            do_filp_open()
              path_openat()
                __audit_inode()   /* update 4 (increment) */
            putname()             /* update 5 (decrement) */
        __audit_uring_exit()
          audit_reset_context()
            putname()             /* update 6 (decrement) */

The fix is to change the refcnt member of struct audit_names
from int to atomic_t.

kernel BUG at fs/namei.c:262!
Call Trace:
...
 ? putname+0x68/0x70
 audit_reset_context.part.0.constprop.0+0xe1/0x300
 __audit_uring_exit+0xda/0x1c0
 io_issue_sqe+0x1f3/0x450
 ? lock_timer_base+0x3b/0xd0
 io_wq_submit_work+0x8d/0x2b0
 ? __try_to_del_timer_sync+0x67/0xa0
 io_worker_handle_work+0x17c/0x2b0
 io_wqe_worker+0x10a/0x350

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-13 18:34:46 +02:00
Namjae Jeon
1903e6d057 ksmbd: fix potential double free on smb2_read_pipe() error path
Fix new smatch warnings:
fs/smb/server/smb2pdu.c:6131 smb2_read_pipe() error: double free of 'rpc_resp'

Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 22:44:11 -05:00
Namjae Jeon
414849040f ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()
Coverity Scan report the following one. This report is a false alarm.
Because fp is never NULL when rc is zero. This patch add null check for fp
in ksmbd_update_fstate to make alarm silence.

*** CID 1568583:  Null pointer dereferences  (FORWARD_NULL)
/fs/smb/server/smb2pdu.c: 3408 in smb2_open()
3402                    path_put(&path);
3403                    path_put(&parent_path);
3404            }
3405            ksmbd_revert_fsids(work);
3406     err_out1:
3407            if (!rc) {
>>>     CID 1568583:  Null pointer dereferences  (FORWARD_NULL)
>>>     Passing null pointer "fp" to "ksmbd_update_fstate", which dereferences it.
3408                    ksmbd_update_fstate(&work->sess->file_table, fp, FP_INITED);
3409                    rc = ksmbd_iov_pin_rsp(work, (void *)rsp, iov_len);
3410            }
3411            if (rc) {
3412                    if (rc == -EINVAL)
3413                            rsp->hdr.Status = STATUS_INVALID_PARAMETER;

Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: Coverity Scan <scan-admin@coverity.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 22:44:09 -05:00
Namjae Jeon
be0f89d441 ksmbd: fix wrong error response status by using set_smb2_rsp_status()
set_smb2_rsp_status() after __process_request() sets the wrong error
status. This patch resets all iov vectors and sets the error status
on clean one.

Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 22:44:05 -05:00
Namjae Jeon
f43328357d ksmbd: not allow to open file if delelete on close bit is set
Cthon test fail with the following error.

check for proper open/unlink operation
nfsjunk files before unlink:
  -rwxr-xr-x 1 root root 0  9월 25 11:03 ./nfs2y8Jm9
./nfs2y8Jm9 open; unlink ret = 0
nfsjunk files after unlink:
  -rwxr-xr-x 1 root root 0  9월 25 11:03 ./nfs2y8Jm9
data compare ok
nfsjunk files after close:
  ls: cannot access './nfs2y8Jm9': No such file or directory
special tests failed

Cthon expect to second unlink failure when file is already unlinked.
ksmbd can not allow to open file if flags of ksmbd inode is set with
S_DEL_ON_CLS flags.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 22:43:52 -05:00
Amir Goldstein
c34706acf4 ovl: fix regression in parsing of mount options with escaped comma
Ever since commit 91c7794713 ("ovl: allow filenames with comma"), the
following example was legit overlayfs mount options:

  mount -t overlay overlay -o 'lowerdir=/tmp/a\,b/lower' /mnt

The conversion to new mount api moved to using the common helper
generic_parse_monolithic() and discarded the specialized ovl_next_opt()
option separator.

Bring back ovl_next_opt() and use vfs_parse_monolithic_sep() to fix the
regression.

Reported-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Closes: https://lore.kernel.org/r/8da307fb-9318-cf78-8a27-ba5c5a0aef6d@alum.mit.edu/
Fixes: 1784fbc2ed ("ovl: port to new mount api")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-12 18:53:37 +03:00
Amir Goldstein
e001d1447c fs: factor out vfs_parse_monolithic_sep() helper
Factor out vfs_parse_monolithic_sep() from generic_parse_monolithic(),
so filesystems could use it with a custom option separator callback.

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-12 18:53:36 +03:00
Paulo Alcantara
81ba109599 smb: client: prevent new fids from being removed by laundromat
Check if @cfid->time is set in laundromat so we guarantee that only
fully cached fids will be selected for removal.  While we're at it,
add missing locks to protect access of @cfid fields in order to avoid
races with open_cached_dir() and cfids_laundromat_worker(),
respectively.

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 09:41:32 -05:00
Paulo Alcantara
e95f3f7446 smb: client: make laundromat a delayed worker
By having laundromat kthread processing cached directories on every
second turned out to be overkill, especially when having multiple SMB
mounts.

Relax it by using a delayed worker instead that gets scheduled on
every @dir_cache_timeout (default=30) seconds per tcon.

This also fixes the 1s delay when tearing down tcon.

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-10-12 09:41:04 -05:00