IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Now that we converted cephfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable ceph_atomic_open() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.
[ aleksandr.mikhalitsyn: adapted to 5fadbd9929 ("ceph: rely on vfs for
setgid stripping") ]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable ceph_set_acl() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable __ceph_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.
[ aleksandr.mikhalitsyn: adapted to b27c82e129 ("attr: port attribute
changes to new types") ]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable ceph_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable ceph_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Enable mknod/symlink/mkdir iops to handle idmapped mounts.
This is just a matter of passing down the mount's idmapping.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This parameter is used to decide if we allow
to perform IO on idmapped mount in case when MDS lacks
support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature.
In this case we can't properly handle MDS permission
checks and if UID/GID-based restrictions are enabled
on the MDS side then IO requests which go through an
idmapped mount may fail with -EACCESS/-EPERM.
Fortunately, for most of users it's not a case and
everything should work fine. But we put work "unsafe"
in the module parameter name to warn users about
possible problems with this feature and encourage
update of cephfs MDS.
Suggested-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.
In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmaping. If this is a request message that was triggered from an inode
operation that doens't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.
This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].
[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/
Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.
If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.
In order to support idmapped mounts for cephfs we stash the idmapping
whenever they are relevant for the operation for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.
In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
These helpers are required to support idmapped mounts in CephFS.
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The mdsmap.h is only used by CephFS, so move it to fs/ceph.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.
At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.
Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().
Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We will use the 'mdsc' to get the global_id in the following commits.
Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZTxSwQAKCRBZ7Krx/gZQ
6zadAP9o/724KPDCY3ybgwKyEQ1UNjHTriFRBeoF3o2q0WgidwEA+/xS0Xk3i25w
xnSZO/8My1edE1IcK/JDwewH/J+4Kw0=
=N/Lv
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem fixes from Al Viro:
"Assorted fixes all over the place: literally nothing in common, could
have been three separate pull requests.
All are simple regression fixes, but not for anything from this cycle"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
Use of dget() after we'd dropped ->d_lock is too late - dentry might
be gone by that point.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZS4N8wAKCRBZ7Krx/gZQ
65q9AQDhucfo26czFALs6aOceZ1K+FUu3OzgU0gbQaCCLhuubwD/Uu3GXL2KrVaj
uMk7Wv6a68/j1VXwtNMpSb0MV09j/wM=
=xKoB
-----END PGP SIGNATURE-----
Merge tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nfsd fix from Al Viro:
"Catch from lock_rename() audit; nfsd_rename() checked that both
directories belonged to the same filesystem, but only after having
done lock_rename().
Trivial fix, tested and acked by nfs folks"
* tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: lock_rename() needs both directories to live on the same fs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU2lLEACgkQxWXV+ddt
WDvCThAApe+zMNdEhQ/cgrvfzP/X91Q53PXQsdVsrujPyUV8eEV4oJzEwVbJhRdw
3ukIQtvyAMNiWhEBhOQRwxjuUoTCApGAeEEEl1cWWEqQ7G2/2LS4+bcWzgQ3Vu32
dzYL37ddsfe4n7OgfnymtMrnv7kge0XbAlY3GbavaDccZDQDqcD5wSAOyOhfIsH7
kcu4sA5Fi44wVSfAJX1Dms+wXfsmQu/sd3c9Gcyce9Hpy1cEW3vWbApLBE4K0aKX
/JHTdmkAJ20a4APQsfGH+UymyuZgr8d2eGmL9rVYKhT/c+Dow0lNAWYkvGf/MawM
CX3GdP6f6ZOR/anCPZ8nqZCE5AoFykGazvpCCSrvCOpU7o7GqxbAQkWWFcMp1FHW
9TFrj81WK18DeCfCNw7lR3sdMy/2o2nnSUAw3DFY4n/3Lek7FUmrBTHvXlWDot7T
TM9CzYGF840QhL5s5SMYS09YmeI0I34L7HJAi/+qli48SooGuL9RZ29TmzHIX69Y
2bgpS64j06p/AGEnfHAcT1LbpiFCPmO5cpXKv/t40GL5QO5d4WV698ysDGoPYUPO
8CPL85Y8cao56KGJLyOroGz0P1bo+RdNe5bN6xJJoTRn1Y9oUA+bQSnN8x9iuunF
9QZrAIHzNyDcRGzoqgDW+3bivOvIus/Dto/u1P3ap68kP2HTVsY=
=gOyi
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix for a problem with snapshot of a newly created subvolume
that can lead to inconsistent data under some circumstances. Kernel
6.5 added a performance optimization to skip transaction commit for
subvolume creation but this could end up with newer data on disk but
not linked to other structures.
The fix itself is an added condition, the rest of the patch is a
parameter added to several functions"
* tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:
1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
address 36007936, for the leaf/root of a new subvolume that has an ID
of 291. We mark the extent buffer as dirty, and at this point the
subvolume tree has a single node/leaf which is also its root (level 0);
2) We no longer commit the transaction used to create the subvolume at
create_subvol(). We used to, but that was recently removed in
commit 1b53e51a4a ("btrfs: don't commit transaction for every subvol
create");
3) The transaction used to create the subvolume has an ID of 33, so the
extent buffer 36007936 has a generation of 33;
4) Several updates happen to subvolume 291 during transaction 33, several
files created and its tree height changes from 0 to 1, so we end up with
a new root at level 1 and the extent buffer 36007936 is now a leaf of
that new root node, which is extent buffer 36048896.
The commit root remains as 36007936, since we are still at transaction
33;
5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
we end up at transaction.c:create_pending_snapshot(), in the critical
section of a transaction commit.
There we COW the root of subvolume 291, which is extent buffer 36048896.
The COW operation returns extent buffer 36048896, since there's no need
to COW because the extent buffer was created in this transaction and it
was not written yet.
The we call btrfs_copy_root() against the root node 36048896. During
this operation we allocate a new extent buffer to turn into the root
node of the snapshot, copy the contents of the root node 36048896 into
this snapshot root extent buffer, set the owner to 292 (the ID of the
snapshot), etc, and then we call btrfs_inc_ref(). This will create a
delayed reference for each leaf pointed by the root node with a
reference root of 292 - this includes a reference for the leaf
36007936.
After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.
Then we call btrfs_insert_dir_item(), to create the directory entry in
in the tree of subvolume 291 that points to the snapshot. This ends up
needing to modify leaf 36007936 to insert the respective directory
items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
we need to COW the leaf. We end up at btrfs_force_cow_block() and then
at update_ref_for_cow().
At update_ref_for_cow() we call btrfs_block_can_be_shared() which
returns false, despite the fact the leaf 36007936 is shared - the
subvolume's root and the snapshot's root point to that leaf. The
reason that it incorrectly returns false is because the commit root
of the subvolume is extent buffer 36007936 - it was the initial root
of the subvolume when we created it. So btrfs_block_can_be_shared()
which has the following logic:
int btrfs_block_can_be_shared(struct btrfs_root *root,
struct extent_buffer *buf)
{
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
buf != root->node && buf != root->commit_root &&
(btrfs_header_generation(buf) <=
btrfs_root_last_snapshot(&root->root_item) ||
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
return 1;
return 0;
}
Returns false (0) since 'buf' (extent buffer 36007936) matches the
root's commit root.
As a result, at update_ref_for_cow(), we don't check for the number
of references for extent buffer 36007936, we just assume it's not
shared and therefore that it has only 1 reference, so we set the local
variable 'refs' to 1.
Later on, in the final if-else statement at update_ref_for_cow():
static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf,
struct extent_buffer *cow,
int *last_ref)
{
(...)
if (refs > 1) {
(...)
} else {
(...)
btrfs_clear_buffer_dirty(trans, buf);
*last_ref = 1;
}
}
So we mark the extent buffer 36007936 as not dirty, and as a result
we don't write it to disk later in the transaction commit, despite the
fact that the snapshot's root points to it.
Attempting to access the leaf or dumping the tree for example shows
that the extent buffer was not written:
$ btrfs inspect-internal dump-tree -t 292 /dev/sdb
btrfs-progs v6.2.2
file tree key (292 ROOT_ITEM 33)
node 36110336 level 1 items 2 free space 119 generation 33 owner 292
node 36110336 flags 0x1(WRITTEN) backref revision 1
checksum stored a8103e3e
checksum calced a8103e3e
fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
key (256 INODE_ITEM 0) block 36007936 gen 33
key (257 EXTENT_DATA 0) block 36052992 gen 33
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
total bytes 107374182400
bytes used 38572032
uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
The respective on disk region is full of zeroes as the device was
trimmed at mkfs time.
Obviously 'btrfs check' also detects and complains about this:
$ btrfs check /dev/sdb
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
generation: 33 (33)
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
owner ref check failed [36007936 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
The following tree block(s) is corrupted in tree 292:
tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
root 292 root dir 256 not found
ERROR: errors found in fs roots
found 38572032 bytes used, error(s) found
total csum bytes: 16048
total tree bytes: 1265664
total fs tree bytes: 1118208
total extent tree bytes: 65536
btree space waste bytes: 562598
file data blocks allocated: 65978368
referenced 36569088
Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.
This can be reproduced with the following script:
$ cat test.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
# Use a filesystem with a 64K node size so that we have the same node
# size on every machine regardless of its page size (on x86_64 default
# node size is 16K due to the 4K page size, while on PPC it's 64K by
# default). This way we can make sure we are able to create a btree for
# the subvolume with a height of 2.
mkfs.btrfs -f -n 64K $DEV
mount $DEV $MNT
btrfs subvolume create $MNT/subvol
# Create a few empty files on the subvolume, this bumps its btree
# height to 2 (root node at level 1 and 2 leaves).
for ((i = 1; i <= 300; i++)); do
echo -n > $MNT/subvol/file_$i
done
btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap
umount $DEV
btrfs check $DEV
Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):
$ ./test.sh
Create subvolume '/mnt/sdi/subvol'
Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
owner ref check failed [30539776 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
Wrong generation of child node/leaf, wanted: 5, have: 7
root 257 root dir 256 not found
ERROR: errors found in fs roots
found 917504 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 851968
total fs tree bytes: 393216
total extent tree bytes: 65536
btree space waste bytes: 736550
file data blocks allocated: 0
referenced 0
A test case for fstests will follow soon.
Fixes: 1b53e51a4a ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* Fix a bug where a writev consisting of a bunch of sub-fsblock writes
where the last buffer address is invalid could lead to an infinite
loop.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZTFcYAAKCRBKO3ySh0YR
pg6NAP4zKkrjdzOKyhu+NJDu4lnB4+8eq0PPnW5r5fw4155wTwEA1PKWp5vtKYRk
A3QLraLNS0HxpsguHpfD8KFrPN6LPA8=
=93ew
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
- Fix a bug where a writev consisting of a bunch of sub-fsblock writes
where the last buffer address is invalid could lead to an infinite
loop
* tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix short copy in iomap_write_iter()
Stable Fix:
* Fix a pNFS hang in nfs4_evict_inode()
Bugfixes:
* Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
* Fix a potential oops in nfs_inode_remove_request()
* Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
* Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmUy4cAACgkQ18tUv7Cl
QOstABAAtK5pXSbzXXQY1IIVIxgIT9tGRMPwd+IVUwL0qRIhTqGkiax6UuZVH9id
NxrTQFrrxC4np2X4atYdC4LggOqx6wcdNLQ9oJ56zULHUtycgjqCSjFrKbA3nFa6
tn4EccPvKzjsmsbJW1rCq97XIRqZY1XotjDA3MR72YSnsOncdoLr6ng6virjOxWe
t8K8WS4tScD0kwvclQLpsEDSFEEdjTYORdS3RVXVz4crM45gikmTI7bSyZ4vc0lT
TafhFYMjcHUNyKF8MP0DguYSGTtHxCHuM/0yYZw0ZiqLdZeT7Pk/GVecV6qmMD9W
REkidcFLV53RR4qjUvL0HuM71/cX+yfwGaj6e8nZg5b7s1pJB46IG58ZmQaPwyzP
n9/zxZHIRrNSkj6e+cwFpgpz7v7wjTOsXVUqBLE9q0Rsm71L8TuzW0G/lghY8tE+
TZ4rMVlZnypFybELmEgoRXRZa0G41SeQmKYNMv+wJIwpNyxbKkI3FgpAx0A4RuyP
prYK4S55GxqKiOrP22USHRpvkhxdzeIkmnPGuy/A04VYaVJwANjUI0w0SAgg1/Oj
iNLjvO80wIL7d3fRLpjcjiQ2DtVt63WcnMgR0dV138959vh9CKAF7u7xVdW4Zsk7
cO5/WAFR5cUh2veswsi69+oivXweP57pi0XWmiJamKKckS/PPuQ=
=Z3lY
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Stable Fix:
- Fix a pNFS hang in nfs4_evict_inode()
Fixes:
- Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
- Fix a potential oops in nfs_inode_remove_request()
- Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
- Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases"
* tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server
pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats
pNFS: Fix a hang in nfs4_evict_inode()
NFS: Fix potential oops in nfs_inode_remove_request()
nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmUyw0wACgkQnJ2qBz9k
QNknFgf/TMbqrnLyro0JUY6w4b9mgXTGFqqnaSuopOGK31vlpfiNR6mYw+vGv82E
FcN4CSQFBIK8v7DU4pgCzbD1OxGIdJuWz7tjnI4ntr/jMM+3pqqKYhu+VkKInrBB
HEIMe/WjM0/LbX/wid6xdT3Bcz6lbAySXJFVtxU45umkuv8RGODCAr6Gf1jX3q7m
LsIv8ESCmau5hyesp1Te4N8bv7dK8x3FPpaX12BB8DkuRlaqmzwHXc0ExpMRhII8
LBllG2rUIu2GNx8AqWULw9LyBsNaZSeAF2iUl5taXaDXw8Js8eQzH/Y+wS5KaJNa
M7kszLlAByav/MSuUWWJHOqwgMhhDQ==
=bOXL
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify fix from Jan Kara:
"Disable superblock / mount marks for filesystems that can encode file
handles but not open them (currently only overlayfs).
It is not clear the functionality is useful in any way so let's better
disable it before someone comes up with some creative misuse"
* tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: limit reporting of event with non-decodeable file handles
Starting with commit 5d8edfb900 ("iomap: Copy larger chunks from
userspace"), iomap_write_iter() can get into endless loop. This can
be reproduced with LTP writev07 which uses partially valid iovecs:
struct iovec wr_iovec[] = {
{ buffer, 64 },
{ bad_addr, 64 },
{ buffer + 64, 64 },
{ buffer + 64 * 2, 64 },
};
commit bc1bb416bb ("generic_perform_write()/iomap_write_actor():
saner logics for short copy") previously introduced the logic, which
made short copy retry in next iteration with amount of "bytes" it
managed to copy:
if (unlikely(status == 0)) {
/*
* A short copy made iomap_write_end() reject the
* thing entirely. Might be memory poisoning
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
if (copied)
bytes = copied;
However, since 5d8edfb900 "bytes" is no longer carried into next
iteration, because it is now always initialized at the beginning of
the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with
same value as previous iteration, making the loop retry same copy
over and over, which leads to writev07 testcase hanging.
Make next iteration retry with amount of bytes we managed to copy.
Fixes: 5d8edfb900 ("iomap: Copy larger chunks from userspace")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
=Dqrj
-----END PGP SIGNATURE-----
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmUwyyAACgkQqbAzH4Mk
B7aQsg/9FohvuET2YOqYlmtgbq1YjFlWkVxoNB2Be4gxKEqRcFN4V/O8PrOD0ZBX
KH8ASWGp9EKQlVKucAPUhSaZMv4u2HaKTVyRnH5KpUmNG7rC7EzRgYDY4xA5RfjL
aIxgVOhaSibFBGfys2KoS8MHFgZ9mcCWZZ1154ueSkZfX9ghytZcql5i4Ahi2OGQ
u+aa/tkhoIS9ZKWp/uIz4R+fDClrG8UoiWmJ1LoOHQ5YM3enboOld6Zz1CjbC9yN
ldInX7klL9aHRf3SoAeRVCyL23e4GB5R4+QmBhZQemhwy0jF0/7E0kCmCK0wYFWn
3t1M/+i0S7o7rI+utTA2mTuZzxnNPMXJsn2TnpH9vv1o633WOlL0mrvsM7Z/NJDz
VShPinQ9eJGvZQ9kYuMfZODtK0BP7OV8mRi7aXinzHucG+ISB2fBGxsnA0lX9EfT
fHix1oNI33lfazVzaStJ3hs/n9KXwP/9P49rBcLh75O2t3TBKc2T84qpywNx+W6S
9GDMV5V2FNWf/hMCL8pcTtab3MvpbeU5DqKnxW1D1vu2vdGiHVum/Aq26aE5W7nl
EcE5pjKppdNZgk3Qwj190vym7d8lbK4Esua6ieNM/qvWb7hAGoZO7Tjcypvj+YMY
W4aiJTW/jULPT2tkUvLvHg/r5DFHi30v6eCeTCrPfivxFRIAzBQ=
=ScrJ
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:
- memory leak
- some logic errors, NULL dereferences
- some code was refactored
- more sanity checks
* tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: Avoid possible memory leak
fs/ntfs3: Fix directory element type detection
fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e()
fs/ntfs3: Fix OOB read in ntfs_init_from_boot
fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea()
fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame()
fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr()
fs/ntfs3: Do not allow to change label if volume is read-only
fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo
fs/ntfs3: Refactoring and comments
fs/ntfs3: Fix alternative boot searching
fs/ntfs3: Allow repeated call to ntfs3_put_sbi
fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime
fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super
fs/ntfs3: fix deadlock in mark_as_free_ex
fs/ntfs3: Add more attributes checks in mi_enum_attr()
fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN)
fs/ntfs3: Write immediately updated ntfs state
fs/ntfs3: Add ckeck in ni_update_parent()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUuntEACgkQxWXV+ddt
WDssdQ/9Fo6tN+MCH5ISAcvsW6WvBUWT62MrDnzawh96QhUf3NYf9sjME7QqwHHv
w60SDiqRlAd5UzxdIPC4Qa/6GVZZh2yFLzew3l8Fh6anxhjO5argdsfx1Wv4ADk/
FHI8zs6EZiTlk0JmEnHNclliZfaDutBRQiPL+HZx4+FCrJweS5U/4Jpg7vdfp/tp
eWdJ51pDM8iyqGTsP7a7/VaL5wLoJhbdD9wYgupZUhvY6g2tCZ71/hNiWdbKtCK8
EyQxXiAlc+k1UflOx6Xip1HLIh6HmKwxntXxRy+yj4IvJ3PhI+KS5Nqdl35TszN9
6y9MRo3oCU+2y89Yay4HZZb6DLxcAi6VwpyswnntodFQ+ICXEw7ZaNi3rSO+FCO8
KxfhLniMD5gflRP4gy+o9iZxgVQ75nmiPgBt53r+sAKZ7lv86x84DJ/ZUqL8EV0e
OJhxdzhoT0Ks8OstIuE87fgzUCjqMcgAavxcn1psKBC6/JY9v6OneA8qauSswkKs
P+diJIqZHHOBQVKFedqdIrDU6AstivSBq0ToPBslbBlcy97EO4IRoiMIw+QgHPYn
CHsPHtooBmxPyw+4HTFuzY1NIrSeUFYxTDAs9p5kMPmltkVAlLPcrpGZVya9tjds
l/YuwY2f0C9Q1pjcAc9FcN8Y5kLRCYNEWMl0M1VpC22KgjRN6r0=
=GrNu
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix a bug in chunk size decision that could lead to suboptimal
placement and filling patterns"
* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit a95aef69a7 ("fanotify: support reporting non-decodeable file
handles") merged in v6.5-rc1, added the ability to use an fanotify group
with FAN_REPORT_FID mode to watch filesystems that do not support nfs
export, but do know how to encode non-decodeable file handles, with the
newly introduced AT_HANDLE_FID flag.
At the time that this commit was merged, there were no filesystems
in-tree with those traits.
Commit 16aac5ad1f ("ovl: support encoding non-decodable file handles"),
merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify
watching of overlayfs with FAN_REPORT_FID mode.
In retrospect, allowing an fanotify filesystem/mount mark on such
filesystem in FAN_REPORT_FID mode will result in getting events with
file handles, without the ability to resolve the filesystem objects from
those file handles (i.e. no open_by_handle_at() support).
For v6.6, the safer option would be to allow this mode for inode marks
only, where the caller has the opportunity to use name_to_handle_at() at
the time of setting the mark. In the future we can revise this decision.
Fixes: a95aef69a7 ("fanotify: support reporting non-decodeable file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
This patches fixes commit 51d674a5e4 "NFSv4.1: use
EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that
commit was to mark EXCHANGE_ID to the DS with the appropriate
flag.
However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS
and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only
remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that
to the MDS.
Instead, just mark the pnfs path exclusively.
Fixes: 51d674a5e4 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we check the layout pointer and validity after dereferencing
it in ff_layout_mirror_prepare_stats.
Fixes: 08e2e5bc6c ("pNFS/flexfiles: Clean up layoutstats")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We are not allowed to call pnfs_mark_matching_lsegs_return() without
also holding a reference to the layout header, since doing so could lead
to the reference count going to zero when we call
pnfs_layout_remove_lseg(). This again can lead to a hang when we get to
nfs4_evict_inode() and are unable to clear the layout pointer.
pnfs_layout_return_unused_byserver() is guilty of this behaviour, and
has been seen to trigger the refcount warning prior to a hang.
Fixes: b6d49ecd10 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
... checking that after lock_rename() is too late. Incidentally,
NFSv2 had no nfserr_xdev...
Fixes: aa387d6ce1 "nfsd: fix EXDEV checking in rename"
Cc: stable@vger.kernel.org # v3.9+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit f6fca3917b "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size. Commit 5da431b71d "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.
After commit f6fca3917b and 5da431b71d, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:
1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
unmodified. Before f6fca3917b, ctl->max_stripe_len value was
1 GiB for non-zoned data chunks and not configurable.
2. btrfs_create_chunk calls gather_device_info which consumes
and produces more fields of chunk_ctl.
3. gather_device_info multiplies ctl->max_stripe_len by
ctl->dev_stripes (which is 1 in all cases except dup)
and calls find_free_dev_extent with that number as num_bytes.
4. find_free_dev_extent locates the first dev_extent hole on
a device which is at least as large as num_bytes. With default
max_chunk_size from f6fca3917b, it finds the first hole which is
longer than 10 GiB, or the largest hole if that hole is shorter
than 10 GiB. This is different from the pre-f6fca3917b4d
behavior, where num_bytes is 1 GiB, and find_free_dev_extent
may choose a different hole.
5. gather_device_info repeats step 4 with all devices to find
the first or largest dev_extent hole that can be allocated on
each device.
6. gather_device_info sorts the device list by the hole size
on each device, using total unallocated space on each device to
break ties, then returns to btrfs_create_chunk with the list.
7. btrfs_create_chunk calls decide_stripe_size_regular.
8. decide_stripe_size_regular finds the largest stripe_len that
fits across the first nr_devs device dev_extent holes that were
found by gather_device_info (and satisfies other constraints
on stripe_len that are not relevant here).
9. decide_stripe_size_regular caps the length of the stripe it
computed at 1 GiB. This cap appeared in 5da431b71d to correct
one of the other regressions introduced in f6fca3917b.
10. btrfs_create_chunk creates a new chunk with the above
computed size and number of devices.
At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.
Consider a filesystem using raid1 profile with 3 devices. After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
Device 2: [__________] (10 GiB contig unallocated)
Device 3: [__________] (10 GiB contig unallocated)
Before f6fca3917b, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):
[after 0 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [__________] (10 GiB)
Device 3: [__________] (10 GiB)
[after 1 chunk allocation]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
Device 2: [12] [_________] (9 GiB)
Device 3: [__________] (10 GiB)
[after 2 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [12] [_________] (9 GiB)
Device 3: [13] [_________] (9 GiB)
[after 3 chunk allocations]
Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
Device 2: [12] [12] [________] (8 GiB)
Device 3: [13] [_________] (9 GiB)
[...]
[after 12 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 13 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 14 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
[after 15 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
This allocates all of the space with no waste. The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space. This keeps
usable space on each device equal so they can all be filled completely.
After f6fca3917b, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:
[after 1 chunk allocation]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [_________] (9 GiB)
Device 3: [23] [_________] (9 GiB)
[after 2 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [________] (8 GiB)
Device 3: [23] [23] [________] (8 GiB)
[after 3 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [_______] (7 GiB)
Device 3: [23] [23] [23] [_______] (7 GiB)
[...]
[after 9 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 10 chunk allocations]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 11 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
No further allocations are possible, with 8 GiB wasted (4 GiB of data
space). The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.
Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.
Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.
To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
While researching the background of the earlier commits, I found that an
identical fix was already proposed at:
https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
The previous review missed one detail: ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect. ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.
Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmUrsD8ACgkQEVvVyTe/
1WpCmg//XfJm6TdeFFuoEoYezytHXnEHnIM+kHMOqmtY1AoaW4884dvevKOCBsUF
BXXfnIsJc1GmKBDCOxaWFXTRN9iLHTLww8dZzJFX1vDYsqo96t7TbbssyOuOy9vp
K4uYX/pZzuggyBCUZvz/UdjZPEAwPVmsU/uXe/gkHQLS+wpOH0e7hOB7p91I0iZx
0JdleyoDyw2AtqWLscGrTYqozW+lNMl0smADWaGfLcn29rRkR10mhq5heYa+4b2C
TiG29rpIDgXyhz9w+Tq4f3oxJDrE1qlQGX7G8tJjISS9UDHxrZqFun0+nr15Y8Ge
3YstSaMlarAx+UGtuEX80QcYHlYDAWnCKwkRAD8wBtKLeO0pG6BSmCCVi34CAoN+
NBy5KQeoDV96vqoIc8lDmXWOgOkzOogJzzspOWT3H7jmonTjkYL/rYGuV1V6lpuk
Ngopc7HxjiZeGb9Zchr1KOT6xJzjYm74/Ph8I9ECPAamgO6UxlOcsBF2uc7OZIlL
Jzv5Hl+ITiXzxa/KQonx2nEtdO8INf5wxHjy4nlnnY0dcibGCA1wi/3/8/vqJBIf
tuRoQJ/eHruTf23dwPRFwo4JoOViSv3Up9GFyEodllBA6DSAHK96low8eeEZAw9e
MeyWy45dewgH1jvx3bb8Sd29CVxUXvMEl33hYngu09/cv7XulMw=
=9LTH
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs fixes from Amir Goldstein:
- Various fixes for regressions due to conversion to new mount
api in v6.5
- Disable a new mount option syntax (append lowerdir) that was
added in v6.5 because we plan to add a different lowerdir
append syntax in v6.7
* tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: temporarily disable appending lowedirs
ovl: fix regression in showing lowerdir mount option
ovl: fix regression in parsing of mount options with escaped comma
fs: factor out vfs_parse_monolithic_sep() helper
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPnIACgkQiiy9cAdy
T1FTewv/ZgF5CX5M8EpLvUENcdzJQIh2JolP2PeSbqVjf/pI/LKrkGP6fHEo8+S2
Mnntxik3aNZ3hQ755SpGw+mT4hgr2umsDDrZxLVjvDDiLxqW9zlr55JdBS5xJvxN
enKJ8wDbP+Usn4Gb0TfY9xrPWgHyYSn9+dYoPYuh1Z1zqmhFfIRpGSBHbwUM6Ssa
vONpZUdnzdRnuHKyU3+xgU4Pr6KZLnriM/iGgjncCHJRCem0f27W50xsXkkpoVIg
GIhNRe//YL1dlqbpQpXY+4+KI/3d2JRZeVpnzCJ7ucuwyjq5KNJSnTI7Jsgqnf3V
ADe9m/HnknOG3lkQrzojxTNGLmXqlvsxUUNVtjGRccAHaOJDCsKA5Wv5L2jJahdP
ynuXz5iwtQqaHPkfIL5D48RYiMTVemmsHdP+cdnleXkcU8GpN7cd2Gnt591Wo8by
lcRS01pMlRfSh6SyKcDghEHbb2BDKyRSbIlvsy+85CBdKAqOpJXl/96p67UPcQjI
/K9g3cY6
=X4B/
-----END PGP SIGNATURE-----
Merge tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix for possible double free in RPC read
- Add additional check to clarify smb2_open path and quiet Coverity
- Fix incorrect error rsp in a compounding path
- Fix to properly fail open of file with pending delete on close
* tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix potential double free on smb2_read_pipe() error path
ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()
ksmbd: fix wrong error response status by using set_smb2_rsp_status()
ksmbd: not allow to open file if delelete on close bit is set
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPK0ACgkQiiy9cAdy
T1FULQv+IFTddGvRcreJFFeq5pndSGNuN4gs2dUkc+wQaS80wH1TI/tmmOB8DNx7
amXhPv4bo9inbus0ZAZLlOY6Vr7bAhQTGjXLJEVwes+yYxpxbJVmNt9aPCA6yiaP
OAC5jKUEIdIgQkcwpSaUYkiCayB0i38R300HSFLFk2XqPOt5QRCIxrBm5Ssq7yN2
OckglUl0p+FY54kI4kL6sfXi7tRV6rqjzofIxJJuiUqU3IUJk+G1b/pNouWLhEpi
wQPmpdcRvV70QGinzdsuVWNQV8HEZ+zvQQkdiIqKPqEl1hMzndFYMfSNnvDcDN+W
xMmcnIW0ei/3qjYDZ3wjnMEH/978nBrQEtUL/dpAXOmVr+U72NKQE22IlWEICSo8
GZ7JJxcB20WHNHzSFH1+Y+7kgNTgwLd59uK27xqm1/8+qcvhf/MMnjA/ym9F4pyq
4IEHPy7keQ5apSdBhtItJxg5etjpjujQomk1yRxrIU1JAnF0LpL/BFX85aT19QxX
hbbqWeyd
=ZlF9
-----END PGP SIGNATURE-----
Merge tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix caching race with open_cached_dir and laundromat cleanup of
cached dirs (addresses a problem spotted with xfstest run with
directory leases enabled)
- reduce excessive resource usage of laundromat threads
* tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: prevent new fids from being removed by laundromat
smb: client: make laundromat a delayed worker
Kernel v6.5 converted overlayfs to new mount api.
As an added bonus, it also added a feature to allow appending lowerdirs
using lowerdir=:/lower2,lowerdir=::/data3 syntax.
This new syntax has raised some concerns regarding escaping of colons.
We decided to try and disable this syntax, which hasn't been in the wild
for so long and introduce it again in 6.7 using explicit mount options
lowerdir+=/lower2,datadir+=/data3.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegsr3A4YgF2YBevWa6n3=AcP7hNndG6EPMu3ncvV-AM71A@mail.gmail.com/
Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
* Fix calculation of offset of AG's last block and its length.
* Update incore AG block count when shrinking an AG.
* Process free extents to busy list in FIFO order.
* Make XFS report its i_version as the STATX_CHANGE_COOKIE.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZSjJjQAKCRAH7y4RirJu
9HK3AQDLHHw5nlrVShiOtps543OOJ1uBA2PO/lxvvM3X6GPbbwEA8uYLqNPThpF2
+a5h8y9LV6uTEDsdpSoEPBmCgXKhYQg=
=RmDo
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix calculation of offset of AG's last block and its length
- Update incore AG block count when shrinking an AG
- Process free extents to busy list in FIFO order
- Make XFS report its i_version as the STATX_CHANGE_COOKIE
* tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE
xfs: Remove duplicate include
xfs: correct calculation for agend and blockcount
xfs: process free extents to busy list in FIFO order
xfs: adjust the incore perag block_count when shrinking
Before commit b36a5780cb ("ovl: modify layer parameter parsing"),
spaces and commas in lowerdir mount option value used to be escaped using
seq_show_option().
In current upstream, when lowerdir value has a space, it is not escaped
in /proc/mounts, e.g.:
none /mnt overlay rw,relatime,lowerdir=l l,upperdir=u,workdir=w 0 0
which results in broken output of the mount utility:
none on /mnt type overlay (rw,relatime,lowerdir=l)
Store the original lowerdir mount options before unescaping and show
them using the same escaping used for seq_show_option() in addition to
escaping the colon separator character.
Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
kernel_connect() which recently grown protection against someone using
BPF to rewrite the address. All but one marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmUpYoQTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwvmCACK13dkAaupcHyteYPloBgtJLNixR3X
6++nHCOXGtE7cK6n1snobFQgp/d5BqSKAeymyLqDjJOJVqG/5n8FR1gwcuY/Ogdj
Aju2Mkt7R/R/V6kvmCbqGSwiOvxZP1gBFkKcluRNQkFNP3boKw4vmJGq29Rabbyr
66NfZSETsR/H4JhAWvUyVmffrvxIx11THnvmrAnprGVKoK72HwOQFx0H4KLD2Hio
aN/yiiR15yDS6hL8dfilt+QGO6o+ZWgvRl1GOzqjwISgfeUUYVknMJXrEf+20kHs
kX3tWaLWVZsU1FyVFRO5HPYLAREjALXstFO4K1OHPERM7EipWNinIHoS
=wyYP
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Fixes for an overreaching WARN_ON, two error paths and a switch to
kernel_connect() which recently grown protection against someone using
BPF to rewrite the address.
All but one marked for stable"
* tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client:
ceph: fix type promotion bug on 32bit systems
libceph: use kernel_connect()
ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()
ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fix new smatch warnings:
fs/smb/server/smb2pdu.c:6131 smb2_read_pipe() error: double free of 'rpc_resp'
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Coverity Scan report the following one. This report is a false alarm.
Because fp is never NULL when rc is zero. This patch add null check for fp
in ksmbd_update_fstate to make alarm silence.
*** CID 1568583: Null pointer dereferences (FORWARD_NULL)
/fs/smb/server/smb2pdu.c: 3408 in smb2_open()
3402 path_put(&path);
3403 path_put(&parent_path);
3404 }
3405 ksmbd_revert_fsids(work);
3406 err_out1:
3407 if (!rc) {
>>> CID 1568583: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "fp" to "ksmbd_update_fstate", which dereferences it.
3408 ksmbd_update_fstate(&work->sess->file_table, fp, FP_INITED);
3409 rc = ksmbd_iov_pin_rsp(work, (void *)rsp, iov_len);
3410 }
3411 if (rc) {
3412 if (rc == -EINVAL)
3413 rsp->hdr.Status = STATUS_INVALID_PARAMETER;
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: Coverity Scan <scan-admin@coverity.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
set_smb2_rsp_status() after __process_request() sets the wrong error
status. This patch resets all iov vectors and sets the error status
on clean one.
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Cthon test fail with the following error.
check for proper open/unlink operation
nfsjunk files before unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
./nfs2y8Jm9 open; unlink ret = 0
nfsjunk files after unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
data compare ok
nfsjunk files after close:
ls: cannot access './nfs2y8Jm9': No such file or directory
special tests failed
Cthon expect to second unlink failure when file is already unlinked.
ksmbd can not allow to open file if flags of ksmbd inode is set with
S_DEL_ON_CLS flags.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ever since commit 91c7794713 ("ovl: allow filenames with comma"), the
following example was legit overlayfs mount options:
mount -t overlay overlay -o 'lowerdir=/tmp/a\,b/lower' /mnt
The conversion to new mount api moved to using the common helper
generic_parse_monolithic() and discarded the specialized ovl_next_opt()
option separator.
Bring back ovl_next_opt() and use vfs_parse_monolithic_sep() to fix the
regression.
Reported-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Closes: https://lore.kernel.org/r/8da307fb-9318-cf78-8a27-ba5c5a0aef6d@alum.mit.edu/
Fixes: 1784fbc2ed ("ovl: port to new mount api")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Factor out vfs_parse_monolithic_sep() from generic_parse_monolithic(),
so filesystems could use it with a custom option separator callback.
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Check if @cfid->time is set in laundromat so we guarantee that only
fully cached fids will be selected for removal. While we're at it,
add missing locks to protect access of @cfid fields in order to avoid
races with open_cached_dir() and cfids_laundromat_worker(),
respectively.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
By having laundromat kthread processing cached directories on every
second turned out to be overkill, especially when having multiple SMB
mounts.
Relax it by using a delayed worker instead that gets scheduled on
every @dir_cache_timeout (default=30) seconds per tcon.
This also fixes the 1s delay when tearing down tcon.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>