IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
I noticed that once an NFSv4.1 callback operation gets a
NFS4ERR_DELAY status on CB_SEQUENCE and then the connection is lost,
the callback client loops, resending it indefinitely.
The switch arm in nfsd4_cb_sequence_done() that handles
NFS4ERR_DELAY uses rpc_restart_call() to rearm the RPC state machine
for the retransmit, but that path does not call the rpc_prepare_call
callback again. Thus cb_seq_status is set to -10008 by the first
NFS4ERR_DELAY result, but is never set back to 1 for the retransmits.
nfsd4_cb_sequence_done() thinks it's getting nothing but a
long series of CB_SEQUENCE NFS4ERR_DELAY replies.
Fixes: 7ba6cad6c8 ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The final bit of stats that is global is the rpc svc_stat. Move this
into the nfsd_net struct and use that everywhere instead of the global
struct. Remove the unused global struct.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is the last global stat, take it out of the nfsd_stats struct and
make it a global part of nfsd, report it the same as always.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We have a global set of counters that we modify for all of the nfsd
operations, but now that we're exposing these stats across all network
namespaces we need to make the stats also be per-network namespace. We
already have some caching stats that are per-network namespace, so move
these definitions into the same counter and then adjust all the helpers
and users of these stats to provide the appropriate nfsd_net struct so
that the stats are maintained for the per-network namespace objects.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We are running nfsd servers inside of containers with their own network
namespace, and we want to monitor these services using the stats found
in /proc. However these are not exposed in the proc inside of the
container, so we have to bind mount the host /proc into our containers
to get at this information.
Separate out the stat counters init and the proc registration, and move
the proc registration into the pernet operations entry and exit points
so that these stats can be exposed inside of network namespaces.
This is an intermediate step, this just exposes the global counters in
the network namespace. Subsequent patches will move these counters into
the per-network namespace container.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We're going to merge the stats all into per network namespace in
subsequent patches, rename these nn counters to be consistent with the
rest of the stats.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that this isn't used anywhere, remove it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since only one service actually reports the rpc stats there's not much
of a reason to have a pointer to it in the svc_program struct. Adjust
the svc_create_pooled function to take the sv_stats as an argument and
pass the struct through there as desired instead of getting it from the
svc_program->pg_stats.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A lot of places are setting a blank svc_stats in ->pg_stats and never
utilizing these stats. Remove all of these extra structs as we're not
reporting these stats anywhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The maxcount is the maximum number of bytes for the LISTXATTRS4resok
result. This includes the cookie and the count for the name array,
thus subtract 12 bytes from the maxcount: 8 (cookie) + 4 (array count)
when filling up the name array.
Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the XDR buffer is not large enough to fit all attributes
and the remaining bytes left in the XDR buffer (xdrleft) is
equal to the number of bytes for the current attribute, then
the loop will prematurely exit without setting eof to FALSE.
Also in this case, adding the eof flag to the buffer will
make the reply 4 bytes larger than lsxa_maxcount.
Need to check if there are enough bytes to fit not only the
next attribute name but also the eof as well.
Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Function nfsd4_listxattr_validate_cookie() expects the cookie
as an offset to the list thus it needs to be encoded in big-endian.
Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If LISTXATTRS is sent with a correct cookie but a small maxcount,
this could lead function nfsd4_listxattr_validate_cookie to
return NFS4ERR_BAD_COOKIE. If maxcount = 20, then second check
on function gives RHS = 3 thus any cookie larger than 3 returns
NFS4ERR_BAD_COOKIE.
There is no need to validate the cookie on the return XDR buffer
since attribute referenced by cookie will be the first in the
return buffer.
Fixes: 23e50fe3a5 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Calling fput() directly or though filp_close() from a kernel thread like
nfsd causes the final __fput() (if necessary) to be called from a
workqueue. This means that nfsd is not forced to wait for any work to
complete. If the ->release or ->destroy_inode function is slow for any
reason, this can result in nfsd closing files more quickly than the
workqueue can complete the close and the queue of pending closes can
grow without bounces (30 million has been seen at one customer site,
though this was in part due to a slowness in xfs which has since been
fixed).
nfsd does not need this. It is quite appropriate and safe for nfsd to
do its own close work. There is no reason that close should ever wait
for nfsd, so no deadlock can occur.
It should be safe and sensible to change all fput() calls to
__fput_sync(). However in the interests of caution this patch only
changes two - the two that can be most directly affected by client
behaviour and could occur at high frequency.
- the fput() implicitly in flip_close() is changed to __fput_sync()
by calling get_file() first to ensure filp_close() doesn't do
the final fput() itself. If is where files opened for IO are closed.
- the fput() in nfsd_read() is also changed. This is where directories
opened for readdir are closed.
This ensure that minimal fput work is queued to the workqueue.
This removes the need for the flush_delayed_fput() call in
nfsd_file_close_inode_sync()
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The work of closing a file can have non-trivial cost. Doing it in a
separate work queue thread means that cost isn't imposed on the nfsd
threads and an imbalance can be created. This can result in files being
queued for the work queue more quickly that the work queue can process
them, resulting in unbounded growth of the queue and memory exhaustion.
To avoid this work imbalance that exhausts memory, this patch moves all
closing of files into the nfsd threads. This means that when the work
imposes a cost, that cost appears where it would be expected - in the
work of the nfsd thread. A subsequent patch will ensure the final
__fput() is called in the same (nfsd) thread which calls filp_close().
Files opened for NFSv3 are never explicitly closed by the client and are
kept open by the server in the "filecache", which responds to memory
pressure, is garbage collected even when there is no pressure, and
sometimes closes files when there is particular need such as for rename.
These files currently have filp_close() called in a dedicated work
queue, so their __fput() can have no effect on nfsd threads.
This patch discards the work queue and instead has each nfsd thread call
flip_close() on as many as 8 files from the filecache each time it acts
on a client request (or finds there are no pending client requests). If
there are more to be closed, more threads are woken. This spreads the
work of __fput() over multiple threads and imposes any cost on those
threads.
The number 8 is somewhat arbitrary. It needs to be greater than 1 to
ensure that files are closed more quickly than they can be added to the
cache. It needs to be small enough to limit the per-request delays that
will be imposed on clients when all threads are busy closing files.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert the qnx4 filesystem to use the new mount API.
Tested mount, umount, and remount using a qnx4 boot image.
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Link: https://lore.kernel.org/r/20240229161649.800957-1-bodonnel@redhat.com
Acked-by: Anders Larsen <al@alarsen.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The function inode_set_ctime_current simply retrieves the current time
and assigns it to the field __i_ctime without any alterations. Therefore,
it is possible to set ctime to now directly using inode_set_ctime_to_ts
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Link: https://lore.kernel.org/r/20240228173031.3208743-1-phind.uet@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Both pidfs and nsfs use a memory location to stash a dentry for reuse by
concurrent openers. Right now two custom
dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do
the same thing. The only thing that differs is that they need to get to
the memory location to store or retrieve the dentry from differently.
Fix that by remember the stashing location for the dentry in
dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That
in turn makes it possible to add a common helper that pidfs and nsfs can
both use.
Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In earlier patches we moved both nsfs and pidfs to path_from_stashed().
The helper currently tries to add and stash a new dentry if a reusable
dentry couldn't be found and returns EAGAIN if it lost the race to stash
the dentry. The caller can use EAGAIN to retry.
The helper and the two filesystems be written in a way that makes
returning EAGAIN unnecessary. To do this we need to change the
dentry->d_prune() implementation of nsfs and pidfs to not simply replace
the stashed dentry with NULL but to use a cmpxchg() and only replace
their own dentry.
Then path_from_stashed() can then be changed to not just stash a new
dentry when no dentry is currently stashed but also when an already dead
dentry is stashed. If another task managed to install a dentry in the
meantime it can simply be reused. Pack that into a loop and call it a
day.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
selinux denials and thus various userspace components that make heavy
use of pidfds to fail as pidfds used anon_inode_getfile() which aren't
subject to any LSM hooks. But dentry_open() is and that would cause
regressions.
The failures that are seen are selinux denials. But the core failure is
dbus-broker. That cascades into other services failing that depend on
dbus-broker. For example, when dbus-broker fails to start polkit and all
the others won't be able to work because they depend on dbus-broker.
The reason for dbus-broker failing is because it doesn't handle failures
for SO_PEERPIDFD correctly. Last kernel release we introduced
SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit
and others to receive a pidfd for the peer of an AF_UNIX socket. This is
the first time in the history of Linux that we can safely authenticate
clients in a race-free manner.
dbus-broker immediately made use of this but messed up the error
checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD.
That's obviously problematic not just because of LSM denials but because
of seccomp denials that would prevent SO_PEERPIDFD from working; or any
other new error code from there.
So this is catching a flawed implementation in dbus-broker as well. It
has to fallback to the old pid-based authentication when SO_PEERPIDFD
doesn't work no matter the reasons otherwise it'll always risk such
failures. So overall that LSM denial should not have caused dbus-broker
to fail. It can never assume that a feature released one kernel ago like
SO_PEERPIDFD can be assumed to be available.
So, the next fix separate from the selinux policy update is to try and
fix dbus-broker at [3]. That should make it into Fedora as well. In
addition the selinux reference policy should also be updated. See [4]
for that. If Selinux is in enforcing mode in userspace and it encounters
anything that it doesn't know about it will deny it by default. And the
policy is entirely in userspace including declaring new types for stuff
like nsfs or pidfs to allow it.
For now we continue to raise S_PRIVATE on the inode if it's a pidfs
inode which means things behave exactly like before.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630
Link: https://github.com/fedora-selinux/selinux-policy/pull/2050
Link: https://github.com/bus1/dbus-broker/pull/343 [3]
Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X
Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:
* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
file->private_data. This means it is now possible to introduce
concepts that operate on a process once all file descriptors have been
closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
pid will refer to the same inode if it's opened multiple times. In
contrast to now where each struct pid refers to the same inode. Even
if we were to move to anon_inode_create_getfile() which creates new
inodes we'd still be associating the same struct pid with multiple
different inodes.
The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.
We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.
The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.
Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When creating a snapshot we may do a double free of an anonymous device
in case there's an error committing the transaction. The second free may
result in freeing an anonymous device number that was allocated by some
other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number
and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at
transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device
number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because
btrfs_lookup_fs_root() returned a root - someone else did a lookup
of the new root already, which could some task doing backref walking;
5) After that some error happens in the transaction commit path, and at
ioctl.c:create_snapshot() we jump to the 'fail' label, and after
that we free again the same anonymous device number, which in the
meanwhile may have been reallocated somewhere else, because
pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------
ida_free called for id=51 which is not allocated.
WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525
Modules linked in:
CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525
Code: 10 42 80 3c 28 (...)
RSP: 0018:ffffc90015a67300 EFLAGS: 00010246
RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000
RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4
R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246
R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246
FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0
Call Trace:
<TASK>
btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346
create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931
btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404
create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306
btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393
btrfs_ioctl+0xa74/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fca3e67dda9
Code: 28 00 00 00 (...)
RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9
RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658
</TASK>
Where we get an explicit message where we attempt to free an anonymous
device number that is not currently allocated. It happens in a different
code path from the example below, at btrfs_get_root_ref(), so this change
may not fix the case triggered by syzbot.
To fix at least the code path from the example above, change
btrfs_get_root_ref() and its callers to receive a dev_t pointer argument
for the anonymous device number, so that in case it frees the number, it
also resets it to 0, so that up in the call chain we don't attempt to do
the double free.
CC: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/
Fixes: e03ee2fe87 ("btrfs: do not ASSERT() if the newly created subvolume already got read")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.
When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:
1) The fiemap task finishes processing a file extent item that covers
the file range [512K, 1M[, and that file extent item is the last item
in the leaf currently being processed;
2) And ordered extent for the file range [768K, 2M[, in COW mode,
completes (btrfs_finish_one_ordered()) and the file extent item
covering the range [512K, 1M[ is trimmed to cover the range
[512K, 768K[ and then a new file extent item for the range [768K, 2M[
is inserted in the inode's subvolume tree;
3) The fiemap task calls fiemap_next_leaf_item(), which then calls
btrfs_next_leaf() to find the next leaf / item. This finds that the
the next key following the one we previously processed (its type is
BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
to the new file extent item inserted by the ordered extent, which has
a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
the warning:
if (cache->offset + cache->len > offset) {
WARN_ON(1);
return -EINVAL;
}
Since we get 1M > 768K, because the previously emitted entry for the
old extent covering the file range [512K, 1M[ ends at an offset that
is greater than the new extent's start offset (768K). This makes fiemap
fail with -EINVAL besides triggering the warning that produces a stack
trace like the following:
[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76
There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.
So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:
1) For the first case, steps listed above, adjust the length of the
previously cached extent so that it does not overlap with the current
extent, emit the previous one and cache the current file extent item;
2) For the second case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range,
and the current file extent item has an offset that matches the offset
in the fiemap cache, just discard what we have in the fiemap cache and
assign the current file extent item to the cache, since it's more up
to date;
3) For the third case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range
and the offset of the file extent item we just found is smaller than
what we have in the cache, just skip the current file extent item
if its range end at or behind the cached extent's end, because we may
have emitted (to the fiemap user space buffer) delalloc ranges that
overlap with the current file extent item's range. If the file extent
item's range goes beyond the end offset of the cached extent, just
emit the cached extent and cache a subrange of the file extent item,
that goes from the end offset of the cached extent to the end offset
of the file extent item.
Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).
This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the pidfd file operations over to their own file in preparation of
implementing pidfs and to isolate them from other mostly unrelated
functionality in other files.
Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-1-f863f58cfce1@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
No filesystems depend on it anymore, and it is generally a bad idea.
Since all dentries should have the same set of dentry operations in
case-insensitive capable filesystems, it should be propagated through
->s_d_op.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-11-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
fscrypt now supports configuring dentry operations at dentry-creation
time through the preset sb->s_d_op, instead of at lookup time.
Enable this in ubifs, since the lookup-time mechanism is going away.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-10-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
This was already the case for case-insensitive before commit
bb9cd9106b ("fscrypt: Have filesystems handle their d_ops"), but it
was changed to set at lookup-time to facilitate the integration with
fscrypt. But it's a problem because dentries that don't get created
through ->lookup() won't have any visibility of the operations.
Since fscrypt now also supports configuring dentry operations at
creation-time, do it for any encrypted and/or casefold volume,
simplifying the implementation across these features.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-9-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
This was already the case for case-insensitive before commit
bb9cd9106b ("fscrypt: Have filesystems handle their d_ops"), but it
was changed to set at lookup-time to facilitate the integration with
fscrypt. But it's a problem because dentries that don't get created
through ->lookup() won't have any visibility of the operations.
Since fscrypt now also supports configuring dentry operations at
creation-time, do it for any encrypted and/or casefold volume,
simplifying the implementation across these features.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-8-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
In preparation to drop the similar helper that sets d_op at lookup time,
add a version to set the right d_op filesystem-wide, through sb->s_d_op.
The operations structures are shared across filesystems supporting
fscrypt and/or casefolding, therefore we can keep it in common libfs
code.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-7-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
In preparation to get case-insensitive dentry operations from sb->s_d_op
again, use the same structure with and without fscrypt.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-6-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Both fscrypt_prepare_lookup_partial and fscrypt_prepare_lookup will set
DCACHE_NOKEY_NAME for dentries when the key is not available. Extract
out a helper to set this flag in a single place, in preparation to also
add the optimization that will disable ->d_revalidate if possible.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-3-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
overlayfs relies on the filesystem setting DCACHE_OP_HASH or
DCACHE_OP_COMPARE to reject mounting over case-insensitive directories.
Since commit bb9cd9106b ("fscrypt: Have filesystems handle their
d_ops"), we set ->d_op through a hook in ->d_lookup, which
means the root dentry won't have them, causing the mount to accidentally
succeed.
In v6.7-rc7, the following sequence will succeed to mount, but any
dentry other than the root dentry will be a "weird" dentry to ovl and
fail with EREMOTE.
mkfs.ext4 -O casefold lower.img
mount -O loop lower.img lower
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl /mnt
Mounting on a subdirectory fails, as expected, because DCACHE_OP_HASH
and DCACHE_OP_COMPARE are properly set by ->lookup.
Fix by explicitly rejecting superblocks that allow case-insensitive
dentries. Yes, this will be solved when we move d_op configuration back
to ->s_d_op. Yet, we better have an explicit fix to avoid messing up
again.
While there, re-sort the entries to have more descriptive error messages
first.
Fixes: bb9cd9106b ("fscrypt: Have filesystems handle their d_ops")
Acked-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-2-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Casefolded comparisons are (obviously) way more costly than a simple
memcmp. Try the case-sensitive comparison first, falling-back to the
case-insensitive lookup only when needed. This allows any exact-match
lookup to complete without having to walk the utf8 trie.
Note that, for strict mode, generic_ci_d_compare used to reject an
invalid UTF-8 string, which would now be considered valid if it
exact-matches the disk-name. But, if that is the case, the filesystem
is corrupt. More than that, it really doesn't matter in practice,
because the name-under-lookup will have already been rejected by
generic_ci_d_hash and we won't even get here.
The memcmp is safe under RCU because we are operating on str/len instead
of dentry->d_name directly, and the caller guarantees their consistency
between each other in __d_lookup_rcu_op_compare.
Link: https://lore.kernel.org/r/87ttn2sip7.fsf_-_@mailhost.krisman.be
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1 (see [1]), so it became a dead flag since the
commit 16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And
the series[1] went on to mark it obsolete explicitly to avoid confusion
for users. Here we can just remove all its users, which has no any
functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz [1]
Link: https://lore.kernel.org/r/20240224134925.829677-1-chengming.zhou@linux.dev
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The following patch accidentally removed the code for delivering
completions for cancelled reads and writes to user space: "[PATCH 04/33]
aio: remove retry-based AIO"
(https://lore.kernel.org/all/1363883754-27966-5-git-send-email-koverstreet@google.com/)
>From that patch:
- if (kiocbIsCancelled(iocb)) {
- ret = -EINTR;
- aio_complete(iocb, ret, 0);
- /* must not access the iocb after this */
- goto out;
- }
This leads to a leak in user space of a struct iocb. Hence this patch
that restores the code that reports to user space that a read or write
has been cancelled successfully.
Fixes: 41003a7bcf ("aio: remove retry-based AIO")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-3-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If a directory has a block with only ".__afsXXXX" files in it (from
uncompleted silly-rename), these .__afsXXXX files are skipped but without
advancing the file position in the dir_context. This leads to
afs_dir_iterate() repeating the block again and again.
Fix this by making the code that skips the .__afsXXXX file also manually
advance the file position.
The symptoms are a soft lookup:
watchdog: BUG: soft lockup - CPU#3 stuck for 52s! [check:5737]
...
RIP: 0010:afs_dir_iterate_block+0x39/0x1fd
...
? watchdog_timer_fn+0x1a6/0x213
...
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? afs_dir_iterate_block+0x39/0x1fd
afs_dir_iterate+0x10a/0x148
afs_readdir+0x30/0x4a
iterate_dir+0x93/0xd3
__do_sys_getdents64+0x6b/0xd4
This is almost certainly the actual fix for:
https://bugzilla.kernel.org/show_bug.cgi?id=218496
Fixes: 57e9d49c54 ("afs: Hide silly-rename files from userspace")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/786185.1708694102@warthog.procyon.org.uk
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
FSDAX and reflink can work together now, let's drop this warning.
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXcsfAACgkQxWXV+ddt
WDt3XA/6AkPT8QNT+mOyp4NjPzquR4UMIPVGGvjWTeKNtjNnco9gPkOBWsHeeDQe
aiihh3X2NpNtsduEmqaz717EJW4za9lplGiyPR51H/pTfGfOthWL6Nj+auTPva3t
GnlYh+GUQ+44JJ5+biOK5HUpEEeUR87EN2z5lTWsHAxg7PolBiKYKvV4Wp33xJqR
ILGlYw04reOAljTn0Zf738IL5WpY9etj1GnNxQeEKFRrdF1GH1i6r/JRONU1hGHu
EiZT6XwoN07V+JURB+fPqtY1IXODDC8904OwLg5fKhBggWvR2IaiW1XH+ToFXQgU
idae1+Dy85Hi4s40SL5GcSO8mVHPEGEspwM/5G87YqIu3uH4L9+Wd4zTwVYLcwNm
SSUCDGj2d+/JIug5dPBV8GL7jrhPNnPOu8HR+bIxY9XUhyf+IZVlUNYlorup3lbm
rAouZiCevRhQRBAx33Id5ZOMhlIpPONKObcCEKmdm6WLlnkkqgKQbnapd/I/1mfT
nP5N7oWUtfXO4oq4k5XpJBcTVhXU+DzpQ7EMDGv3mSmIem0wsDmXPbF2MfoSIim8
UuToZ1YF5MuxNLGwYnpkUaxWhKKOFWMvAe65eXP+ureIjOJwQ4f85Nkro0JvKbr8
nVdzl3rDy49tnqW7Qu3vaNPOQneuWaOqCoQcYDcVAiqk11UhH9E=
=mBP6
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A more fixes for recently reported or discovered problems:
- fix corner case of send that would generate potentially large
stream of zeros if there's a hole at the end of the file
- fix chunk validation in zoned mode on conventional zones, it was
possible to create chunks that would not be allowed on sequential
zones
- fix validation of dev-replace ioctl filenames
- fix KCSAN warnings about access to block reserve struct members"
* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve
btrfs: fix data races when accessing the reserved amount of block reserves
btrfs: send: don't issue unnecessary zero writes for trailing hole
btrfs: dev-replace: properly validate device names
btrfs: zoned: don't skip block group profile checks on conventional zones
The addition of bal_rank_mask with encoding version 17 was merged
into ceph.git in Oct 2022 and made it into v18.2.0 release normally.
A few months later, the much delayed addition of max_xattr_size got
merged, also with encoding version 17, placed before bal_rank_mask
in the encoding -- but it didn't make v18.2.0 release.
The way this ended up being resolved on the MDS side is that
bal_rank_mask will continue to be encoded in version 17 while
max_xattr_size is now encoded in version 18. This does mean that
older kernels will misdecode version 17, but this is also true for
v18.2.0 and v18.2.1 clients in userspace.
The best we can do is backport this adjustment -- see ceph.git
commit 78abfeaff27fee343fb664db633de5b221699a73 for details.
[ idryomov: changelog ]
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/64440
Fixes: d93231a6bc ("ceph: prevent a client from exceeding the MDS maximum xattr size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When CONFIG_NTFS3_LZX_XPRESS is not set then we get the following build
error:
fs/ntfs3/frecord.c:2460:16: error: unused variable ‘i_size’
Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Fixes: 4fd6c08a16 ("fs/ntfs3: Use i_size_read and i_size_write")
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some more mostly boring fixes, but some not
User reported ones:
- the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty performance
bug; user reported an unter initially taking 2 seconds and then ~2
minutes
- kill a __GFP_NOFAIL in the buffered read path; this was a leftover
from the trickier fix to kill __GFP_NOFAIL in readahead, where we
can't return errors (and have to silently truncate the read
ourselves).
bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
filesystems because our folio state is just barely too big, 2MB
hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
additionally, the flags argument was just buggy, we weren't supplying
GFP_KERNEL previously (!).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXbqqMACgkQE6szbY3K
bnYjnhAApY0vT6eVIYrZ7JGR6tw++xw02xRkcNW4zFE8INAvxQor5TXMEKkJs9Ui
owh8WZjydXe0FJPE+pROcHMfxkkup4yP2SafgzR8DGERBwZbV9x7hvUbdG90EngY
V/MevV+vr6UaV7133sY70K8BqUA/yAlCmmtOQVFgGRprEtEPS4Ur3vYR5+IzA0N7
OhNXu6LxzkYbrNp9qroCN2UEVgRDJ/Mtda6uHfIUrqOQMUhiq2og9kvzJXzIrW9l
URxm4eFQtJe0Yz09Ppypve+FutJIbtuDEYbcMJNT9Ig7BosD5vDjy9nhp8A5Q1Uk
oDWBbCJhDdSYSVC/EQY8bv0AaCkyCa7vshSoKq0fDCFJ8k+nQ1YMF5wNhfgJhtU9
Tl2Qytphp9/dxkvpIsR/5iNhLply9xTka1Wkp3G+3QJk0c17Dftpvz0/WhKI0P2B
d6y4mz/hfCtWoSQOJbJl3fM/ZVpjH54VHDmb7sGyb5f+bTUkX6OUoJ4os8MNKGcS
GdpEoWt/IAQj69c7w8aama5TXJ4kYe0XtXwbHTRE4j1PIQJA5SPvVt+32spRtb6i
1gIa94uWKYMuG2U0XGxookHfZZZaMQkl79oXJOYRiC589YVyZC1Lp5iqr027jHEQ
1HacrWPekPfmrhchyIzpH1mHOgaS+FKoD7eKrkvj0QSxpwfwpbI=
=KNWR
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Some more mostly boring fixes, but some not
User reported ones:
- the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
performance bug; user reported an untar initially taking two
seconds and then ~2 minutes
- kill a __GFP_NOFAIL in the buffered read path; this was a leftover
from the trickier fix to kill __GFP_NOFAIL in readahead, where we
can't return errors (and have to silently truncate the read
ourselves).
bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
filesystems because our folio state is just barely too big, 2MB
hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
additionally, the flags argument was just buggy, we weren't
supplying GFP_KERNEL previously (!)"
* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
bcachefs: fix bch2_save_backtrace()
bcachefs: Fix check_snapshot() memcpy
bcachefs: Fix bch2_journal_flush_device_pins()
bcachefs: fix iov_iter count underflow on sub-block dio read
bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
bcachefs: Kill __GFP_NOFAIL in buffered read path
bcachefs: fix backpointer_to_text() when dev does not exist
- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmXbQekRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qpkyg//cjKnuzI2gHL3ff1o0jiTMD11Y/EmwzQS
C+JgE8WDMjUlVQpbwYnFWo/6se8qvk/fwTXPcRc45piF8YNZVchwsagxO626Kab7
0xTX7ZUgjuth6ahrkAJZkJNMgO4mYf928uYB6EVWaTb4iF0iT6glEOsSVAR/cssL
wYKbtgv3OP9t6fQcN/XL31hRkqr2MPk0y5Q27KyT4zo4lrx3xyah7Ndo3aEK/RcM
+6FUwqRiDsgDF/Ga65ylDvEp9eA03OFNHBn4DrORe3B9KV75NkmSJf/8QVEceNV/
9D072Hvt7/iyOq53AxWH3Jvp7aro/i0rvAHPbXZX4RVyqcJxaLYCQyrBFvQL0/Ie
B+793Iua8zbkQCbZ85LpTGrxAb5WydlSJp10AuHTr2MO+wS8bBqf96Jp9x1MuS9D
vqq7jbjwuZfnFUjpzu49GF6htG+WRVgY5TDzU6IAr3izXuqVxmz+zw5zExbGMmKm
2S5lb+q68DOBP4YaelAQHh97k0pYDW3eQ6GhDD2FA4P/DOr8p3vszZFBeaqaL70v
WS8z1wzOgEaleTRN4iRFlMCTvRLjOtDEaUNquohcHqJLe7w/DF7gzCU+0//8dVjD
cZieFX8uZDtzzcjrlYWeTo8oHd0q2WYixbCN8P92/YXUFIVpfdl85ugyr9KvkZxd
jKzpAy2LbSw=
=/Bfj
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking
* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix refcount on the metabuf used for inode lookup
pathwalk. This series is a result of code audit (the second round
of it) and it should deal with most of that stuff. Exceptions: ntfs3
->d_hash()/->d_compare() and ceph_d_revalidate(). Up to maintainers (a
note for NTFS folks - when documentation says that a method may not block,
it *does* imply that blocking allocations are to be avoided. Really).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdroDAAKCRBZ7Krx/gZQ
60dKAQCzp6rYr3ye4nylho9Rzu8LEpH04TuNf3C6JuyUaNHxHwEAvNLatZsyFnmV
Ksp2Rg/IlKPNtQgYJ8xPxv9DFmNe8gI=
=47Un
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU pathwalk fixes from Al Viro:
"We still have some races in filesystem methods when exposed to RCU
pathwalk. This series is a result of code audit (the second round of
it) and it should deal with most of that stuff.
Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
Up to maintainers (a note for NTFS folks - when documentation says
that a method may not block, it *does* imply that blocking allocations
are to be avoided. Really)"
[ More explanations for people who aren't familiar with the vagaries of
RCU path walking: most of it is hidden from filesystems, but if a
filesystem actively participates in the low-level path walking it
needs to make sure the fields involved in that walk are RCU-safe.
That "actively participate in low-level path walking" includes things
like having its own ->d_hash()/->d_compare() routines, or by having
its own directory permission function that doesn't just use the common
helpers. Having a ->d_revalidate() function will also have this issue.
Note that instead of making everything RCU safe you can also choose to
abort the RCU pathwalk if your operation cannot be done safely under
RCU, but that obviously comes with a performance penalty. One common
pattern is to allow the simple cases under RCU, and abort only if you
need to do something more complicated.
So not everything needs to be RCU-safe, and things like the inode etc
that the VFS itself maintains obviously already are. But these fixes
tend to be about properly RCU-delaying things like ->s_fs_info that
are maintained by the filesystem and that got potentially released too
early. - Linus ]
* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ext4_get_link(): fix breakage in RCU mode
cifs_get_link(): bail out in unsafe case
fuse: fix UAF in rcu pathwalks
procfs: make freeing proc_fs_info rcu-delayed
procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
nfs: fix UAF on pathwalk running into umount
nfs: make nfs_set_verifier() safe for use in RCU pathwalk
afs: fix __afs_break_callback() / afs_drop_open_mmap() race
hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
affs: free affs_sb_info with kfree_rcu()
rcu pathwalk: prevent bogus hard errors from may_lookup()
fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
and a fix for erofs failure exit breakage (had been there since
way back).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdrkZAAKCRBZ7Krx/gZQ
67D8AP0eM68yZvbThA/Hb5iElDh3Aogt1AW/QAu9/alkDVHr+wD+PKqhamC8WXGk
b1QZ5AOHQFwzkzdF4738fdbujquBWQE=
=Ra0D
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes - revert of regression from this cycle and a fix for
erofs failure exit breakage (had been there since way back)"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
erofs: fix handling kern_mount() failure
Revert "get rid of DCACHE_GENOCIDE"
Add two new helpers to allow opening block devices as files.
This is not the final infrastructure. This still opens the block device
before opening a struct a file. Until we have removed all references to
struct bdev_handle we can't switch the order:
* Introduce blk_to_file_flags() to translate from block specific to
flags usable to pen a new file.
* Introduce bdev_file_open_by_{dev,path}().
* Introduce temporary sb_bdev_handle() helper to retrieve a struct
bdev_handle from a block device file and update places that directly
reference struct bdev_handle to rely on it.
* Don't count block device openes against the number of open files. A
bdev_file_open_by_{dev,path}() file is never installed into any
file descriptor table.
One idea that came to mind was to use kernel_tmpfile_open() which
would require us to pass a path and it would then call do_dentry_open()
going through the regular fops->open::blkdev_open() path. But then we're
back to the problem of routing block specific flags such as
BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste
FMODE_* flags every time we add a new one. With this we can avoid using
a flag bit and we have more leeway in how we open block devices from
bdev_open_by_{dev,path}().
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
In order to add a helper to open files that aren't accounted split
alloc_file() and parts of alloc_file_pseudo() into helpers. One to
prepare a path, another one to setup the file.
Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240129160241.GA2793@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
The 'duplicates' bool argument is always true when efivar_init() is
called from its only caller so let's just drop it instead.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Al points out that kill_sb() will be called if efivarfs_fill_super()
fails and so there is no point in cleaning up the efivar entry list.
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Work around a quirk in a few old (2011-ish) UEFI implementations, where
a call to `GetNextVariableName` with a buffer size larger than 512 bytes
will always return EFI_INVALID_PARAMETER.
There is some lore around EFI variable names being up to 1024 bytes in
size, but this has no basis in the UEFI specification, and the upper
bounds are typically platform specific, and apply to the entire variable
(name plus payload).
Given that Linux does not permit creating files with names longer than
NAME_MAX (255) bytes, 512 bytes (== 256 UTF-16 characters) is a
reasonable limit.
Cc: <stable@vger.kernel.org> # 6.1+
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->d_revalidate() bails out there, anyway. It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup. Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.
Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
server->flags
server->caps
*(server->io_stats)
and, worst of all, call
server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()). We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry. For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked(). It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.
That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time. If that happens, back-to-back attempts to kill dentry and
its parent are quite normal. Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
do queue_work(&vnode->cb_work). In afs_drop_open_mmap() we decrement
->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.
The trouble is, there's nothing to prevent __afs_break_callback() from
seeing ->cb_nr_mmap before the decrement and do queue_work() after both
the decrement and flush_work(). If that happens, we might be in trouble -
vnode might get freed before the queued work runs.
__afs_break_callback() is always done under ->cb_lock, so let's make
sure that ->cb_nr_mmap can change from non-zero to zero while holding
->cb_lock (the spinlock component of it - it's a seqlock and we don't
need to mess with the counter).
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->d_hash() and ->d_compare() use those, so we need to delay freeing
them.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
that is in process of getting shut down.
Besides, having nls and upcase table cleanup moved from ->put_super() towards
the place where sbi is freed makes for simpler failure exits.
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
one of the flags in it is used by ->d_hash()/->d_compare()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If lazy call of ->permission() returns a hard error, check that
try_to_unlazy() succeeds before returning it. That both makes
life easier for ->permission() instances and closes the race
in ENOTDIR handling - it is possible that positive d_can_lookup()
seen in link_path_walk() applies to the state *after* unlink() +
mkdir(), while nd->inode matches the state prior to that.
Normally seeing e.g. EACCES from permission check in rcu pathwalk
means that with some timings non-rcu pathwalk would've run into
the same; however, running into a non-executable regular file
in the middle of a pathname would not get to permission check -
it would fail with ENOTDIR instead.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Avoids fun races in RCU pathwalk... Same goes for freeing LSM shite
hanging off super_block's arse.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
check_snapshot() copies the bch_snapshot to a temporary to easily handle
older versions that don't have all the fields of the current version,
but it lacked a min() to correctly handle keys newer and larger than the
current version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large
enough. This is triggered by fstest generic/361 with a 4k block
size, which subsequently leads to a crash. To avoid this crash,
check that the shorten length doesn't exceed the overall length of
the iter.
Fixes:
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.
Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).
But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the
easy one in read_single_folio() (where wa can return an error) was
missed - oops.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Calling CBC with ciphertext stealing "CBC-CTS" seems to be more common
than calling it "CTS-CBC". E.g., CBC-CTS is used by OpenSSL, Crypto++,
RFC3962, and RFC6803. The NIST SP800-38A addendum uses CBC-CS1,
CBC-CS2, and CBC-CS3, distinguishing between different CTS conventions
but similarly putting the CBC part first. In the interest of avoiding
any idiosyncratic terminology, update the fscrypt documentation and the
fscrypt_mode "friendly names" to align with the more common convention.
Changing the "friendly names" only affects some log messages. The
actual mode constants in the API are unchanged; those call it simply
"CTS". Add a note to the documentation that clarifies that "CBC" and
"CTS" in the API really mean CBC-ESSIV and CBC-CTS, respectively.
Link: https://lore.kernel.org/r/20240224053550.44659-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure. Ensure the allocation was successful
by checking the pointer validity.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20240118100206.213928-1-chentao@kylinos.cn
Signed-off-by: Kees Cook <keescook@chromium.org>
Register the compatible for this module on the module device table so
it can be automatically loaded when a matching DT node is present,
allowing logging of panics and oopses without any intervention.
Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20240110210600.787703-2-nfraprado@collabora.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Unloading a modular pstore backend with records in pstorefs would
trigger the dput() double-drop warning:
WARNING: CPU: 0 PID: 2569 at fs/dcache.c:762 dput.part.0+0x3f3/0x410
Using the combo of d_drop()/dput() (as mentioned in
Documentation/filesystems/vfs.rst) isn't the right approach here, and
leads to the reference counting problem seen above. Use d_invalidate()
and update the code to not bother checking for error codes that can
never happen.
Suggested-by: Alexander Viro <viro@zeniv.linux.org.uk>
Fixes: 609e28bb13 ("pstore: Remove filesystem records when backend is unregistered")
Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-hardening@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZddOmwAKCRCRxhvAZXjc
oq1lAQDus0SGgwuwArdHtbbVj+gTs4s5XKvuGI6mqRiLvgvTzwD/TTNnOqJjWacS
on7XxDHgnjbMR2r90W/MuyPPjtAPkgA=
=i2E/
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix a memory leak in cachefiles
- Restrict aio cancellations to I/O submitted through the aio
interfaces as this is otherwise causing issues for I/O submitted
via io_uring
- Increase buffer for afs volume status to avoid overflow
- Fix a missing zero-length check in unbuffered writes in the
netfs library. If generic_write_checks() returns zero make
netfs_unbuffered_write_iter() return right away
- Prevent a leak in i_dio_count caused by netfs_begin_read() operating
past i_size. It will return early and leave i_dio_count incremented
- Account for ipv4 addresses as well as ipv6 addresses when processing
incoming callbacks in afs
* tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
afs: Increase buffer size in afs_update_volume_status()
afs: Fix ignored callbacks over ipv4
cachefiles: fix memory leak in cachefiles_add_cache()
netfs: Fix missing zero-length check in unbuffered write
netfs: Fix i_dio_count leak on DIO read past i_size
At btrfs_use_block_rsv() we read the size of a block reserve without
locking its spinlock, which makes KCSAN complain because the size of a
block reserve is always updated while holding its spinlock. The report
from KCSAN is the following:
[653.313148] BUG: KCSAN: data-race in btrfs_update_delayed_refs_rsv [btrfs] / btrfs_use_block_rsv [btrfs]
[653.314755] read to 0x000000017f5871b8 of 8 bytes by task 7519 on cpu 0:
[653.314779] btrfs_use_block_rsv+0xe4/0x2f8 [btrfs]
[653.315606] btrfs_alloc_tree_block+0xdc/0x998 [btrfs]
[653.316421] btrfs_force_cow_block+0x220/0xe38 [btrfs]
[653.317242] btrfs_cow_block+0x1ac/0x568 [btrfs]
[653.318060] btrfs_search_slot+0xda2/0x19b8 [btrfs]
[653.318879] btrfs_del_csums+0x1dc/0x798 [btrfs]
[653.319702] __btrfs_free_extent.isra.0+0xc24/0x2028 [btrfs]
[653.320538] __btrfs_run_delayed_refs+0xd3c/0x2390 [btrfs]
[653.321340] btrfs_run_delayed_refs+0xae/0x290 [btrfs]
[653.322140] flush_space+0x5e4/0x718 [btrfs]
[653.322958] btrfs_preempt_reclaim_metadata_space+0x102/0x2f8 [btrfs]
[653.323781] process_one_work+0x3b6/0x838
[653.323800] worker_thread+0x75e/0xb10
[653.323817] kthread+0x21a/0x230
[653.323836] __ret_from_fork+0x6c/0xb8
[653.323855] ret_from_fork+0xa/0x30
[653.323887] write to 0x000000017f5871b8 of 8 bytes by task 576 on cpu 3:
[653.323906] btrfs_update_delayed_refs_rsv+0x1a4/0x250 [btrfs]
[653.324699] btrfs_add_delayed_data_ref+0x468/0x6d8 [btrfs]
[653.325494] btrfs_free_extent+0x76/0x120 [btrfs]
[653.326280] __btrfs_mod_ref+0x6a8/0x6b8 [btrfs]
[653.327064] btrfs_dec_ref+0x50/0x70 [btrfs]
[653.327849] walk_up_proc+0x236/0xa50 [btrfs]
[653.328633] walk_up_tree+0x21c/0x448 [btrfs]
[653.329418] btrfs_drop_snapshot+0x802/0x1328 [btrfs]
[653.330205] btrfs_clean_one_deleted_snapshot+0x184/0x238 [btrfs]
[653.330995] cleaner_kthread+0x2b0/0x2f0 [btrfs]
[653.331781] kthread+0x21a/0x230
[653.331800] __ret_from_fork+0x6c/0xb8
[653.331818] ret_from_fork+0xa/0x30
So add a helper to get the size of a block reserve while holding the lock.
Reading the field while holding the lock instead of using the data_race()
annotation is used in order to prevent load tearing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At space_info.c we have several places where we access the ->reserved
field of a block reserve without taking the block reserve's spinlock
first, which makes KCSAN warn about a data race since that field is
always updated while holding the spinlock.
The reports from KCSAN are like the following:
[117.193526] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / need_preemptive_reclaim [btrfs]
[117.195148] read to 0x000000017f587190 of 8 bytes by task 6303 on cpu 3:
[117.195172] need_preemptive_reclaim+0x222/0x2f0 [btrfs]
[117.195992] __reserve_bytes+0xbb0/0xdc8 [btrfs]
[117.196807] btrfs_reserve_metadata_bytes+0x4c/0x120 [btrfs]
[117.197620] btrfs_block_rsv_add+0x78/0xa8 [btrfs]
[117.198434] btrfs_delayed_update_inode+0x154/0x368 [btrfs]
[117.199300] btrfs_update_inode+0x108/0x1c8 [btrfs]
[117.200122] btrfs_dirty_inode+0xb4/0x140 [btrfs]
[117.200937] btrfs_update_time+0x8c/0xb0 [btrfs]
[117.201754] touch_atime+0x16c/0x1e0
[117.201789] filemap_read+0x674/0x728
[117.201823] btrfs_file_read_iter+0xf8/0x410 [btrfs]
[117.202653] vfs_read+0x2b6/0x498
[117.203454] ksys_read+0xa2/0x150
[117.203473] __s390x_sys_read+0x68/0x88
[117.203495] do_syscall+0x1c6/0x210
[117.203517] __do_syscall+0xc8/0xf0
[117.203539] system_call+0x70/0x98
[117.203579] write to 0x000000017f587190 of 8 bytes by task 11 on cpu 0:
[117.203604] btrfs_block_rsv_release+0x2e8/0x578 [btrfs]
[117.204432] btrfs_delayed_inode_release_metadata+0x7c/0x1d0 [btrfs]
[117.205259] __btrfs_update_delayed_inode+0x37c/0x5e0 [btrfs]
[117.206093] btrfs_async_run_delayed_root+0x356/0x498 [btrfs]
[117.206917] btrfs_work_helper+0x160/0x7a0 [btrfs]
[117.207738] process_one_work+0x3b6/0x838
[117.207768] worker_thread+0x75e/0xb10
[117.207797] kthread+0x21a/0x230
[117.207830] __ret_from_fork+0x6c/0xb8
[117.207861] ret_from_fork+0xa/0x30
So add a helper to get the reserved amount of a block reserve while
holding the lock. The value may be not be up to date anymore when used by
need_preemptive_reclaim() and btrfs_preempt_reclaim_metadata_space(), but
that's ok since the worst it can do is cause more reclaim work do be done
sooner rather than later. Reading the field while holding the lock instead
of using the data_race() annotation is used in order to prevent load
tearing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have a sparse file with a trailing hole (from the last extent's end
to i_size) and then create an extent in the file that ends before the
file's i_size, then when doing an incremental send we will issue a write
full of zeroes for the range that starts immediately after the new extent
ends up to i_size. While this isn't incorrect because the file ends up
with exactly the same data, it unnecessarily results in using extra space
at the destination with one or more extents full of zeroes instead of
having a hole. In same cases this results in using megabytes or even
gigabytes of unnecessary space.
Example, reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create 1G sparse file.
xfs_io -f -c "truncate 1G" $MNT/foobar
# Create base snapshot.
btrfs subvolume snapshot -r $MNT $MNT/mysnap1
# Create send stream (full send) for the base snapshot.
btrfs send -f /tmp/1.snap $MNT/mysnap1
# Now write one extent at the beginning of the file and one somewhere
# in the middle, leaving a gap between the end of this second extent
# and the file's size.
xfs_io -c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 512M 128K" \
$MNT/foobar
# Now create a second snapshot which is going to be used for an
# incremental send operation.
btrfs subvolume snapshot -r $MNT $MNT/mysnap2
# Create send stream (incremental send) for the second snapshot.
btrfs send -p $MNT/mysnap1 -f /tmp/2.snap $MNT/mysnap2
# Now recreate the filesystem by receiving both send streams and
# verify we get the same content that the original filesystem had
# and file foobar has only two extents with a size of 128K each.
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -f /tmp/1.snap $MNT
btrfs receive -f /tmp/2.snap $MNT
echo -e "\nFile fiemap in the second snapshot:"
# Should have:
#
# 128K extent at file range [0, 128K[
# hole at file range [128K, 512M[
# 128K extent file range [512M, 512M + 128K[
# hole at file range [512M + 128K, 1G[
xfs_io -r -c "fiemap -v" $MNT/mysnap2/foobar
# File should be using 256K of data (two 128K extents).
echo -e "\nSpace used by the file: $(du -h $MNT/mysnap2/foobar | cut -f 1)"
umount $MNT
Running the test, we can see with fiemap that we get an extent for the
range [512M, 1G[, while in the source filesystem we have an extent for
the range [512M, 512M + 128K[ and a hole for the rest of the file (the
range [512M + 128K, 1G[):
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..2097151]: 2156544..3205119 1048576 0x1
Space used by the file: 513M
This happens because once we finish processing an inode, at
finish_inode_if_needed(), we always issue a hole (write operations full
of zeros) if there's a gap between the end of the last processed extent
and the file's size, even if that range is already a hole in the parent
snapshot. Fix this by issuing the hole only if the range is not already
a hole.
After this change, running the test above, we get the expected layout:
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..1048831]: 26880..27135 256 0x1
3: [1048832..2097151]: hole 1048320
Space used by the file: 256K
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Reported-by: Dorai Ashok S A <dash.btrfs@inix.me>
Link: https://lore.kernel.org/linux-btrfs/c0bf7818-9c45-46a8-b3d3-513230d0c86e@inix.me/
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a zoned filesystem with conventional zones, we're skipping the block
group profile checks for the conventional zones.
This allows converting a zoned filesystem's data block groups to RAID when
all of the zones backing the chunk are on conventional zones. But this
will lead to problems, once we're trying to allocate chunks backed by
sequential zones.
So also check for conventional zones when loading a block group's profile
on them.
Reported-by: HAN Yuwei <hrx@bupt.moe>
Link: https://lore.kernel.org/all/1ACD2E3643008A17+da260584-2c7f-432a-9e22-9d390aae84cc@bupt.moe/#t
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert the coda filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.rst for more information.
Note this is slightly tricky as coda currently only has a binary mount data
interface. This is handled through the parse_monolithic hook.
Also add a more conventional interface with a parameter named "fd" that
takes an fd that refers to a coda psdev, thereby specifying the index to
use.
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: forward port to current upstream mount API interfaces]
Link: https://lore.kernel.org/r/97650eeb-94c7-4041-b58c-90e81e76b699@redhat.com
Tested-by: Jan Harkes <jaharkes@cs.cmu.edu>
Reviewed-by: Ian Kent <raven@themaw.net>
cc: coda@cs.cmu.edu
Signed-off-by: Christian Brauner <brauner@kernel.org>
In '__ep_remove()', prefer 'kfree_rcu()' over 'call_rcu()' with
dummy 'epi_rcu_free()' callback. This follows commit d0089603fa
("fs: prefer kfree_rcu() in fasync_remove_entry()") and should not
be backported to stable as well.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20240221112205.48389-2-dmantipov@yandex.ru
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull simple offset series from Chuck Lever
In an effort to address slab fragmentation issues reported a few
months ago, I've replaced the use of xarrays for the directory
offset map in "simple" file systems (including tmpfs).
Thanks to Liam Howlett for helping me get this working with Maple
Trees.
* series 'Use Maple Trees for simple_offset utilities' of https://lore.kernel.org/r/170820083431.6328.16233178852085891453.stgit@91.116.238.104.host.secureserver.net: (6 commits)
libfs: Convert simple directory offsets to use a Maple Tree
test_maple_tree: testing the cyclic allocation
maple_tree: Add mtree_alloc_cyclic()
libfs: Add simple_offset_empty()
libfs: Define a minimum directory offset
libfs: Re-arrange locking in offset_iterate_dir()
Signed-off-by: Christian Brauner <brauner@kernel.org>
In erofs_find_target_block() when erofs_dirnamecmp() returns 0,
we do not assign the target metabuf. This causes the caller
erofs_namei()'s erofs_put_metabuf() at the end to be not effective
leaving the refcount on the page.
As the page from metabuf (buf->page) is never put, such page cannot be
migrated or reclaimed. Fix it now by putting the metabuf from
previous loop and assigning the current metabuf to target before
returning so caller erofs_namei() can do the final put as it was
intended.
Fixes: 500edd0956 ("erofs: use meta buffers for inode lookup")
Cc: <stable@vger.kernel.org> # 5.18+
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240221210348.3667795-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXV2z8ACgkQxWXV+ddt
WDsudxAAoKcp1DbuOtaOzG/XVnIKt36drK4cwyZnGo9PZ9vlgT6k+T0efto4DkOF
fNWy2d/9iGy9RHy4oxZL6ceb3rcWW0NbhiKHeTPNqL4ZCPa7t6bxMWXSYBh6pYgZ
6EUS6H9Don05F7rQs8rERc+VIW6u1HFTLn4wS1cmlTyTQzZwlk9B2V6KtDtHBi0k
B4CCxY6jX2bl7BncXdYteb13Xjg7+JnWvfSKb7ouSVnL8VEcGG13QkPFNV2Xsoi2
uDDsw+QKBEcPNgBIubBUwbLS5V5vYa1H1meUFJkciaeblHVlVIMN3h7+Y8VNKMTC
qpxEo3Hx6oqmw9LdEIU7WsvFs0JJum2fKOjOx3vr1d3AiyFG6W6lrm1fKwnp3dt7
dHdAYuo8+Q4rirGlMDcEoYqpy7AcV8QqtSYajdrdpB1dqHcHhukSNqJ0dx5lYElU
HtnMXD9vLc4uJDcl9Z1aTWEmB+7nj5HwukSnTqQhgwpCZM7mrz6pe1DuD5iKinG6
Yth9QFhgcbcnchPA+3SOtC6uuje9chHo6L6eTVacnKAKyKgLY8qTTsh3zYQNVhMX
M2aWcAkizq20vKe7JFxs7M/tClyuswTjOP6RYzeayY21Rn8gGwe7uhXhv5MKpEh5
TjXjiixsrxxsyyaED7Kl69i54BvmM/TI35p7Jbx6Ln7PPD8lf8M=
=pXyL
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- Fix a deadlock in fiemap.
There was a big lock around the whole operation that can interfere
with a page fault and mkwrite.
Reducing the lock scope can also speed up fiemap
- Fix range condition for extent defragmentation which could lead to
worse layout in some cases
* tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix deadlock with fiemap and extent locking
btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
following kernel warning appears:
WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
Call trace:
kiocb_set_cancel_fn+0x9c/0xa8
ffs_epfile_read_iter+0x144/0x1d0
io_read+0x19c/0x498
io_issue_sqe+0x118/0x27c
io_submit_sqes+0x25c/0x5fc
__arm64_sys_io_uring_enter+0x104/0xab0
invoke_syscall+0x58/0x11c
el0_svc_common+0xb4/0xf4
do_el0_svc+0x2c/0xb0
el0_svc+0x2c/0xa4
el0t_64_sync_handler+0x68/0xb4
el0t_64_sync+0x1a4/0x1a8
Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
submitted by libaio.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test robot reports:
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
> commit: a2e459555c ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
Feng Tang further clarifies that:
> ... the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
Willy's analysis is that, over time, the test workload causes
xa_alloc_cyclic() to fragment the underlying SLAB cache.
This patch replaces the offset_ctx's xarray with a Maple Tree in the
hope that Maple Tree's dense node mode will handle this scenario
more scalably.
In addition, we can widen the simple directory offset maximum to
signed long (as loff_t is also signed).
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
For simple filesystems that use directory offset mapping, rely
strictly on the directory offset map to tell when a directory has
no children.
After this patch is applied, the emptiness test holds only the RCU
read lock when the directory being tested has no children.
In addition, this adds another layer of confirmation that
simple_offset_add/remove() are working as expected.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/170820143463.6328.7872919188371286951.stgit@91.116.238.104.host.secureserver.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
Liam and Matthew say that once the RCU read lock is released,
xa_state is not safe to re-use for the next xas_find() call. But the
RCU read lock must be released on each loop iteration so that
dput(), which might_sleep(), can be called safely.
Thus we are forced to walk the offset tree with fresh state for each
directory entry. xa_find() can do this for us, though it might be a
little less efficient than maintaining xa_state locally.
We believe that in the current code base, inode->i_rwsem provides
protection for the xa_state maintained in
offset_iterate_dir(). However, there is no guarantee that will
continue to be the case in the future.
Since offset_iterate_dir() doesn't build xa_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.
Suggested-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Message-ID: <170785993027.11135.8830043889278631735.stgit@91.116.238.104.host.secureserver.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/170820142021.6328.15047865406275957018.stgit@91.116.238.104.host.secureserver.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
processed: The number of bytes processed by the body in the
most recent iteration, or a negative errno. 0 causes the iteration to
stop.
The processed is useful to check when the loop breaks.
Signed-off-by: Kassey Li <quic_yingangl@quicinc.com>
Link: https://lore.kernel.org/r/20240219021138.3481763-1-quic_yingangl@quicinc.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit fd07e0aa23c4 ("iomap: map multiple blocks at a time"), we
could map multi-blocks once a time, and the dirty_len indicates the
expected map length, map_len won't large than it. The pos and dirty_len
means the dirty range that should be mapped to write, add them into
trace_iomap_writepage_map() could be more useful for debug.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240220115759.3445025-1-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The FL_POSIX check in __locks_insert_block was inadvertantly broken
recently and is now inserting only OFD locks instead of only legacy
POSIX locks.
This breaks deadlock detection in POSIX locks, and may also be the root
cause of a performance regression noted by the kernel test robot.
Restore the proper sense of the test.
Fixes: b6be371400 ("filelock: convert __locks_insert_block, conflict and deadlock checks to use file_lock_core")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202402181229.f8147f40-oliver.sang@intel.com
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240218-flsplit4-v1-1-26454fc090f2@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
A while ago, we changed the way that select() and poll() preallocate
a temporary buffer just under the size of the static warning limit of
1024 bytes, as clang was frequently going slightly above that limit.
The warnings have recently returned and I took another look. As it turns
out, clang is not actually inherently worse at reserving stack space,
it just happens to inline do_select() into core_sys_select(), while gcc
never inlines it.
Annotate do_select() to never be inlined and in turn remove the special
case for the allocation size. This should give the same behavior for
both clang and gcc all the time and once more avoids those warnings.
Fixes: ad312f95d4 ("fs/select: avoid clang stack usage warning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240216202352.2492798-1-arnd@kernel.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
if you have a variable that holds NULL or a pointer to live struct mount,
do not shove ERR_PTR() into it - not if you later treat "not NULL" as
"holds a pointer to object".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With the following file extent layout, defrag would do unnecessary IO
and result more on-disk space usage.
# mkfs.btrfs -f $dev
# mount $dev $mnt
# xfs_io -f -c "pwrite 0 40m" $mnt/foobar
# sync
# xfs_io -f -c "pwrite 40m 16k" $mnt/foobar
# sync
Above command would lead to the following file extent layout:
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 298844160 nr 41943040
extent data offset 0 nr 41943040 ram 41943040
extent compression 0 (none)
item 7 key (257 EXTENT_DATA 41943040) itemoff 15763 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13631488 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0 (none)
Which is mostly fine. We can allow the final 16K to be merged with the
previous 40M, but it's upon the end users' preference.
But if we defrag the file using the default parameters, it would result
worse file layout:
# btrfs filesystem defrag $mnt/foobar
# sync
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 298844160 nr 41943040
extent data offset 0 nr 8650752 ram 41943040
extent compression 0 (none)
item 7 key (257 EXTENT_DATA 8650752) itemoff 15763 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 340787200 nr 33292288
extent data offset 0 nr 33292288 ram 33292288
extent compression 0 (none)
item 8 key (257 EXTENT_DATA 41943040) itemoff 15710 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13631488 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0 (none)
Note the original 40M extent is still there, but a new 32M extent is
created for no benefit at all.
[CAUSE]
There is an existing check to make sure we won't defrag a large enough
extent (the threshold is by default 32M).
But the check is using the length to the end of the extent:
range_len = em->len - (cur - em->start);
/* Skip too large extent */
if (range_len >= extent_thresh)
goto next;
This means, for the first 8MiB of the extent, the range_len is always
smaller than the default threshold, and would not be defragged.
But after the first 8MiB, the remaining part would fit the requirement,
and be defragged.
Such different behavior inside the same extent caused the above problem,
and we should avoid different defrag decision inside the same extent.
[FIX]
Instead of using @range_len, just use @em->len, so that we have a
consistent decision among the same file extent.
Now with this fix, we won't touch the extent, thus not making it any
worse.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 0cb5950f3f ("btrfs: fix deadlock when reserving space during defrag")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Eric Hong found that when using ftruncate to expand an empty file,
exfat_ent_set() will fail if discontinuous clusters are allocated.
The reason is that the empty file does not have a cluster chain,
but exfat_ent_set() attempts to append the newly allocated cluster
to the cluster chain. In addition, exfat_find_last_cluster() only
supports finding the last cluster in a non-empty file.
So this commit adds a check whether the file is empty. If the file
is empty, exfat_find_last_cluster() and exfat_ent_set() are no longer
called as they do not need to be called.
Fixes: f55c096f62 ("exfat: do not zero the extended part")
Reported-by: Eric Hong <erichong@qnap.com>
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXP2xEACgkQiiy9cAdy
T1EGMwwAiclDnZEtlHRK80Kmncnk7JaLnmHqLLfgIocVo0MDnaUrJVftj43F2OP/
za1uJfI9uM+lgM+vbEjldLxMf8yVnt6MpgzwXBZftppFpVPAwiwvGh4m4RAKreUj
SnE/WEINjj3p3dTJdnKup3Ff2hDXm9VBLpGPWwUTQ0RNWROO7hUtnTyeGRCZKEgy
uCVVU8AS5DAvBpuWZ+K0zG/omts42fP399GkQUD5Pz8DgOaFCMfQU33/zPFQEkPX
idnXH3KL4SHTFvytTTRCgHGg2x5MLersC6zYqwlgTD9YkGpr1UbuIRYvVO793Bek
VqhQTsnSnQSTuJ5lCH7elcLhF8XPF/qM+sRw/OoidgUwMMDjN0jMkwxBk+MX8Gw2
6lJfWAjTjRhanF6s501fEGd4YPFIBAysmW1XLiS12ot05s5k9D7sp+EjklPsSxvX
VrHHtcElY9rwh5sk7gZsmBl2U2x8k4ZvZeq6dIqQAvg+oQAbi9mTFrGuxsC3jD/x
LP2cFzhW
=v3wq
-----END PGP SIGNATURE-----
Merge tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Five smb3 client fixes, most also for stable:
- Two multichannel fixes (one to fix potential handle leak on retry)
- Work around possible serious data corruption (due to change in
folios in 6.3, for cases when non standard maximum write size
negotiated)
- Symlink creation fix
- Multiuser automount fix"
* tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: Fix regression in writes when non-standard maximum write size negotiated
smb: client: handle path separator of created SMB symlinks
smb: client: set correct id, uid and cruid for multiuser automounts
cifs: update the same create_guid on replay
cifs: fix underflow in parse_server_interfaces()
responding to mclientcaps(revoke)" stalls on the MDS side.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmXPnBgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi33XCACBiglCuzqv5/MTU7W/CaWOGYUL9OT2
dcP6lkFyexuVl7yjbiAwnBbAiefMr5jgBK27+20ZdT7VDzrtBeDB18al/QMv7r+0
TSIbUW3nLIph2LdodgKypJ6WOHPEpi4OTncFTlkfERDNQR3GXRDWJkI9pQWcRiYr
DTz0FvvMkDNitoHlXdD3RhEQ8M2gdoT5HXyns4YdCjc7aZekkwjkoG4Yf+/BWLUy
3v/2lcTdW6e6u6Pqu5I9xq+bnir6F9FIsERW1TaZfFwksQr/IMdJs0DTWzfwh26v
wJlyYYguSAC2/kJg52HWfVvtszjWvlpDj81AZn8HxgU4+MQoBKGd18FY
=YWdh
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Additional cap handling fixes from Xiubo to avoid "client isn't
responding to mclientcaps(revoke)" stalls on the MDS side"
* tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client:
ceph: add ceph_cap_unlink_work to fire check_caps() immediately
ceph: always queue a writeback when revoking the Fb caps
- Fix direct write error handling to avoid a race between failed IO
completion and the submission path itself which can result in an
invalid file size exposed to the user after the failed IO.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZc9S/AAKCRDdoc3SxdoY
dgilAQDhQeRxzZLXO5lh5LGeqveo88kXuQclCK9VeqnCr0cnHQD/RTXvo464Vf4c
DAuDtLxRA16sj8WlLkUVkvjMKdjYaQ8=
=n1Tp
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
- Fix direct write error handling to avoid a race between failed IO
completion and the submission path itself which can result in an
invalid file size exposed to the user after the failed IO.
* tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Improve error handling
The conversion to netfs in the 6.3 kernel caused a regression when
maximum write size is set by the server to an unexpected value which is
not a multiple of 4096 (similarly if the user overrides the maximum
write size by setting mount parm "wsize", but sets it to a value that
is not a multiple of 4096). When negotiated write size is not a
multiple of 4096 the netfs code can skip the end of the final
page when doing large sequential writes, causing data corruption.
This section of code is being rewritten/removed due to a large
netfs change, but until that point (ie for the 6.3 kernel until now)
we can not support non-standard maximum write sizes.
Add a warning if a user specifies a wsize on mount that is not
a multiple of 4096 (and round down), also add a change where we
round down the maximum write size if the server negotiates a value
that is not a multiple of 4096 (we also have to check to make sure that
we do not round it down to zero).
Reported-by: R. Diez" <rdiez-2006@rd10.de>
Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Suggested-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Acked-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # v6.3+
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:
1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).
2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.
Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXMewsACgkQxWXV+ddt
WDtFUBAAkEU/hxB4YsLn2JEdp3wc80w5/qKkPaYHsI2ncvc3RFiG+tqSY7BakMgE
Kkdl8ouNX3p/S62ykIBQTKZnOTk7FgKlClAQtgKn1afexqABsP2mifnh40Dzf7eA
VvEl7chnRT6oeivtQkB+BtgOzaOUp4j/8oAivRN8NKNwTxojV4g9PErKSOWfVQSq
3zlrLJbe6era43SpnexkjZHn4Fy4CN+C7FMm+pT/yKzZi2oBZs9BvNZGhIkdnzcK
MftrY9dSGO3CDD2Kvrz3lEm7ZB83wCpm+GTDN7iJx2y+yeW+aHjshFkJr1ApEZQa
lsWTnj3hk3yHoOPUuLlchw5JcFb/dFZ1Ztdwkunf8nmt5a3O/5Zf+Csgze8c+Iii
MJQKi0B/bNQ7cSEwRt36s75kROBItZmHCZmSBlOpT1LXSDQMJ9lvEnv/fPQdcHHF
WMEmk5O5IoGYv5kx5wIoWv27HKE/bDwH6RjkxEd/n17XP+PcfHY4K0o0CGtfwS8g
hdy9RI9X8dbf3ZPrxtsgQ2T8btWs68A4S6nwcSuY5HK0WNmvRh47eLfCI6S6XGJs
hHkppLcc+WTXOskCA+ABdm9hgeAPZkCSpuQSmC2HBt8gRv8XqO7z4cZ/up2N+tES
ZOJSrJb97nusOcxY0pLexnD6eI3pQxzGMiPONlC1Re8CdjZ0l+4=
=RRGT
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regular fixes and one fix for space reservation regression since
6.7 that users have been reporting:
- fix over-reservation of metadata chunks due to not keeping proper
balance between global block reserve and delayed refs reserve; in
practice this leaves behind empty metadata block groups, the
workaround is to reclaim them by using the '-musage=1' balance
filter
- other space reservation fixes:
- do not delete unused block group if it may be used soon
- do not reserve space for checksums for NOCOW files
- fix extent map assertion failure when writing out free space inode
- reject encoded write if inode has nodatasum flag set
- fix chunk map leak when loading block group zone info"
* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't refill whole delayed refs block reserve when starting transaction
btrfs: zoned: fix chunk map leak when loading block group zone info
btrfs: reject encoded write if inode has nodatasum flag set
btrfs: don't reserve space for checksums when writing to nocow files
btrfs: add new unused block groups to the list of unused block groups
btrfs: do not delete unused block group if it may be used soon
btrfs: add and use helper to check if block group is used
btrfs: don't drop extent_map for free space inode on write error
Convert the zonefs filesystem to use the new mount API.
Tested using the zonefs test suite from:
https://github.com/damien-lemoal/zonefs-tools
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
When also downgrading, check_version_upgrade() could pick a new version
greater than the latest supported version.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This prevents going emergency read only when the user has specified
replicas_required > replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since commit 28270e25c6 ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.
This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.
So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.
CC: stable@vger.kernel.org # 6.7+
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Heddxh <g311571057@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_load_block_group_zone_info() we never drop a reference on the
chunk map we have looked up, therefore leaking a reference on it. So
add the missing btrfs_free_chunk_map() at the end of the function.
Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.
For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
root 256 inode 257 errors 1040, bad file extent, some csum missing
root 256 inode 258 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
(...)
So reject encoded writes if the target inode has NODATASUM set.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when doing a write to a file we always reserve metadata space
for inserting data checksums. However we don't need to do it if we have
a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums
are disabled (-o nodatasum mount option), as in that case we are only
adding unnecessary pressure to metadata reservations.
For example on x86_64, with the default node size of 16K, a 4K buffered
write into a nodatacow file is reserving 655360 bytes of metadata space,
as it's accounting for checksums. After this change, which stops reserving
space for checksums if we have a nodatacow file or checksums are disabled,
we only need to reserve 393216 bytes of metadata.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When unlinking a file the check caps could be delayed for more than
5 seconds, but in MDS side it maybe waiting for the clients to
release caps.
This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In case there is 'Fw' dirty caps and 'CHECK_CAPS_FLUSH' is set we
will always ignore queue a writeback. Queue a writeback is very
important because it will block kclient flushing the snapcaps to
MDS and which will block MDS waiting for revoking the 'Fb' caps.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Convert path separator to CIFS_DIR_SEP(cifs_sb) from symlink target
before sending it over the wire otherwise the created SMB symlink may
become innaccesible from server side.
Fixes: 514d793e27 ("smb: client: allow creating symlinks via reparse points")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When uid, gid and cruid are not specified, we need to dynamically
set them into the filesystem context used for automounting otherwise
they'll end up reusing the values from the parent mount.
Fixes: 9fd29a5bae ("cifs: use fs_context for automounts")
Reported-by: Shane Nehring <snehring@iastate.edu>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2259257
Cc: stable@vger.kernel.org # 6.2+
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use
memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can
drop the gfp_mask parameter from blkdev_zone_mgmt() as well as
blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guard the calls to blkdev_zone_mgmt() with a memalloc_nofs scope.
This helps us getting rid of the GFP_NOFS argument to blkdev_zone_mgmt();
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-4-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a memalloc_nofs scope around all calls to blkdev_zone_mgmt(). This
allows us to further get rid of the GFP_NOFS argument for
blkdev_zone_mgmt().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-3-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in
zonefs_zone_mgmt().
As as zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called
from a place that can recurse back into the filesystem on memory reclaim,
it is save to call blkdev_zone_mgmt() with GFP_KERNEL.
Link: https://lore.kernel.org/all/ZZcgXI46AinlcBDP@casper.infradead.org/
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-1-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZcoMdAAKCRCRxhvAZXjc
ogy4AQDVp4huR6BBnRMhOCZbIsmkuHmq6ynpIZNTTAM0DdMn5AEAlJ03aEIaG9WS
RQMdaYajeVpZfR/vIUg8UdVkHQxOEgw=
=akNF
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix performance regression introduced by moving the security
permission hook out of do_clone_file_range() and into its caller
vfs_clone_file_range().
This causes the security hook to be called in situation were it
wasn't called before as the fast permission checks were left in
do_clone_file_range().
Fix this by merging the two implementations back together and
restoring the old ordering: fast permission checks first, expensive
ones later.
- Tweak mount_setattr() permission checking so that mount properties on
the real rootfs can be changed.
When we added mount_setattr() we added additional checks compared to
legacy mount(2). If the mount had a parent then verify that the
caller and the mount namespace the mount is attached to match and if
not make sure that it's an anonymous mount.
But the real rootfs falls into neither category. It is neither an
anoymous mount because it is obviously attached to the initial mount
namespace but it also obviously doesn't have a parent mount. So that
means legacy mount(2) allows changing mount properties on the real
rootfs but mount_setattr(2) blocks this. This causes regressions (See
the commit for details).
Fix this by relaxing the check. If the mount has a parent or if it
isn't a detached mount, verify that the mount namespaces of the
caller and the mount are the same. Technically, we could probably
write this even simpler and check that the mount namespaces match if
it isn't a detached mount. But the slightly longer check makes it
clearer what conditions one needs to think about.
* tag 'vfs-6.8-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: relax mount_setattr() permission checks
remap_range: merge do_clone_file_range() into vfs_clone_file_range()
Add a new ioctl for getting the sysfs name of a filesystem - the path
under /sys/fs.
This is going to let us standardize exporting data from sysfs across
filesystems, e.g. time stats.
The returned path will always be of the form "$FSTYP/$SYSFS_IDENTIFIER",
where the sysfs identifier may be a UUID (for bcachefs) or a device name
(xfs).
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-6-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use a more descriptive explanation of the @opf function parameter,
more in line with <linux/blk_types.h>.
Fixes: 02105f18a2 ("fs/hfsplus: wrapper.c: fix kernel-doc warnings")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240210050606.9182-1-rdunlap@infradead.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
File open requests made to the server contain a
CreateGuid, which is used by the server to identify
the open request. If the same request needs to be
replayed, it needs to be sent with the same CreateGuid
in the durable handle v2 context.
Without doing so, we could end up leaking handles on
the server when:
1. multichannel is used AND
2. connection goes down, but not for all channels
This is because the replayed open request would have a
new CreateGuid and the server will treat this as a new
request and open a new handle.
This change fixes this by reusing the existing create_guid
stored in the cached fid struct.
REF: MS-SMB2 4.9 Replay Create Request on an Alternate Channel
Fixes: 4f1fffa237 ("cifs: commands that are retried should have replay flag set")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In this loop, we step through the buffer and after each item we check
if the size_left is greater than the minimum size we need. However,
the problem is that "bytes_left" is type ssize_t while sizeof() is type
size_t. That means that because of type promotion, the comparison is
done as an unsigned and if we have negative bytes left the loop
continues instead of ending.
Fixes: fe856be475 ("CIFS: parse and store info on iface queries")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
issues or aren't considered to be needed in earlier kernel versions.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZcfLvgAKCRDdBJ7gKXxA
joCTAP4/XdBXA7Sj3GyjSAkYjg2U0quwX9oRhsx2Qy9duPDaLAD+NRl9XG14YSOB
f/7OiTQoDfnwVgHAOVBHY/ylrcgZRQg=
=2wdS
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions"
* tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
nilfs2: fix potential bug in end_buffer_async_write
mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup
nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
MAINTAINERS: Leo Yan has moved
mm/zswap: don't return LRU_SKIP if we have dropped lru lock
fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
mailmap: switch email address for John Moon
mm: zswap: fix objcg use-after-free in entry destruction
mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()
arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros
mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
mm: memcg: optimize parent iteration in memcg_rstat_updated()
nilfs2: fix data corruption in dsync block recovery for small block sizes
mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get()
exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats
fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand()
getrusage: use sig->stats_lock rather than lock_task_sighand()
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
...
During xfstest tests, there are some kmemleak reports e.g. generic/051 with
if USE_KMEMLEAK=yes:
====================================================================
EXPERIMENTAL kmemleak reported some memory leaks! Due to the way kmemleak
works, the leak might be from an earlier test, or something totally unrelated.
unreferenced object 0xffff9ef905aaf778 (size 8):
comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
hex dump (first 8 bytes):
a5 cc cc cc cc cc cc cc ........
backtrace:
[<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
[<ffffffff87f49b66>] kmalloc_trace+0x26/0xb0
[<ffffffffc0a3fefe>] __bch2_read_super+0xfe/0x4e0 [bcachefs]
[<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
[<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
[<ffffffff88080c90>] legacy_get_tree+0x30/0x60
[<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
[<ffffffff88061fe5>] path_mount+0x475/0xb60
[<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
[<ffffffff88932642>] do_syscall_64+0x42/0xf0
[<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
unreferenced object 0xffff9ef96cdc4fc0 (size 32):
comm "mount.bcachefs", pid 169844, jiffies 4295281209 (age 87.040s)
hex dump (first 32 bytes):
2f 64 65 76 2f 6d 61 70 70 65 72 2f 74 65 73 74 /dev/mapper/test
2d 31 00 cc cc cc cc cc cc cc cc cc cc cc cc cc -1..............
backtrace:
[<ffffffff87fd9a43>] __kmem_cache_alloc_node+0x1f3/0x2c0
[<ffffffff87f4a081>] __kmalloc_node_track_caller+0x51/0x150
[<ffffffff87f3adc2>] kstrdup+0x32/0x60
[<ffffffffc0a3ff1a>] __bch2_read_super+0x11a/0x4e0 [bcachefs]
[<ffffffffc0a3ad22>] bch2_fs_open+0x262/0x1710 [bcachefs]
[<ffffffffc09c9e24>] bch2_mount+0x4c4/0x640 [bcachefs]
[<ffffffff88080c90>] legacy_get_tree+0x30/0x60
[<ffffffff8802c748>] vfs_get_tree+0x28/0xf0
[<ffffffff88061fe5>] path_mount+0x475/0xb60
[<ffffffff880627e5>] __x64_sys_mount+0x105/0x140
[<ffffffff88932642>] do_syscall_64+0x42/0xf0
[<ffffffff88a000e6>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
====================================================================
The leak happens if bdev_open_by_path() failed to open a block device
then it goes label 'out' directly without call of bch2_free_super().
Fix it by going to label 'err' instead of 'out' if bdev_open_by_path()
fails.
Signed-off-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXGtzoACgkQiiy9cAdy
T1G9UQv/b1rOI+u7Cr5RDnO0O4sbL7bJ7pfJHEK0KKpat0BFtsrGZFRwDsuSDmkc
BMIdeENnM1aoGjGEzvyGJmzUEZUcusy2zFdLBDiW1zPBb5D5HLRmr7fN02ZwPwj9
5vnuvM5/Iql/dSMBjDcm7M5NuiVlp9+SmN27OqXbfc0e6xHxnzhwu6A3x3Ryaz/J
0LzxNt++UUkZfK6FrePDdRyWlvBHsMy4RfTmjIO432bhNjsx90YHrPtKj2ph4xi5
/92QuLJvSaYyj1IrZIV6v0UBJBKtnoGek8UJ7k3Mz/BkHBXvvZTR0MYL/tKW80eK
Bfck2qcRVauLPseGRnn5GTkvF+itTb5RXksXzVSomveAzQ7TAle/qx7EL93QKCLC
vPJLAXK00T0JvE0zyVxGuPWvl9iWBUwbR4uwUL4XNnJksIXsTYci7TZ0ELyAA7hJ
bdn/4DyRTS5KXC60JwE9hcGXpjYstD6w8Jz+UseADsS+qE3zuX0UwnynNCQc0zjy
iTboUnA1
=exTN
-----END PGP SIGNATURE-----
Merge tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Two ksmbd server fixes:
- memory leak fix
- a minor kernel-doc fix"
* tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: free aux buffer if ksmbd_iov_pin_rsp_read fails
ksmbd: Add kernel-doc for ksmbd_extract_sharename() function
This reverts commit 5785160732.
Unfortunately, while we only call that thing once, the callback
*can* be called more than once for the same dentry - all it
takes is rename_lock being touched while we are in d_walk().
For now let's revert it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmXGuQsACgkQiiy9cAdy
T1FJHgwAgvfgXuEjLXzjcEUg7bZ7z76BDC4Qq0NJImkJ6GUm9eQssCF8xbOZlfb/
bxUiZATGGbRso3cLKIcZtgBOkyKx8v9ZTVbjxpYJQqqliAvYNHXfi3TlMsjDwGlM
Nd4kVboZmHJSMirR3O915Il4iOt36/RygDiLHqqE6jG+BM74I3fpOI+wtphUIEdG
KHRczjbTlmKoZDH99Np/CYGYKiQOcFLOj7FetiYBW3AS1H2qSol5PDO0vOgvSDFq
3QOIRN1Km5tRogHx/hgr993DYamvDHTI3GbSEwDT45zP1m13AHLFf7tPrPW9vNwQ
1LIRqTFp7UYJElHGUZMYkPwY9ryqU1GHNekiV/JqUJyzvZ6wLC0mHJnY4Sh+xpdP
wUyTCdZITZ6KPw+bQnITISkQvB3Z8lp6xX4OQZYusICLzAqfsGVI8qM2AMWM5Ywv
9BEQb1cY6iolcQZ1nYl1/yJTVy9KZHK+N/f/8GCrF/7MDEJn2CKyIUT+Zz8f1K+F
lS6Dm8XU
=jOZH
-----END PGP SIGNATURE-----
Merge tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- reconnect fix
- multichannel channel selection fix
- minor mount warning fix
- reparse point fix
- null pointer check improvement
* tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: clarify mount warning
cifs: handle cases where multiple sessions share connection
cifs: change tcon status when need_reconnect is set on it
smb: client: set correct d_type for reparse points under DFS mounts
smb3: add missing null server pointer check
files) and two cap handling fixes from Xiubo and Rishabh.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmXGRJATHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi+/0B/4pEweAm2W0UUaaS59DecNySBFobwed
m7bBDBGIAQ/I3duN46a13FzsGNclho967TeB0ig1jrQxnoo3HEMiXpZz5xfG9spe
fyvrIk3R8cSqgd7YsyITnUjGGd2UBvZVrbWOCbWrKofSoflS6IjcGDQF7ZrgEsff
0KkMaWHvO6poIU2mAToV//UkWUk6RrtAUNlSdjLpizXnUrrAQ+vUA3OU9SSp6Klf
xmFaIiAiVZC6M8qFpXJtnIf8Ba7PrpW5InAXgCOkxDKciE9fLaPsIu0B3H9lUVKZ
TJwjEJ0nB+akh0tRO5bZKyM8j0D3lhgxphJwNtUoYjQsV3m7LcGQV+Il
=u953
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Some fscrypt-related fixups (sparse reads are used only for encrypted
files) and two cap handling fixes from Xiubo and Rishabh"
* tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client:
ceph: always check dir caps asynchronously
ceph: prevent use-after-free in encode_cap_msg()
ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT
libceph: just wait for more data to be available on the socket
libceph: rename read_sparse_msg_*() to read_partial_sparse_msg_*()
libceph: fail sparse-read if the data length doesn't match
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmXGIMMACgkQqbAzH4Mk
B7b07xAAw/VcHqPdhcVg2SttGy1D0rpjXvSK9Na3pulD83M3AptvjqXToP1xZHUP
fpZ32rAWoeaJieTbS+xJUEyRj+VGN9iEgUMBtoaIYIv9ozmC7IU0xyDvJCPU07F1
X2n+IXxpBsU/Y1rQiJIijzum+BQYgXgsifdwkZU50QQjQWWcFdMU9VPjN2Saw/Sx
8gd1rzKVKIclErpReDyuZpTqDweM4BxiuwKhLodzlMtfO2MEqXxwFXnLQDX2xJLh
zJPMepM/3mzHSBjKrHQ+xZHZCDuP393UUJK+sd9PaETR8xHR4ew9yqSu1Ajg/o64
xoQ8rkpT9g6AZS+JNKtKN52rw5rn4ZCi/VZ61HgqiLGTxOkVnHpynmDr3IKzfgn+
j7ZD33HteBJLxnR3YTi7fJA8DF9d0vHUv+HtH351WVibJn9DrzWzIkp6uDdaVfoa
YvVE+ODynLVvpDKVTm4QOmIRnVMFDZwNo7C2sURy6nqQYf+ufYYRbe5btrvhSZ8k
uazJLhLSLFCHiT6WlbmykntTo15sub/yIF5juVRcWthi2jWj0qII549jtSkZquQR
pEVcitMTrr6RqwEB/B5nsz2azQ4m/+JgO0se1kWvxa+6erVV0wCdB7STW77zbRmx
m8Xyr8Pf+ZxM+IhP4cpSxgcc5olhvUjcrkNBtilQc0vLqf535k0=
=4Gkn
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:
"Fixed:
- size update for compressed file
- some logic errors, overflows
- memory leak
- some code was refactored
Added:
- implement super_operations::shutdown
Improved:
- alternative boot processing
- reduced stack usage"
* tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3: (28 commits)
fs/ntfs3: Slightly simplify ntfs_inode_printk()
fs/ntfs3: Add ioctl operation for directories (FITRIM)
fs/ntfs3: Fix oob in ntfs_listxattr
fs/ntfs3: Fix an NULL dereference bug
fs/ntfs3: Update inode->i_size after success write into compressed file
fs/ntfs3: Fixed overflow check in mi_enum_attr()
fs/ntfs3: Correct function is_rst_area_valid
fs/ntfs3: Use i_size_read and i_size_write
fs/ntfs3: Prevent generic message "attempt to access beyond end of device"
fs/ntfs3: use non-movable memory for ntfs3 MFT buffer cache
fs/ntfs3: Use kvfree to free memory allocated by kvmalloc
fs/ntfs3: Disable ATTR_LIST_ENTRY size check
fs/ntfs3: Fix c/mtime typo
fs/ntfs3: Add NULL ptr dereference checking at the end of attr_allocate_frame()
fs/ntfs3: Add and fix comments
fs/ntfs3: ntfs3_forced_shutdown use int instead of bool
fs/ntfs3: Implement super_operations::shutdown
fs/ntfs3: Drop suid and sgid bits as a part of fpunch
fs/ntfs3: Add file_modified
fs/ntfs3: Correct use bh_read
...
When a user tries to use the "sec=krb5p" mount parameter to encrypt
data on connection to a server (when authenticating with Kerberos), we
indicate that it is not supported, but do not note the equivalent
recommended mount parameter ("sec=krb5,seal") which turns on encryption
for that mount (and uses Kerberos for auth). Update the warning message.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Based on our implementation of multichannel, it is entirely
possible that a server struct may not be found in any channel
of an SMB session.
In such cases, we should be prepared to move on and search for
the server struct in the next session.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a tcon is marked for need_reconnect, the intention
is to have it reconnected.
This change adjusts tcon->status in cifs_tree_connect
when need_reconnect is set. Also, this change has a minor
correction in resetting need_reconnect on success. It makes
sure that it is done with tc_lock held.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Space reservations for metadata are, most of the time, pessimistic as we
reserve space for worst possible cases - where tree heights are at the
maximum possible height (8), we need to COW every extent buffer in a tree
path, need to split extent buffers, etc.
For data, we generally reserve the exact amount of space we are going to
allocate. The exception here is when using compression, in which case we
reserve space matching the uncompressed size, as the compression only
happens at writeback time and in the worst possible case we need that
amount of space in case the data is not compressible.
This means that when there's not available space in the corresponding
space_info object, we may need to allocate a new block group, and then
that block group might not be used after all. In this case the block
group is never added to the list of unused block groups and ends up
never being deleted - except if we unmount and mount again the fs, as
when reading block groups from disk we add unused ones to the list of
unused block groups (fs_info->unused_bgs). Otherwise a block group is
only added to the list of unused block groups when we deallocate the
last extent from it, so if no extent is ever allocated, the block group
is kept around forever.
This also means that if we have a bunch of tasks reserving space in
parallel we can end up allocating many block groups that end up never
being used or kept around for too long without being used, which has
the potential to result in ENOSPC failures in case for example we over
allocate too many metadata block groups and then end up in a state
without enough unallocated space to allocate a new data block group.
This is more likely to happen with metadata reservations as of kernel
6.7, namely since commit 28270e25c6 ("btrfs: always reserve space for
delayed refs when starting transaction"), because we started to always
reserve space for delayed references when starting a transaction handle
for a non-zero number of items, and also to try to reserve space to fill
the gap between the delayed block reserve's reserved space and its size.
So to avoid this, when finishing the creation a new block group, add the
block group to the list of unused block groups if it's still unused at
that time. This way the next time the cleaner kthread runs, it will delete
the block group if it's still unused and not needed to satisfy existing
space reservations.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper function to determine if a block group is being used and make
use of it at btrfs_delete_unused_bgs(). This helper will also be used in
future code changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.
assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G W 6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
<TASK>
extent_write_cache_pages+0x2ac/0x8f0
extent_writepages+0x87/0x110
do_writepages+0xd5/0x1f0
filemap_fdatawrite_wbc+0x63/0x90
__filemap_fdatawrite_range+0x5c/0x80
btrfs_fdatawrite_range+0x1f/0x50
btrfs_write_out_cache+0x507/0x560
btrfs_write_dirty_block_groups+0x32a/0x420
commit_cowonly_roots+0x21b/0x290
btrfs_commit_transaction+0x813/0x1360
btrfs_sync_file+0x51a/0x640
__x64_sys_fdatasync+0x52/0x90
do_syscall_64+0x9c/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again. However on
the second pass through we go to call btrfs_get_extent() on the inode to
get the extent mapping. Because this is a new block group, and with the
free space inode we always search the commit root to avoid deadlocking
with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
requested range.
This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping. This is
normal for normal files, but the free space cache inode is special. We
always expect the extent map to be correct. Thus the second time
through we end up with a bogus extent map.
Since we're deprecating this feature, the most straightforward way to
fix this is to simply skip dropping the extent map range for this failed
range.
I shortened the test by using error injection to stress the area to make
it easier to reproduce. With this patch in place we no longer panic
with my error injection test.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
do_sys_ftruncate receives a file descriptor, fgets the struct file, and
finally actually truncates the file.
do_ftruncate allows for passing in a file directly, with the caller
already holding a reference to it.
Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new generic ioctls for querying the filesystem UUID.
These are lifted versions of the ext4 ioctls, with one change: we're not
using a flexible array member, because UUIDs will never be more than 16
bytes.
This patch adds a generic implementation of FS_IOC_GETFSUUID, which
reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
that can be done on offline filesystems by the people who need it,
trying to do it online is just asking for too much trouble.
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-4-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
We don't want to be settingc sb->s_uuid directly anymore, as there's a
length field that also has to be set, and this conversion was not
completely trivial.
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-3-kent.overstreet@linux.dev
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.
And add a helper super_set_uuid(), for setting nonstandard length uuids.
Helper is now required for the new FS_IOC_GETUUID ioctl; if
super_set_uuid() hasn't been called, the ioctl won't be supported.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
Send query dir requests with an info level of
SMB_FIND_FILE_FULL_DIRECTORY_INFO rather than
SMB_FIND_FILE_DIRECTORY_INFO when the client is generating its own
inode numbers (e.g. noserverino) so that reparse tags still
can be parsed directly from the responses, but server won't
send UniqueId (server inode number)
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Address static checker warning in cifs_ses_get_chan_index():
warn: variable dereferenced before check 'server'
To be consistent, and reduce risk, we should add another check
for null server pointer.
Fixes: 88675b22d3 ("cifs: do not search for channel if server is terminating")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently, it is possible to create an idmapped mount using a user
namespace without any mappings. However, this yields an idmapped
mount that doesn't actually map the ids. With the following change,
it will no longer be possible to create an idmapped mount when using
a user namespace with no mappings, and will instead return EINVAL,
an “invalid argument” error code.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Taylor Jackson <taylor.a.jackson@me.com>
Link: https://lore.kernel.org/r/20240208-mnt-idmap-inval-v2-1-58ef26d194e0@me.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since eventfd's document has clearly stated: A write(2) call adds
the 8-byte integer value supplied in its buffer to the counter.
However, in the current implementation, the following code snippet
did not cause an error:
char str[16] = "hello world";
uint64_t value;
ssize_t size;
int fd;
fd = eventfd(0, 0);
size = write(fd, &str, strlen(str));
printf("eventfd: test writing a string, size=%ld\n", size);
size = read(fd, &value, sizeof(value));
printf("eventfd: test reading as uint64, size=%ld, valus=0x%lX\n",
size, value);
close(fd);
And its output is:
eventfd: test writing a string, size=8
eventfd: test reading as uint64, size=8, valus=0x6F77206F6C6C6568
By checking whether count is equal to sizeof(ucnt), such errors
could be detected. It also follows the requirements of the manual.
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Link: https://lore.kernel.org/r/tencent_10AAA44731FFFA493F9F5501521F07DD4D0A@qq.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Eric Biggers <ebiggers@google.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
According to a syzbot report, end_buffer_async_write(), which handles the
completion of block device writes, may detect abnormal condition of the
buffer async_write flag and cause a BUG_ON failure when using nilfs2.
Nilfs2 itself does not use end_buffer_async_write(). But, the async_write
flag is now used as a marker by commit 7f42ec3941 ("nilfs2: fix issue
with race condition of competition between segments for dirty blocks") as
a means of resolving double list insertion of dirty blocks in
nilfs_lookup_dirty_data_buffers() and nilfs_lookup_node_buffers() and the
resulting crash.
This modification is safe as long as it is used for file data and b-tree
node blocks where the page caches are independent. However, it was
irrelevant and redundant to also introduce async_write for segment summary
and super root blocks that share buffers with the backing device. This
led to the possibility that the BUG_ON check in end_buffer_async_write
would fail as described above, if independent writebacks of the backing
device occurred in parallel.
The use of async_write for segment summary buffers has already been
removed in a previous change.
Fix this issue by removing the manipulation of the async_write flag for
the remaining super root block buffer.
Link: https://lkml.kernel.org/r/20240203161645.4992-1-konishi.ryusuke@gmail.com
Fixes: 7f42ec3941 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+5c04210f7c7f897c1e7f@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000019a97c05fd42f8c8@google.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported a hang issue in migrate_pages_batch() called by mbind()
and nilfs_lookup_dirty_data_buffers() called in the log writer of nilfs2.
While migrate_pages_batch() locks a folio and waits for the writeback to
complete, the log writer thread that should bring the writeback to
completion picks up the folio being written back in
nilfs_lookup_dirty_data_buffers() that it calls for subsequent log
creation and was trying to lock the folio. Thus causing a deadlock.
In the first place, it is unexpected that folios/pages in the middle of
writeback will be updated and become dirty. Nilfs2 adds a checksum to
verify the validity of the log being written and uses it for recovery at
mount, so data changes during writeback are suppressed. Since this is
broken, an unclean shutdown could potentially cause recovery to fail.
Investigation revealed that the root cause is that the wait for writeback
completion in nilfs_page_mkwrite() is conditional, and if the backing
device does not require stable writes, data may be modified without
waiting.
Fix these issues by making nilfs_page_mkwrite() wait for writeback to
finish regardless of the stable write requirement of the backing device.
Link: https://lkml.kernel.org/r/20240131145657.4209-1-konishi.ryusuke@gmail.com
Fixes: 1d1d1a7672 ("mm: only enforce stable page writes if the backing device requires it")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+ee2ae68da3b22d04cd8d@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000047d819061004ad6c@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The helper function nilfs_recovery_copy_block() of
nilfs_recovery_dsync_blocks(), which recovers data from logs created by
data sync writes during a mount after an unclean shutdown, incorrectly
calculates the on-page offset when copying repair data to the file's page
cache. In environments where the block size is smaller than the page
size, this flaw can cause data corruption and leak uninitialized memory
bytes during the recovery process.
Fix these issues by correcting this byte offset calculation on the page.
Link: https://lkml.kernel.org/r/20240124121936.10575-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
do_task_stat() at the same time and the process has NR_THREADS, it will
spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change do_task_stat() to use sig->stats_lock to gather the statistics
outside of ->siglock protected section, in the likely case this code will
run lockless.
Link: https://lkml.kernel.org/r/20240123153357.GA21857@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fs/proc: do_task_stat: use sig->stats_".
do_task_stat() has the same problem as getrusage() had before "getrusage:
use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If
NR_CPUS threads call lock_task_sighand() at the same time and the process
has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS *
NR_THREADS) time.
This patch (of 3):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
Not only this removes for_each_thread() from the critical section with
irqs disabled, this removes another case when stats_lock is taken with
siglock held. We want to remove this dependency, then we can change the
users of stats_lock to not disable irqs.
Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com
Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in
shmget() call. If SHM_NORESERVE flags is specified then the hugetlb pages
are not reserved. However when the shared memory is attached with the
shmat() call the hugetlb pages are getting reserved incorrectly for
SHM_HUGETLB shared memory created with SHM_NORESERVE which is a bug.
-------------------------------
Following test shows the issue.
$cat shmhtb.c
int main()
{
int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE;
int shmid;
shmid = shmget(SKEY, SHMSZ, shmflags);
if (shmid < 0)
{
printf("shmat: shmget() failed, %d\n", errno);
return 1;
}
printf("After shmget()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmat(shmid, NULL, 0);
printf("\nAfter shmat()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmctl(shmid, IPC_RMID, NULL);
return 0;
}
#sysctl -w vm.nr_hugepages=20
#./shmhtb
After shmget()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 0
HugePages_Surp: 0
After shmat()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 5 <--
HugePages_Surp: 0
--------------------------------
Fix is to ensure that hugetlb pages are not reserved for SHM_HUGETLB shared
memory in the shmat() call.
Link: https://lkml.kernel.org/r/1706040282-12388-1-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ksmbd_iov_pin_rsp_read() doesn't free the provided aux buffer if it
fails. Seems to be the caller's responsibility to clear the buffer in
error case.
Found by Linux Verification Center (linuxtesting.org).
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The ksmbd_extract_sharename() function lacked a complete kernel-doc
comment. This patch adds parameter descriptions and detailed function
behavior to improve code readability and maintainability.
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When we added mount_setattr() I added additional checks compared to the
legacy do_reconfigure_mnt() and do_change_type() helpers used by regular
mount(2). If that mount had a parent then verify that the caller and the
mount namespace the mount is attached to match and if not make sure that
it's an anonymous mount.
The real rootfs falls into neither category. It is neither an anoymous
mount because it is obviously attached to the initial mount namespace
but it also obviously doesn't have a parent mount. So that means legacy
mount(2) allows changing mount properties on the real rootfs but
mount_setattr(2) blocks this. I never thought much about this but of
course someone on this planet of earth changes properties on the real
rootfs as can be seen in [1].
Since util-linux finally switched to the new mount api in 2.39 not so
long ago it also relies on mount_setattr() and that surfaced this issue
when Fedora 39 finally switched to it. Fix this.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2256843
Link: https://lore.kernel.org/r/20240206-vfs-mount-rootfs-v1-1-19b335eee133@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Karel Zak <kzak@redhat.com>
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
- Address a deadlock regression in RELEASE_LOCKOWNER
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXDkGoACgkQM2qzM29m
f5cD1xAAsqKTmrb2ABdFZadYfVl6lKxtNWEp8S9eRS6eBU6bcNr5sxoTi+eflHuB
a58TfUwj8ffVtmd0qGaWI8RpDuAYtxhJU6l3TQZCLheMlKvsP3u6lZDjk8CeYtIK
9bVDZV7MOh9C/p01p/21P2B3OukgzzF8Lz/AFPCxCtoK5nnaDT5F+8pX8yfZU5x0
bM1gBHzB80BoCdzlimN6QB8EfFkOgIF9Apnk53E676KmkOuADV+CIp4aQMsm8l3r
6lJAdsOCuw5BgSDJWh1vGmFKubfhP841QbglDnnZcc+WTBhvEO0PR8Kv0Ugn5Xek
8NkcnOlti+gjH0EKTHx5P4frV0BcWptLjCjVdruOvBszrZtEaX547Mp/d04GGO0U
8gUEhen/RT7l1E2RM4qII9q5nuaekjI2Da2FGZNmK/j66OPFibiOBMRn3u/haTr5
2axrrO9NOnfWcTQX08iKZygEUY7E4h3iImqkNw+c+avZ6SRiH548TRCrKdlBeuia
Pj3QFRfu/9PQnnl9qnROXtAP70AKmX1iSLZ4s9+KqTzQVCW7tLbZpOn9D1hHCJ1q
0JTVql5rSXCl5YpBsDiqr7qvZhVab/2+1X9wHkLPZd5aIBoUmOQOMebZCUxpu+TR
houx3NBGYNixk0khEkHH3+c0HydvnxVU3xClUgLxjHgEdeS7R6s=
=z3oC
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Address a deadlock regression in RELEASE_LOCKOWNER
* tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: don't take fi_lock in nfsd_break_deleg_cb()
The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:
"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."
So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In fs/ceph/caps.c, in encode_cap_msg(), "use after free" error was
caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This
implies before the refcount could be increment here, it was freed.
In same file, in "handle_cap_grant()" refcount is decremented by this
line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race
occurred and resource was freed by the latter line before the former
line could increment it.
encode_cap_msg() is called by __send_cap() and __send_cap() is called by
ceph_check_caps() after calling __prep_cap(). __prep_cap() is where
arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where
the refcount must be increased to prevent "use after free" error.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The fscrypt code will use i_blkbits to setup ci_data_unit_bits when
allocating the new inode, but ceph will initiate i_blkbits ater when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework so initiating i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.
Link: https://tracker.ceph.com/issues/64035
Fixes: 5b11888471 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXDNuAACgkQxWXV+ddt
WDvBGg/9FuCJm/GkBxgeVKNxdF28fIzYkYHjSYzHSo5A5GFMNENHUDfXcSNjZjUM
ZFWHCXENcnNa7pKONPaW5QIIQuecBPqcXK+lPJXlqFlC22CGSVD7MZ7/Fm7uKJ5W
mhGGuq7NTuTN1MYm480WVa+5DkfVbFkPeZgWOVTQ0tXGxTEKU9pXvwmflx8rbmRG
VPhT0iZO/KmkRSp91BwAJxitw8v76WG9JpGemiFcNOISCdE/HENxrxj8rE6beZoc
g0Kx8YQDTlf119bdwlCdJkvRVEzjIEZIUE2g8J0oKzPE6CmY2a+8+Iv/S0nCCc6V
2nFVHdhLnUH5oIuFEoo026tvu3tMKR1K30EAQyFslsjPE74Hye7MAjr8sEvAF7E/
J4Sbn3NIILkKu1Ozn/RqhPh+XsSyU9tXeO1+BcdmrGY9vDGq18lVbruOrde14fqZ
xHFJloXKsJCw7AcMzNfsa6arRQ7YGa8sGudMLpriUemUUn0MK8OdY6zCq20p43ON
8eUigP3WHOdPfCJXNfgqlJdyjmYdHCWvn4wKpPDMQuU5rMyUloOJDqtR6fxVCatO
0Pjg0zVyLu/CF6+vrL6wP4qT9sRj1Jy2YEh8fFe4fWc9+JOmQZYBm/Eyaw4oU0rg
lOmqE1/TEgl0ra9IHvxcgJo5l7zx2dbHAgMEmScCgIwrLpkh14A=
=nMOd
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two fixes preventing deletion and manual creation of subvolume qgroup
- unify error code returned for unknown send flags
- fix assertion during subvolume creation when anonymous device could
be allocated by other thread (e.g. due to backref walk)
* tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not ASSERT() if the newly created subvolume already got read
btrfs: forbid deleting live subvol qgroup
btrfs: forbid creating subvol qgroups
btrfs: send: return EOPNOTSUPP on unknown flags
Now that the key quotas are handled immediately on key_put() instead of
being postponed to the key management garbage collection worker, a call
to keyring_clear() is all that is required in fscrypt_put_master_key()
so that the keyring clean-up is also done synchronously. This patch
should fix the fstest generic/581 flakiness.
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20240206101619.8083-1-lhenriques@suse.de
[ebiggers: added comment]
Signed-off-by: Eric Biggers <ebiggers@google.com>
commit dfad37051a ("remap_range: move permission hooks out of
do_clone_file_range()") moved the permission hooks from
do_clone_file_range() out to its caller vfs_clone_file_range(),
but left all the fast sanity checks in do_clone_file_range().
This makes the expensive security hooks be called in situations
that they would not have been called before (e.g. fs does not support
clone).
The only reason for the do_clone_file_range() helper was that overlayfs
did not use to be able to call vfs_clone_file_range() from copy up
context with sb_writers lock held. However, since commit c63e56a4a6
("ovl: do not open/llseek lower file with upper sb_writers held"),
overlayfs just uses an open coded version of vfs_clone_file_range().
Merge_clone_file_range() into vfs_clone_file_range(), restoring the
original order of checks as it was before the regressing commit and adapt
the overlayfs code to call vfs_clone_file_range() before the permission
hooks that were added by commit ca7ab48240 ("ovl: add permission hooks
outside of do_splice_direct()").
Note that in the merge of do_clone_file_range(), the file_start_write()
context was reduced to cover ->remap_file_range() without holding it
over the permission hooks, which was the reason for doing the regressing
commit in the first place.
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401312229.eddeb9a6-oliver.sang@intel.com
Fixes: dfad37051a ("remap_range: move permission hooks out of do_clone_file_range()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202102258.1582671-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The only remaining user of ->d_real() method is d_real_inode(), which
passed NULL inode argument to get the real data dentry.
There are no longer any users that call ->d_real() with a non-NULL
inode argument for getting a detry from a specific underlying layer.
Remove the inode argument of the method and replace it with an integer
'type' argument, to allow callers to request the real metadata dentry
instead of the real data dentry.
All the current users of d_real_inode() (e.g. uprobe) continue to get
the real data inode. Caller that need to get the real metadata inode
(e.g. IMA/EVM) can use d_inode(d_real(dentry, D_REAL_METADATA)).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202110132.1584111-3-amir73il@gmail.com
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f76 ("block: Remove
request.write_hint") and commit c75e707fe1 ("block: remove the
per-bio/request write hint").
This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:
/* size: 112, cachelines: 2, members: 20 */
/* sum members: 110, holes: 1, sum holes: 2 */
/* last cacheline: 48 bytes */
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Write hints applied with F_SET_RW_HINT on a block device affect the
block device inode only. Propagate these hints to the inode associated
with struct block_device because that is the inode used when writing
back dirty pages.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-6-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move enum rw_hint into a new header file to prepare for using this data
type in the block layer. Add the attribute __packed to reduce the space
occupied by instances of this data type from four bytes to one byte.
Change the data type of i_write_hint from u8 into enum rw_hint.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chao Yu <chao@kernel.org> # for the F2FS part
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-5-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Split fcntl_rw_hint() such that there is one helper function per fcntl.
Use READ_ONCE() and WRITE_ONCE() to access the i_write_hint member
instead of protecting such accesses with the inode lock. READ_ONCE() is
not used in I/O path code that reads i_write_hint. Users who want
F_SET_RW_HINT to affect I/O need to make sure that F_SET_RW_HINT has
completed before I/O is submitted that should use the configured write
hint.
Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-4-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The code in fs/fcntl.c converts RWH_* constants to and from WRITE_LIFE_*
constants using casts. Verify at compile time that these casts will yield
the intended effect.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-3-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reject values that are valid rw_hints after truncation but not before
truncation by passing an untruncated value to rw_hint_valid().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 5657cb0797 ("fs/fcntl: use copy_to/from_user() for u64 types")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit 443f1a0422 ("lift the calls of ep_send_events_proc()
into the callers"), ep_scan_ready_list() has been removed.
But there are still several in comments. All of them should
be replaced with other caller functions.
Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com>
Link: https://lore.kernel.org/r/20240206014353.4191262-1-huangxiaojia2@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Two serious ones here that we'll want to backport to stable: a fix for a
race in the thread_with_file code, and another locking fixup in the
subvolume deletion path.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXBeaYACgkQE6szbY3K
bnb4iA//dByHSjorgEDU/8EZwq6vKSJ5F+M4JqEYTYHP3IG9Fci6dLwCbx/SBFFn
NaYx1mPxSnVCjr1WJEikdUPFW9SAT5Qv/xqqbAKZZ8F6AW1tGM+kK9isV24tpbQx
RYpYZaYObmyHUgd2Fm3hOUltLVkRB2y2wI1SQKbq+31Zk9rYEITIdhe7mXOsD40f
Uo6KJGVATDC8gg6S3bG3ZNG7AUqOx8HErcsrErQCxJpwbbYWzpz+3QbVZNRaID04
AGRrasJcjk6RZ1LM9jasjN+i4mPOrZBS4gyXHg3YcH7+Ft4uGz//afVBU8AcnaX+
zMP12LwFRLZul1Q5lNVcI3t5lJLd2JlWiMssruZcY6tCFmyWBS8DOA5F1PJwy171
8pEcKkNhjgs/LMcpc/GPL09jRjBXYI6yM14nm1HbRE9oMm3HamyZJ0ThSQn5m63x
4wLpjlCfgBI96qPTXLUR4FQzqWe0TNI/pMkMLl0L4wcvqCQSxJ2UwTEjkaDF7hif
G98XPG8G2wgDqWKXD8u/su57E/X1vj893Y11QsMHcp0XwmYRiebsP+kR6sKU8m44
CggyC7CUEhfZM7XmmQFnhOXMMvF9f7BCS4MpvAUSuPzwYWRE8CpUpbBUjQFFGZ9w
h6er30CjOhjkc0JOT416mNvXusO3h9P4x/pyWZ33iQ1fS58ckXU=
=iour
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Two serious ones here that we'll want to backport to stable: a fix for
a race in the thread_with_file code, and another locking fixup in the
subvolume deletion path"
* tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs:
bcachefs: time_stats: Check for last_event == 0 when updating freq stats
bcachefs: install fd later to avoid race with close
bcachefs: unlock parent dir if entry is not found in subvolume deletion
bcachefs: Fix build on parisc by avoiding __multi3()
Static FDPIC executable may get an executable stack even when it has
non-executable GNU_STACK segment. This happens when STACK segment has rw
permissions, but does not specify stack size. In that case FDPIC loader
uses permissions of the interpreter's stack, and for static executables
with no interpreter it results in choosing the arch-default permissions
for the stack.
Fix that by using the interpreter's properties only when the interpreter
is actually used.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20240118150637.660461-1-jcmvbkbc@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
A recent change to check_for_locks() changed it to take ->flc_lock while
holding ->fi_lock. This creates a lock inversion (reported by lockdep)
because there is a case where ->fi_lock is taken while holding
->flc_lock.
->flc_lock is held across ->fl_lmops callbacks, and
nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However
it doesn't need to.
Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each
delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list
and so needed the lock. Since then it doesn't walk the list and doesn't
need the lock.
Two actions are performed under the lock. One is to call
nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on
the nfs4_file at all, so don't need the lock.
The other is to set ->fi_had_conflict which is in the nfs4_file.
This field is only ever set here (except when initialised to false)
so there is no possible problem will multiple threads racing when
setting it.
The field is tested twice in nfs4_set_delegation(). The first test does
not hold a lock and is documented as an opportunistic optimisation, so
it doesn't impose any need to hold ->fi_lock while setting
->fi_had_conflict.
The second test in nfs4_set_delegation() *is* make under ->fi_lock, so
removing the locking when ->fi_had_conflict is set could make a change.
The change could only be interesting if ->fi_had_conflict tested as
false even though nfsd_break_one_deleg() ran before ->fi_lock was
unlocked. i.e. while hash_delegation_locked() was running.
As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb()
there can be no importance to this interaction.
So this patch removes the locking from nfsd_break_one_deleg() and moves
the final test on ->fi_had_conflict out of the locked region to make it
clear that locking isn't important to the test. It is still tested
*after* vfs_setlease() has succeeded. This might be significant and as
vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called
under ->flc_lock this "after" is a true ordering provided by a spinlock.
Fixes: edcf972515 ("nfsd: fix RELEASE_LOCKOWNER")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Zdenek reported seeing some AVC denials due to nfsd trying to set
delegations:
type=AVC msg=audit(09.11.2023 09:03:46.411:496) : avc: denied { lease } for pid=5127 comm=rpc.nfsd capability=lease scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:nfsd_t:s0 tclass=capability permissive=0
When setting delegations on behalf of nfsd, we don't want to do all of
the normal capabilty and LSM checks. nfsd is a kernel thread and runs
with CAP_LEASE set, so the uid checks end up being a no-op in most cases
anyway.
Some nfsd functions can end up running in normal process context when
tearing down the server. At that point, the CAP_LEASE check can fail and
cause the client to not tear down delegations when expected.
Also, the way the per-fs ->setlease handlers work today is a little
convoluted. The non-trivial ones are wrappers around generic_setlease,
so when they fail due to permission problems they usually they end up
doing a little extra work only to determine that they can't set the
lease anyway. It would be more efficient to do those checks earlier.
Transplant the permission checking from generic_setlease to
vfs_setlease, which will make the permission checking happen earlier on
filesystems that have a ->setlease operation. Add a new kernel_setlease
function that bypasses these checks, and switch nfsd to use that instead
of vfs_setlease.
There is one behavioral change here: prior this patch the
setlease_notifier would fire even if the lease attempt was going to fail
the security checks later. With this change, it doesn't fire until the
caller has passed them. I think this is a desirable change overall. nfsd
is the only user of the setlease_notifier and it doesn't benefit from
being notified about failed attempts.
Cc: Ondrej Mosnáček <omosnacek@gmail.com>
Reported-by: Zdenek Pytela <zpytela@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2248830
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240205-bz2248830-v1-1-d0ec0daecba1@kernel.org
Acked-by: Tom Talpey <tom@talpey.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new struct file_lease and move the lease-specific fields from
struct file_lock to it. Convert the appropriate API calls to take
struct file_lease instead, and convert the callers to use them.
There is zero overlap between the lock manager operations for file
locks and the ones for file leases, so split the lease-related
operations off into a new lease_manager_operations struct.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-47-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-45-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-44-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-43-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-42-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-41-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-40-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-39-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-38-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-37-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-36-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-35-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-34-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reduce some pointer manipulation by just using file_lock_core where we
can and only translate to a file_lock when needed.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-33-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
locks_translate_pid is used on both locks and leases, so have that take
struct file_lock_core.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-32-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Have these functions take a file_lock_core pointer instead of a
file_lock.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-31-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Have locks_wake_up_blocks take a file_lock_core pointer, and fix up the
callers to pass one in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-30-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Rename the old __locks_delete_block to __locks_unlink_lock. Rename
change old locks_delete_block function to __locks_delete_block and
have it take a file_lock_core. Make locks_delete_block a simple wrapper
around __locks_delete_block.
Also, change __locks_insert_block to take struct file_lock_core, and
fix up its callers.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-28-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Rework the internals of locks_delete_block to use struct file_lock_core
(mostly just for clarity's sake). The prototype is not changed.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-27-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Both locks and leases deal with fl_blocker. Switch the fl_blocker
pointer in struct file_lock_core to point to the file_lock_core of the
blocker instead of a file_lock structure.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-26-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Have both __locks_insert_block and the deadlock and conflict checking
functions take a struct file_lock_core pointer instead of a struct
file_lock one. Also, change posix_locks_deadlock to return bool.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-25-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert __locks_delete_block and __locks_wake_up_blocks to take a struct
file_lock_core pointer.
While we could do this in another way, we're going to need to add a
file_lock() helper function later anyway, so introduce and use it now.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-24-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Have locks_insert_global_blocked and locks_delete_global_blocked take a
struct file_lock_core pointer.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-23-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert these functions to take a file_lock_core instead of a file_lock.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-22-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert posix_owner_key to take struct file_lock_core pointer, and fix
up the callers to pass one in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-21-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change posix_same_owner to take struct file_lock_core pointers, and
convert the callers to pass those in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-20-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert more internal fs/locks.c functions to take and deal with struct
file_lock_core instead of struct file_lock:
- locks_dump_ctx_list
- locks_check_ctx_file_list
- locks_release_private
- locks_owner_has_blockers
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-19-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert fs/locks.c to access fl_core fields direcly rather than using
the backward-compatibility macros. Most of this was done with
coccinelle, with a few by-hand fixups.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-18-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.
For now, add some macros to ensure that we can continue to build while
the conversion is in progress.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
These don't add a lot of value over just open-coding the flag check.
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-16-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert to using the new file locking helper functions. Also, in later
patches we're going to introduce some macros with names that clash with
the variable names in nfsd4_lock. Rename them.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-12-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert to using the new file locking helper functions. Also, in later
patches we're going to introduce some temporary macros with names that
clash with the variable name in nfs4_proc_unlck. Rename it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-11-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert to using the new file locking helper functions. Also in later
patches we're going to introduce some macros with names that clash with
the variable names in nlmclnt_lock. Rename them.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-10-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>