IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
You can't call idr_remove() from within a idr_for_each() callback,
but you can call xa_erase() from an xa_for_each() loop, so switch the
entire personality_idr from the IDR to the XArray. This manifests as a
use-after-free as idr_for_each() attempts to walk the rest of the node
after removing the last entry from it.
Fixes: 071698e13ac6 ("io_uring: allow registering credentials")
Cc: stable@vger.kernel.org # 5.6+
Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[Pavel: rebased (creds load was moved into io_init_req())]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ccff36e1375f2b0ebf73d957f037b43becc0dde.1615212806.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are enough of problems with IORING_SETUP_R_DISABLED, including the
burden of checking and kicking off the SQO task all over the codebase --
for exit/cancel/etc.
Rework it, always start the thread but don't do submit unless the flag
is gone, that's much easier.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq now is per-task, so cancellations now should match against
request's ctx.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We keep running into weird dependency issues between the sqd lock and
the parking state. Disentangle the SQPOLL thread from the last bits of
the kthread parking inheritance, and just replace the parking state,
and two associated locks, with a single rw mutex. The SQPOLL thread
keeps the mutex for read all the time, except if someone has marked us
needing to park. Then we drop/re-acquire and try again.
This greatly simplifies the parking state machine (by just getting rid
of it), and makes it a lot more obvious how it works - if you need to
modify the ctx list, then you simply park the thread which will grab
the lock for writing.
Fold in fix from Hillf Danton on not setting STOP on a fatal signal.
Fixes: e54945ae947f ("io_uring: SQPOLL stop error handling fixes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An xattr 'get' handler is expected to return the length of the value on
success, yet _nfs4_get_security_label() (and consequently also
nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns
just 0 on success.
Fix this by returning label.len instead, which contains the length of
the result.
Fixes: aa9c2669626c ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A cleanup of the inter SSC copy needs to call fput() of the source
file handle to make sure that file structure is freed as well as
drop the reference on the superblock to unmount the source server.
Fixes: 36e1e5ba90fb ("NFSD: Fix use-after-free warning when doing inter-server copy")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Nowadays, we indirectly use the idmap-aware helper functions in the VFS
to set the initial uid and gid of a file being created. Unfortunately,
we didn't convert the quota code, which means we attach the wrong dquots
to files created on an idmapped mount.
Fixes: f736d93d76d3 ("xfs: support idmapped mounts")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In case if isi.nr_pages is 0, we are making sis->pages (which is
unsigned int) a huge value in iomap_swapfile_activate() by assigning -1.
This could cause a kernel crash in kernel v4.18 (with below signature).
Or could lead to unknown issues on latest kernel if the fake big swap gets
used.
Fix this issue by returning -EINVAL in case of nr_pages is 0, since it
is anyway a invalid swapfile. Looks like this issue will be hit when
we have pagesize < blocksize type of configuration.
I was able to hit the issue in case of a tiny swap file with below
test script.
https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/scripts/swap-issue.sh
kernel crash analysis on v4.18
==============================
On v4.18 kernel, it causes a kernel panic, since sis->pages becomes
a huge value and isi.nr_extents is 0. When 0 is returned it is
considered as a swapfile over NFS and SWP_FILE is set (sis->flags |= SWP_FILE).
Then when swapoff was getting called it was calling a_ops->swap_deactivate()
if (sis->flags & SWP_FILE) is true. Since a_ops->swap_deactivate() is
NULL in case of XFS, it causes below panic.
Panic signature on v4.18 kernel:
=======================================
root@qemu:/home/qemu# [ 8291.723351] XFS (loop2): Unmounting Filesystem
[ 8292.123104] XFS (loop2): Mounting V5 Filesystem
[ 8292.132451] XFS (loop2): Ending clean mount
[ 8292.263362] Adding 4294967232k swap on /mnt1/test/swapfile. Priority:-2 extents:1 across:274877906880k
[ 8292.277834] Unable to handle kernel paging request for instruction fetch
[ 8292.278677] Faulting instruction address: 0x00000000
cpu 0x19: Vector: 400 (Instruction Access) at [c0000009dd5b7ad0]
pc: 0000000000000000
lr: c0000000003eb9dc: destroy_swap_extents+0xfc/0x120
sp: c0000009dd5b7d50
msr: 8000000040009033
current = 0xc0000009b6710080
paca = 0xc00000003ffcb280 irqmask: 0x03 irq_happened: 0x01
pid = 5604, comm = swapoff
Linux version 4.18.0 (riteshh@xxxxxxx) (gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)) #57 SMP Wed Mar 3 01:33:04 CST 2021
enter ? for help
[link register ] c0000000003eb9dc destroy_swap_extents+0xfc/0x120
[c0000009dd5b7d50] c0000000025a7058 proc_poll_event+0x0/0x4 (unreliable)
[c0000009dd5b7da0] c0000000003f0498 sys_swapoff+0x3f8/0x910
[c0000009dd5b7e30] c00000000000bbe4 system_call+0x5c/0x70
Exception: c01 (System Call) at 00007ffff7d208d8
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
[djwong: rework the comment to provide more details]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This reverts commit 94415b06eb8aed13481646026dc995f04a3a534a.
That commit claimed to allow a client to get a read delegation when it
was the only writer. Actually it allowed a client to get a read
delegation when *any* client has a write open!
The main problem is that it's depending on nfs4_clnt_odstate structures
that are actually only maintained for pnfs exports.
This causes clients to miss writes performed by other clients, even when
there have been intervening closes and opens, violating close-to-open
cache consistency.
We can do this a different way, but first we should just revert this.
I've added pynfs 4.1 test DELEG19 to test for this, as I should have
done originally!
Cc: stable@vger.kernel.org
Reported-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This reverts commit 50747dd5e47b "nfsd4: remove check_conflicting_opens
warning", as a prerequisite for reverting 94415b06eb8a, which has a
serious bug.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In case of interrupted syscalls, prevent sending CLOSE commands for
compound CREATE+CLOSE requests by introducing an
CIFS_CP_CREATE_CLOSE_OP flag to indicate lower layers that it should
not send a CLOSE command to the MIDs corresponding the compound
CREATE+CLOSE request.
A simple reproducer:
#!/bin/bash
mount //server/share /mnt -o username=foo,password=***
tc qdisc add dev eth0 root netem delay 450ms
stat -f /mnt &>/dev/null & pid=$!
sleep 0.01
kill $pid
tc qdisc del dev eth0 root
umount /mnt
Before patch:
...
6 0.256893470 192.168.122.2 → 192.168.122.15 SMB2 402 Create Request File: ;GetInfo Request FS_INFO/FileFsFullSizeInformation;Close Request
7 0.257144491 192.168.122.15 → 192.168.122.2 SMB2 498 Create Response File: ;GetInfo Response;Close Response
9 0.260798209 192.168.122.2 → 192.168.122.15 SMB2 146 Close Request File:
10 0.260841089 192.168.122.15 → 192.168.122.2 SMB2 130 Close Response, Error: STATUS_FILE_CLOSED
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
use its return value rather than always returning 0. Instead, use rc
variable as it is properly set to 0 in case there is no
server->ops->queryfs.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
A customer has reported that their dmesg were being flooded by
CIFS: VFS: \\server Cancelling wait for mid xxx cmd: a
CIFS: VFS: \\server Cancelling wait for mid yyy cmd: b
CIFS: VFS: \\server Cancelling wait for mid zzz cmd: c
because some processes that were performing statfs(2) on the share had
been interrupted due to their automount setup when certain users
logged in and out.
Change it to FYI as they should be mostly informative rather than
error messages.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The MIDs are mostly printed as decimal, so let's make it consistent.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
nfs_set_cache_invalid() has code to handle delegations, and other
optimisations, so let's use it when appropriate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
nfs_set_cache_invalid() has code to handle delegations, and other
optimisations, so let's use it when appropriate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The fact that the lookup revalidation failed, does not mean that the
inode contents have changed.
Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There should be no reason to expect the directory permissions to change
just because the directory contents changed or a negative lookup timed
out. So let's avoid doing a full call to nfs_mark_for_revalidate() in
that case.
Furthermore, if this is a negative dentry, and we haven't actually done
a new lookup, then we have no reason yet to believe the directory has
changed at all. So let's remove the gratuitous directory inode
invalidation altogether when called from
nfs_lookup_revalidate_negative().
Reported-by: Geert Jansen <gerardu@amazon.com>
Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
CREATE requests return a post_op_fh3, rather than nfs_fh3. The
post_op_fh3 includes an extra word to indicate 'handle_follows'.
Without that additional word, create fails when full 64-byte
filehandles are in use.
Add NFS3_post_op_fh_sz, and correct the size calculation for
NFS3_createres_sz.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This follows what was done in 8c2fabc6542d9d0f8b16bd1045c2eda59bdcde13.
With the default being m, it's impossible to build the module into the
kernel.
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Creating a series of detached mounts, attaching them to the filesystem,
and unmounting them can be used to trigger an integer overflow in
ns->mounts causing the kernel to block any new mounts in count_mounts()
and returning ENOSPC because it falsely assumes that the maximum number
of mounts in the mount namespace has been reached, i.e. it thinks it
can't fit the new mounts into the mount namespace anymore.
Depending on the number of mounts in your system, this can be reproduced
on any kernel that supportes open_tree() and move_mount() by compiling
and running the following program:
/* SPDX-License-Identifier: LGPL-2.1+ */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
/* open_tree() */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef __NR_open_tree
#if defined __alpha__
#define __NR_open_tree 538
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_open_tree 4428
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_open_tree 6428
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_open_tree 5428
#endif
#elif defined __ia64__
#define __NR_open_tree (428 + 1024)
#else
#define __NR_open_tree 428
#endif
#endif
/* move_mount() */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#endif
#ifndef __NR_move_mount
#if defined __alpha__
#define __NR_move_mount 539
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_move_mount 4429
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_move_mount 6429
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_move_mount 5429
#endif
#elif defined __ia64__
#define __NR_move_mount (428 + 1024)
#else
#define __NR_move_mount 429
#endif
#endif
static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
{
return syscall(__NR_open_tree, dfd, filename, flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
}
static bool is_shared_mountpoint(const char *path)
{
bool shared = false;
FILE *f = NULL;
char *line = NULL;
int i;
size_t len = 0;
f = fopen("/proc/self/mountinfo", "re");
if (!f)
return 0;
while (getline(&line, &len, f) > 0) {
char *slider1, *slider2;
for (slider1 = line, i = 0; slider1 && i < 4; i++)
slider1 = strchr(slider1 + 1, ' ');
if (!slider1)
continue;
slider2 = strchr(slider1 + 1, ' ');
if (!slider2)
continue;
*slider2 = '\0';
if (strcmp(slider1 + 1, path) == 0) {
/* This is the path. Is it shared? */
slider1 = strchr(slider2 + 1, ' ');
if (slider1 && strstr(slider1, "shared:")) {
shared = true;
break;
}
}
}
fclose(f);
free(line);
return shared;
}
static void usage(void)
{
const char *text = "mount-new [--recursive] <base-dir>\n";
fprintf(stderr, "%s", text);
_exit(EXIT_SUCCESS);
}
#define exit_usage(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
usage(); \
})
#define exit_log(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
exit(EXIT_FAILURE); \
})
static const struct option longopts[] = {
{"help", no_argument, 0, 'a'},
{ NULL, no_argument, 0, 0 },
};
int main(int argc, char *argv[])
{
int exit_code = EXIT_SUCCESS, index = 0;
int dfd, fd_tree, new_argc, ret;
char *base_dir;
char *const *new_argv;
char target[PATH_MAX];
while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
switch (ret) {
case 'a':
/* fallthrough */
default:
usage();
}
}
new_argv = &argv[optind];
new_argc = argc - optind;
if (new_argc < 1)
exit_usage("Missing base directory\n");
base_dir = new_argv[0];
if (*base_dir != '/')
exit_log("Please specify an absolute path");
/* Ensure that target is a shared mountpoint. */
if (!is_shared_mountpoint(base_dir))
exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);
dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
if (dfd < 0)
exit_log("%m - Failed to open base directory \"%s\"", base_dir);
ret = mkdirat(dfd, "detached-move-mount", 0755);
if (ret < 0)
exit_log("%m - Failed to create required temporary directories");
ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
if (ret < 0 || (size_t)ret >= sizeof(target))
exit_log("%m - Failed to assemble target path");
/*
* Having a mount table with 10000 mounts is already quite excessive
* and shoult account even for weird test systems.
*/
for (size_t i = 0; i < 10000; i++) {
fd_tree = sys_open_tree(dfd, "detached-move-mount",
OPEN_TREE_CLONE |
OPEN_TREE_CLOEXEC |
AT_EMPTY_PATH);
if (fd_tree < 0) {
fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
if (ret < 0) {
if (errno == ENOSPC)
fprintf(stderr, "%m - Buggy mount counting");
else
fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
close(fd_tree);
ret = umount2(target, MNT_DETACH);
if (ret < 0) {
fprintf(stderr, "%m - Failed to unmount %s", target);
exit_code = EXIT_FAILURE;
break;
}
}
(void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
close(dfd);
exit(exit_code);
}
and wait for the kernel to refuse any new mounts by returning ENOSPC.
How many iterations are needed depends on the number of mounts in your
system. Assuming you have something like 50 mounts on a standard system
it should be almost instantaneous.
The root cause of this is that detached mounts aren't handled correctly
when source and target mount are identical and reside on a shared mount
causing a broken mount tree where the detached source itself is
propagated which propagation prevents for regular bind-mounts and new
mounts. This ultimately leads to a miscalculation of the number of
mounts in the mount namespace.
Detached mounts created via
open_tree(fd, path, OPEN_TREE_CLONE)
are essentially like an unattached new mount, or an unattached
bind-mount. They can then later on be attached to the filesystem via
move_mount() which calls into attach_recursive_mount(). Part of
attaching it to the filesystem is making sure that mounts get correctly
propagated in case the destination mountpoint is MS_SHARED, i.e. is a
shared mountpoint. This is done by calling into propagate_mnt() which
walks the list of peers calling propagate_one() on each mount in this
list making sure it receives the propagation event.
The propagate_one() functions thereby skips both new mounts and bind
mounts to not propagate them "into themselves". Both are identified by
checking whether the mount is already attached to any mount namespace in
mnt->mnt_ns. The is what the IS_MNT_NEW() helper is responsible for.
However, detached mounts have an anonymous mount namespace attached to
them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't
realize they need to be skipped causing the mount to propagate "into
itself" breaking the mount table and causing a disconnect between the
number of mounts recorded as being beneath or reachable from the target
mountpoint and the number of mounts actually recorded/counted in
ns->mounts ultimately causing an overflow which in turn prevents any new
mounts via the ENOSPC issue.
So teach propagation to handle detached mounts by making it aware of
them. I've been tracking this issue down for the last couple of days and
then verifying that the fix is correct by
unmounting everything in my current mount table leaving only /proc and
/sys mounted and running the reproducer above overnight verifying the
number of mounts counted in ns->mounts. With this fix the counts are
correct and the ENOSPC issue can't be reproduced.
This change will only have an effect on mounts created with the new
mount API since detached mounts cannot be created with the old mount API
so regressions are extremely unlikely.
Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com
Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around")
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Both s390 and alpha have been switched to 64-bit ino_t with
commit 96c0a6a72d18 ("s390,alpha: switch to 64-bit ino_t").
Therefore enable TMPFS_INODE64 for both architectures again.
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Link: https://lore.kernel.org/linux-mm/YCV7QiyoweJwvN+m@osiris/
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Martin reported an issue that directory read could be hung on the
latest -rc kernel with some certain image. The root cause is that
commit baa2c7c97153 ("block: set .bi_max_vecs as actual allocated
vector number") changes .bi_max_vecs behavior. bio->bi_max_vecs
is set as actual allocated vector number rather than the requested
number now.
Let's avoid using .bi_max_vecs completely instead.
Link: https://lore.kernel.org/r/20210306040438.8084-1-hsiangkao@aol.com
Reported-by: Martin DEVERA <devik@eaxlabs.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[ Gao Xiang: note that <= 5.11 kernels are not impacted. ]
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
This brings the behavior back in line with what 5.11 and earlier did,
and this is no longer needed with the improved handling of creds
not needing to do unshare().
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_SETUP_ATTACH_WQ we should let __io_sq_thread() use the
initial creds from each ctx.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a simple warning making sure that nobody tries to create a new
manager while we're under IO_WQ_BIT_EXIT. That can potentially happen
due to racy work submission after final put.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_ring_exit_work() have to cancel all requests, including those staying
in io-wq, however it tries only cancellation of current tctx, which is
NULL. If we've got task==NULL, use the ctx-to-tctx map to go over all
tctx/io-wq and try cancellations on them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use system_unbound_wq to run io_ring_exit_work(), so it's hard to
monitor whether removal hang or not. Add WARN_ONCE to catch hangs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't use task file notes anymore, and no need left in indexing
task->io_uring->xa by file, and replace it with ctx. It's better
design-wise, especially since we keep a dangling file, and so have to
keep an eye on not dereferencing it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With ->flush() gone we're now leaving all uring file notes until the
task dies/execs, so the ctx will not be freed until all tasks that have
ever submit a request die. It was nicer with flush but not much, we
could have locked as described ctx in many cases.
Now we guarantee that ctx outlives all tctx in a sense that
io_ring_exit_work() waits for all tctxs to drop their corresponding
enties in ->xa, and ctx won't go away until then. Hence, additional
io_uring file reference (a.k.a. task file notes) are not needed anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Another preparation patch. When full quiesce is done on ctx exit, use
task_work infra to remove corresponding to the ctx io_uring->xa entries.
For that we use the back tctx map. Also use ->in_idle to prevent
removing it while we traversing ->xa on cancellation, just ignore it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For each pair tcxt-ctx create an object and chain it into ctx, so we
have a way to traverse all tctx that are using current ctx. Preparation
patch, will be used later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rework io_uring_del_task_file(), so it accepts an index to delete, and
it's not necessarily have to be in the ->xa. Infer file from xa_erase()
to maintain a single origin of truth.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds code to function trans_drain to remove drained
bd elements from the ail lists, if queued, before freeing the bd.
If we don't remove the bd from the ail, function ail_drain will
try to reference the bd after it has been freed by trans_drain.
Thanks to Andy Price for his analysis of the problem.
Reported-by: Andy Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
It fixes the following warning detected by coccinelle:
./fs/gfs2/super.c:592:5-10: Unneeded variable: "error". Return "0" on
line 628
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The typical result of the backwards comparison here is that the source
server in a server-to-server copy will return BAD_STATEID within a few
seconds of the copy starting, instead of giving the copy a full lease
period, so the copy_file_range() call will end up unnecessarily
returning a short read.
Fixes: 624322f1adc5 "NFSD add COPY_NOTIFY operation"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When NFSD_V4 is enabled and CRYPTO is disabled,
Kbuild gives the following warning:
WARNING: unmet direct dependencies detected for CRYPTO_SHA256
Depends on [n]: CRYPTO [=n]
Selected by [y]:
- NFSD_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFSD [=y] && PROC_FS [=y]
WARNING: unmet direct dependencies detected for CRYPTO_MD5
Depends on [n]: CRYPTO [=n]
Selected by [y]:
- NFSD_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFSD [=y] && PROC_FS [=y]
This is because NFSD_V4 selects CRYPTO_MD5 and CRYPTO_SHA256,
without depending on or selecting CRYPTO, despite those config options
being subordinate to CRYPTO.
Signed-off-by: Julian Braha <julianbraha@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If a file is unhashed, then we're going to reject it anyway and retry,
so make sure we skip it when we're doing the RCU lockless lookup.
This avoids a number of unnecessary nfserr_jukebox returns from
nfsd_file_acquire()
Fixes: 65294c1f2c5e ("nfsd: add a new struct file caching facility to nfsd")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If we go async with a request, grab the creds that the task currently has
assigned and make sure that the async side switches to them. This is
handled in the same way that we do for registered personalities.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ran into a use-after-free on the main io-wq struct, wq. It has a worker
ref and completion event, but the manager itself isn't holding a
reference. This can lead to a race where the manager thinks there are
no workers and exits, but a worker is being added. That leads to the
following trace:
BUG: KASAN: use-after-free in io_wqe_worker+0x4c0/0x5e0
Read of size 8 at addr ffff888108baa8a0 by task iou-wrk-3080422/3080425
CPU: 5 PID: 3080425 Comm: iou-wrk-3080422 Not tainted 5.12.0-rc1+ #110
Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
Call Trace:
dump_stack+0x90/0xbe
print_address_description.constprop.0+0x67/0x28d
? io_wqe_worker+0x4c0/0x5e0
kasan_report.cold+0x7b/0xd4
? io_wqe_worker+0x4c0/0x5e0
__asan_load8+0x6d/0xa0
io_wqe_worker+0x4c0/0x5e0
? io_worker_handle_work+0xc00/0xc00
? recalc_sigpending+0xe5/0x120
? io_worker_handle_work+0xc00/0xc00
? io_worker_handle_work+0xc00/0xc00
ret_from_fork+0x1f/0x30
Allocated by task 3080422:
kasan_save_stack+0x23/0x60
__kasan_kmalloc+0x80/0xa0
kmem_cache_alloc_node_trace+0xa0/0x480
io_wq_create+0x3b5/0x600
io_uring_alloc_task_context+0x13c/0x380
io_uring_add_task_file+0x109/0x140
__x64_sys_io_uring_enter+0x45f/0x660
do_syscall_64+0x32/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 3080422:
kasan_save_stack+0x23/0x60
kasan_set_track+0x20/0x40
kasan_set_free_info+0x24/0x40
__kasan_slab_free+0xe8/0x120
kfree+0xa8/0x400
io_wq_put+0x14a/0x220
io_wq_put_and_exit+0x9a/0xc0
io_uring_clean_tctx+0x101/0x140
__io_uring_files_cancel+0x36e/0x3c0
do_exit+0x169/0x1340
__x64_sys_exit+0x34/0x40
do_syscall_64+0x32/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Have the manager itself hold a reference, and now both drop points drop
and complete if we hit zero, and the manager can unconditionally do a
wait_for_completion() instead of having a race between reading the ref
count and waiting if it was non-zero.
Fixes: fb3a1f6c745c ("io-wq: have manager wait for all workers to exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When doing a large read or write workload we only
very gradually increase the number of credits
which can cause problems with parallelizing large i/o
(I/O ramps up more slowly than it should for large
read/write workloads) especially with multichannel
when the number of credits on the secondary channels
starts out low (e.g. less than about 130) or when
recovering after server throttled back the number
of credit.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
With multichannel, operations like the queries
from "ls -lR" can cause all credits to be used and
errors to be returned since max_credits was not
being set correctly on the secondary channels and
thus the client was requesting 0 credits incorrectly
in some cases (which can lead to not having
enough credits to perform any operation on that
channel).
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
CC: <stable@vger.kernel.org> # v5.8+
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
where sbi->s_mb_prefetch is unsigned integer type.
32 is the maximum allowed power of s_log_groups_per_flex. Following if
check will also trigger UBSAN shift-out-of-bound:
if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {
So I'm checking it against the raw number, perhaps there is another way
to calculate UINT_MAX max power. Also use min_t as to make sure it's
uint type.
[1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
shift exponent 60 is too large for 32-bit type 'int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x137/0x1be lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:148 [inline]
__ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983
Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while
the transaction is started like:
fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
might_alloc include/linux/sched/mm.h:193 [inline]
slab_pre_alloc_hook mm/slab.h:493 [inline]
slab_alloc_node mm/slub.c:2817 [inline]
__kmalloc_node+0x5f/0x430 mm/slub.c:4015
kmalloc_node include/linux/slab.h:575 [inline]
kvmalloc_node+0x61/0xf0 mm/util.c:587
kvmalloc include/linux/mm.h:781 [inline]
ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
This should be impossible since transaction start sets PF_MEMALLOC_NOFS.
Add some assertions to the code to catch if something isn't working as
expected early.
Link: https://lore.kernel.org/linux-ext4/000000000000563a0205bafb7970@google.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210222171626.21884-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
fails at significant rates on the two test scenarios that disable
delayed allocation (ext3conv and data_journal) and force actual block
allocation for the fallocate and pwrite functions in the test. The
failure rate on 5.10 for both ext3conv and data_journal on one test
system typically runs about 85%. On 5.11, the failure rate on ext3conv
sometimes drops to as low as 1% while the rate on data_journal
increases to nearly 100%.
The observed failures are largely due to ext4_should_retry_alloc()
cutting off block allocation retries when s_mb_free_pending (used to
indicate that a transaction in progress will free blocks) is 0.
However, free space is usually available when this occurs during runs
of generic/371. It appears that a thread attempting to allocate
blocks is just missing transaction commits in other threads that
increase the free cluster count and reset s_mb_free_pending while
the allocating thread isn't running. Explicitly testing for free space
availability avoids this race.
The current code uses a post-increment operator in the conditional
expression that determines whether the retry limit has been exceeded.
This means that the conditional expression uses the value of the
retry counter before it's increased, resulting in an extra retry cycle.
The current code actually retries twice before hitting its retry limit
rather than once.
Increasing the retry limit to 3 from the current actual maximum retry
count of 2 in combination with the change described above reduces the
observed failure rate to less that 0.1% on both ext3conv and
data_journal with what should be limited impact on users sensitive to
the overhead caused by retries.
A per filesystem percpu counter exported via sysfs is added to allow
users or developers to track the number of times the retry limit is
exceeded without resorting to debugging methods. This should provide
some insight into worst case retry behavior.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBCYeIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpisOD/9bSFR7gRqO9oIy6/PEveRI4PWDujjcXgRZ
6jxQnfFUrNQsXcXIlHO4HUDG7DVX/isxdk/YVGhVfuKoco/a0XyYAALH5SVy77T+
hDdWCIJBXgxnfAvv+xMBQDEwlz+pdaOLfOVaGMRAp3akuVTBMA+ZE940Lc81kBaU
bTGev+BzPUsUE7n6ebPdhIQDA6LB02e7kaBZsRDwjsABJuD3o4O1jOAtZyqpPRsW
nADvxsrlMxB3RN97iokinBXV426iAQ/nBDYVDVnWpbckD7Ti4f6r2ohku0qEdhZS
XrTF+1mzEqdmvMLl1YQ/GGpH7ReOLHN78aj4BaG49+pryfkaFe50AHr7frGqKLms
DWymTJnpdJSTNT0Z2GRLNrnWHa3YgeuPMdhlIPfihnZBXhZ7p6X5iNpQ69jd93P3
zLXMJ0RKpkl6bmV+Pk4kCqUfz1BV3sUqG9euLdTq+3uBRA0/B5ktPosyH2DGqUYa
n9aEUHslwHUF+Deu/S9RmVzhTjuD0IRbURSeayimFFe71kHhKsHShOKQMUkhu6zQ
AMsQRq9VrWy/3x3C+qpcbEJ3BIqyGLbiQByOBx96kg9Zk14io3GEmSlqZcxbsKTq
/JXjanaEcUwtKKccOC6g+O+G7VlskO9gLi/Fj/x98R92UBEqpEtVZb8MLCdpiLY/
SHJHbC7Fpw==
=w0Sf
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-03-05' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A bit of a mix between fallout from the worker change, cleanups and
reductions now possible from that change, and fixes in general. In
detail:
- Fully serialize manager and worker creation, fixing races due to
that.
- Clean up some naming that had gone stale.
- SQPOLL fixes.
- Fix race condition around task_work rework that went into this
merge window.
- Implement unshare. Used for when the original task does unshare(2)
or setuid/seteuid and friends, drops the original workers and forks
new ones.
- Drop the only remaining piece of state shuffling we had left, which
was cred. Move it into issue instead, and we can drop all of that
code too.
- Kill f_op->flush() usage. That was such a nasty hack that we had
out of necessity, we no longer need it.
- Following from ->flush() removal, we can also drop various bits of
ctx state related to SQPOLL and cancelations.
- Fix an issue with IOPOLL retry, which originally was fallout from a
filemap change (removing iov_iter_revert()), but uncovered an issue
with iovec re-import too late.
- Fix an issue with system suspend.
- Use xchg() for fallback work, instead of cmpxchg().
- Properly destroy io-wq on exec.
- Add create_io_thread() core helper, and use that in io-wq and
io_uring. This allows us to remove various silly completion events
related to thread setup.
- A few error handling fixes.
This should be the grunt of fixes necessary for the new workers, next
week should be quieter. We've got a pending series from Pavel on
cancelations, and how tasks and rings are indexed. Outside of that,
should just be minor fixes. Even with these fixes, we're still killing
a net ~80 lines"
* tag 'io_uring-5.12-2021-03-05' of git://git.kernel.dk/linux-block: (41 commits)
io_uring: don't restrict issue_flags for io_openat
io_uring: make SQPOLL thread parking saner
io-wq: kill hashed waitqueue before manager exits
io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
io_uring: don't keep looping for more events if we can't flush overflow
io_uring: move to using create_io_thread()
kernel: provide create_io_thread() helper
io_uring: reliably cancel linked timeouts
io_uring: cancel-match based on flags
io-wq: ensure all pending work is canceled on exit
io_uring: ensure that threads freeze on suspend
io_uring: remove extra in_idle wake up
io_uring: inline __io_queue_async_work()
io_uring: inline io_req_clean_work()
io_uring: choose right tctx->io_wq for try cancel
io_uring: fix -EAGAIN retry with IOPOLL
io-wq: fix error path leak of buffered write hash map
io_uring: remove sqo_task
io_uring: kill sqo_dead and sqo submission halting
io_uring: ignore double poll add on the same waitqueue head
...