All requests are now sent with one of the fuse_simple_... helpers. Get rid
of the old api from the fuse internal header.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename fuse_request_send_notify_reply() to fuse_simple_notify_reply() and
convert to passing fuse_args instead of fuse_req.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since we cannot reserve the request structure up-front, use __GFP_NOFAIL
to make sure that the request allocation doesn't fail.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Derive fuse_writepage_args from fuse_io_args.
Sending the request is tricky since it was done with fi->lock held, hence
we must either use atomic allocation or release the lock. Both are
possible so try atomic first and if it fails, release the lock and do the
regular allocation with GFP_NOFS and __GFP_NOFAIL. Both flags are
necessary for correct operation.
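A minimal sketch of the allocation pattern (names abbreviated):

	/* fi->lock is held; try an atomic allocation first. */
	wpa = kzalloc(sizeof(*wpa), GFP_ATOMIC);
	if (!wpa) {
		spin_unlock(&fi->lock);
		/* GFP_NOFS prevents filesystem reentry from reclaim;
		 * __GFP_NOFAIL guarantees the allocation succeeds. */
		wpa = kzalloc(sizeof(*wpa), GFP_NOFS | __GFP_NOFAIL);
		spin_lock(&fi->lock);
	}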
Move the page realloc function from dev.c to file.c and convert to using
fuse_writepage_args.
The last caller of fuse_write_fill() is gone, so get rid of it.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The old fuse_read_fill() helper can be deleted, now that the last user is
gone.
The fuse_io_args struct is moved to fuse_i.h so it can be shared between
readdir/read code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Extend fuse_io_args with 'attr_ver' and 'ff' members, which take over the
functionality of the same-named members in fuse_req.
fuse_short_read() can now take struct fuse_args_pages.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Change of semantics in fuse_async_req_send/fuse_send_(read|write): these
can now return an error, in which case the 'end' callback isn't called, so
the fuse_io_args object needs to be freed.
Added verification that the return value is sane (less than or equal to the
requested read/write size).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Create a helper named fuse_simple_background() that is similar to
fuse_simple_request(). Unlike the latter, it returns immediately and calls
the supplied 'end' callback when the reply is received.
The supplied 'args' pointer is stored in 'fuse_req' which allows the
callback to interpret the output arguments decoded from the reply.
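A sketch of the calling pattern, using names from the related commits in
this series (fuse_io_args, fuse_aio_complete_req); treat the details as
illustrative:

	ia->ap.args.end = fuse_aio_complete_req;	/* reply callback */
	err = fuse_simple_background(fc, &ia->ap.args, GFP_KERNEL);
	if (err)
		kfree(ia);	/* 'end' is not called on submission failure */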
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Extract a fuse_write_flags() helper that converts the write-relevant
ki_flags to open flags.
The other parts of fuse_send_write() aren't used in the
fuse_perform_write() case.
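The helper plausibly looks like this (a sketch consistent with the
description):

	static inline unsigned int fuse_write_flags(struct kiocb *iocb)
	{
		unsigned int flags = iocb->ki_filp->f_flags;

		if (iocb->ki_flags & IOCB_DSYNC)
			flags |= O_DSYNC;
		if (iocb->ki_flags & IOCB_SYNC)
			flags |= O_SYNC;

		return flags;
	}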
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Derive fuse_io_args from struct fuse_args_pages. This will be used for
both synchronous and asynchronous read/write requests.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This will allow the use of this function when converting to the simple api
(which doesn't use fuse_req).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_simple_request() is converted to return the length of the last
(instead of a single) out arg, since FUSE_IOCTL_OUT has two out args, the
second of which is variable length.
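A sketch of an ioctl-style request with a variable-length second out arg
(field names per the fuse_args conventions in this series):

	args.out_numargs = 2;
	args.out_args[0].size = sizeof(outarg);
	args.out_args[0].value = &outarg;
	args.out_args[1].size = out_size;
	args.out_args[1].value = out_buf;
	args.out_argvar = true;	/* last out arg may come back shorter */

	transferred = fuse_simple_request(fc, &args);
	/* transferred is the actual length of out_args[1], or an error */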
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_req_pages_alloc() is moved to file.c, since its internal use by the
device code will eventually be removed.
Rename to fuse_pages_alloc() to signify that it's not only usable for
fuse_req page array.
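A sketch of the renamed helper, which allocates the page pointer array and
the descriptor array in a single allocation:

	struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
				       struct fuse_page_desc **desc)
	{
		struct page **pages;

		pages = kzalloc(npages * (sizeof(struct page *) +
					  sizeof(struct fuse_page_desc)),
				flags);
		/* Descriptors live directly after the page pointers. */
		*desc = (void *) (pages + npages);

		return pages;
	}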
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Derive fuse_args_pages from fuse_args. This is used to handle requests
which use pages for input or output. The related flags are added to
fuse_args.
A new FR_ALLOC_PAGES flag is added to indicate whether the page arrays in
fuse_req need to be freed by fuse_put_request() or not.
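A sketch of the derived structure:

	struct fuse_args_pages {
		struct fuse_args args;	/* base: plain in/out arguments */
		struct page **pages;
		struct fuse_page_desc *descs;
		unsigned int num_pages;
	};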
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We can use the "force" flag to make sure the DESTROY request is always sent
to userspace. So no need to keep it allocated during the lifetime of the
filesystem.
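With 'force', the request can simply be allocated at destroy time; a
sketch (close to, but not necessarily exactly, the resulting code):

	void fuse_send_destroy(struct fuse_conn *fc)
	{
		if (fc->conn_init) {
			FUSE_ARGS(args);

			args.opcode = FUSE_DESTROY;
			args.force = true;	/* allocation must not fail */
			fuse_simple_request(fc, &args);
		}
	}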
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In some cases it makes no sense to set pid/uid/gid fields in the request
header. Allow fuse_simple_background() to omit these. This is only
required in the "force" case, so for now just WARN if set otherwise.
Fold fuse_get_req_nofail_nopages() into its only caller. Comment is
obsolete anyway.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This will be used by fuse_force_forget().
We can expand fuse_request_send() into fuse_simple_request(). The
FR_WAITING bit has already been set, no need to check.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate
the request in that case.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of complex games with a reserved request, just use __GFP_NOFAIL.
Both callers (flush, readdir) guarantee that the connection was already
initialized, so there is no need to wait for fc->initialized.
Also remove the unneeded clearing of the FR_BACKGROUND flag.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
...to make future expansion simpler. The hierarchical structure is a
historical thing that does not serve any practical purpose.
The generated code is exactly the same before and after the patch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
This may have to wait for fuse_iqueue::waitq.lock to be released by one
of many places that take it with IRQs enabled. Since the IRQ handler
may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
Fix it by protecting the state of struct fuse_iqueue with a separate
spinlock, and only accessing fuse_iqueue::waitq using the versions of
the waitqueue functions which do IRQ-safe locking internally.
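In outline, queueing a request changes from using the waitqueue's own lock
to the new spinlock (a sketch; details abbreviated):

	/* Before: fiq state protected by the IRQ-unsafe waitq lock. */
	spin_lock(&fiq->waitq.lock);
	list_add_tail(&req->list, &fiq->pending);
	wake_up_locked(&fiq->waitq);
	spin_unlock(&fiq->waitq.lock);

	/* After: fiq->lock protects the state, and plain wake_up()
	 * takes waitq.lock internally with IRQs disabled. */
	spin_lock(&fiq->lock);
	list_add_tail(&req->list, &fiq->pending);
	spin_unlock(&fiq->lock);
	wake_up(&fiq->waitq);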
Reproducer:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>
int main()
{
	char opts[128];
	int fd = open("/dev/fuse", O_RDWR);
	aio_context_t ctx = 0;
	struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
	struct iocb *cbp = &cb;

	sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
	mkdir("mnt", 0700);
	mount("foo", "mnt", "fuse", 0, opts);
	syscall(__NR_io_setup, 1, &ctx);
	syscall(__NR_io_submit, ctx, 1, &cbp);
}
Beginning of lockdep output:
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.3.0-rc5 #9 Not tainted
-----------------------------------------------------
syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
and this task is already holding:
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
which would create a new lock dependency:
(&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&(&ctx->ctx_lock)->rlock){..-.}
[...]
Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.19+
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For some applications that end up using a submit-and-wait type of
approach for certain batches of IO, we can make that a bit more
efficient by allowing the application to block for the last IO
submission. This avoids going async when we don't need to, as the
application will be blocking for the completion event(s) anyway.
Typical use cases are using the liburing
io_uring_submit_and_wait() API, or just using io_uring_enter()
doing both submissions and completions. As a specific example,
RocksDB doing MultiGet() is sped up quite a bit with this
change.
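A sketch with liburing ('ring' and 'batch' are illustrative):

	struct io_uring ring;
	int ret = io_uring_queue_init(64, &ring, 0);
	/* ... prepare 'batch' sqes with io_uring_get_sqe()/io_uring_prep_*() ... */

	/* Submit the batch and wait for its completions in one syscall. */
	ret = io_uring_submit_and_wait(&ring, batch);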
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Version 2 upcalls will allow the nfsd to include a hash of the kerberos
principal string in the Cld_Create upcall. If a principal is present in
the svc_cred, then the hash will be included in the Cld_Create upcall.
We attempt to use the svc_cred.cr_raw_principal (which is returned by
gssproxy) first, and then fall back to using the svc_cred.cr_principal
(which is returned by both gssproxy and rpc.svcgssd). Upon a subsequent
restart, the hash will be returned in the Cld_Gracestart downcall and
stored in the reclaim_str_hashtbl so it can be used when handling
reclaim opens.
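A sketch of the principal selection (the svc_cred field names are real;
hash_principal is a hypothetical shorthand for computing the hash placed
in the upcall):

	const char *principal = NULL;

	if (clp->cl_cred.cr_raw_principal)	/* gssproxy */
		principal = clp->cl_cred.cr_raw_principal;
	else if (clp->cl_cred.cr_principal)	/* gssproxy or rpc.svcgssd */
		principal = clp->cl_cred.cr_principal;

	if (principal)
		hash_principal(principal, &cmsg);	/* hypothetical helper */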
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a "GetVersion" upcall to allow nfsd to determine the maximum upcall
version that the nfsdcld userspace daemon supports. If the daemon
responds with -EOPNOTSUPP, then we know it only supports v1.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If multiple clients are writing to the same file, then due to the fact
we share a single file descriptor between all NFSv3 clients writing
to the file, we have a situation where clients can miss the fact that
their file data was not persisted. While this should be rare, it
could cause silent data loss in situations where multiple clients
are using NLM locking or O_DIRECT to write to the same file.
Unfortunately, the stateless nature of NFSv3 and the fact that we
can only identify clients by their IP address means that we cannot
trivially cache errors; we would not know when it is safe to
release them from the cache.
So the solution is to declare a reboot. We understand that this
should be a rare occurrence, since disks are usually stable. The
most frequent occurrence is likely to be ENOSPC, at which point
all writes to the given filesystem are likely to fail anyway.
So the expectation is that clients will be forced to retry their
writes until they hit the fatal error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a file may contain unstable writes that can error out, then we want
to avoid garbage collecting the struct nfsd_file that may be
tracking those errors.
So in the garbage collector, we try to avoid collecting files that aren't
clean. Furthermore, we avoid immediately kicking off the garbage collector
when the reference count drops to zero while a write error is being
tracked.
If the file is unhashed while an error is pending, then declare a
reboot, to ensure the client resends any unstable writes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add support to allow the server to reset the boot verifier in order to
force clients to resend I/O after a timeout failure.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Ensure that we can safely clear out the file cache entries when the
nfs server is shut down on a container. Otherwise, the file cache
may end up pinning the mounts.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
To support link with drain, we need to do two things.
Consider the following sqes:
0 1 2 3 4 5 6
+-----+-----+-----+-----+-----+-----+-----+
| N | L | L | L+D | N | N | N |
+-----+-----+-----+-----+-----+-----+-----+
First, we need to ensure that the IO before the link is completed. An
easy way is to set the drain flag on the head of the link list, so all
subsequent IO will be inserted into the defer_list.
+-----+
(0) | N |
+-----+
| (2) (3) (4)
+-----+ +-----+ +-----+ +-----+
(1) | L+D | --> | L | --> | L+D | --> | N |
+-----+ +-----+ +-----+ +-----+
|
+-----+
(5) | N |
+-----+
|
+-----+
(6) | N |
+-----+
Second, we need to ensure that the following IO will not be completed
first. An easy way is to create a mirror of the drain IO and insert it
into the defer_list; as long as the drain IO is not processed, the
following IO in the defer_list will not be actively processed.
+-----+
(0) | N |
+-----+
| (2) (3) (4)
+-----+ +-----+ +-----+ +-----+
(1) | L+D | --> | L | --> | L+D | --> | N |
+-----+ +-----+ +-----+ +-----+
|
+-----+
('3) | D | <== This is a shadow of (3)
+-----+
|
+-----+
(5) | N |
+-----+
|
+-----+
(6) | N |
+-----+
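In outline, a shadow of the drain request is queued alongside the link (a
sketch; the flag and helper names are as this patch introduces them,
abbreviated):

	if (link && (sqe_flags & IOSQE_IO_DRAIN)) {
		/* Mirror the drain IO into the defer_list. */
		shadow_req = io_get_req(ctx, NULL);
		if (shadow_req)
			shadow_req->flags |= (REQ_F_IO_DRAIN |
					      REQ_F_SHADOW_DRAIN);
	}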
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sqo_thread will get the sqring in batches, which causes
ctx->cached_sq_head to be advanced in batches. If one of these
sqes is set with the DRAIN flag, it will never get a chance to be
processed, and eventually sqo_thread will not exit.
Fixes: de0617e4671 ("io_uring: add support for marking commands as draining")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When doing any form of incremental send the parent and the child trees
need to be compared via btrfs_compare_trees. This can result in long
loop chains without ever relinquishing the CPU. This causes softlockup
detector to trigger when comparing trees with a lot of items. Example
report:
watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
sp : ffff00001273b7e0
Call trace:
__ll_sc_arch_atomic_sub_return+0x14/0x20
release_extent_buffer+0xdc/0x120 [btrfs]
free_extent_buffer.part.0+0xb0/0x118 [btrfs]
free_extent_buffer+0x24/0x30 [btrfs]
btrfs_release_path+0x4c/0xa0 [btrfs]
btrfs_free_path.part.0+0x20/0x40 [btrfs]
btrfs_free_path+0x24/0x30 [btrfs]
get_inode_info+0xa8/0xf8 [btrfs]
finish_inode_if_needed+0xe0/0x6d8 [btrfs]
changed_cb+0x9c/0x410 [btrfs]
btrfs_compare_trees+0x284/0x648 [btrfs]
send_subvol+0x33c/0x520 [btrfs]
btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
btrfs_ioctl+0x199c/0x2288 [btrfs]
do_vfs_ioctl+0x4b0/0x820
ksys_ioctl+0x84/0xb8
__arm64_sys_ioctl+0x28/0x38
el0_svc_common.constprop.0+0x7c/0x188
el0_svc_handler+0x34/0x90
el0_svc+0x8/0xc
Fix this by adding a call to cond_resched at the beginning of the main
loop in btrfs_compare_trees.
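The fix in outline:

	/* in btrfs_compare_trees() */
	while (1) {
		cond_resched();
		/* ... compare the next items of the two trees ... */
	}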
Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those functions are simple boolean predicates; there is no need to assign
their return values to interim variables. Use them directly as
predicates. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a structure to encode the type and length for the known on-disk
checksums. This makes it easier to add new checksums later.
The structure and helpers are moved from ctree.h so they don't occupy
space in all files that include ctree.h. This saves some space in the
final object.
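The table plausibly looks like this (a sketch; crc32c is the only
checksum type defined at this point):

	static const struct btrfs_csums {
		u16		size;
		const char	*name;
	} btrfs_csums[] = {
		[BTRFS_CSUM_TYPE_CRC32] = { .size = 4, .name = "crc32c" },
	};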
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging weird enospc problems it's handy to be able to dump the
space info when we wake up all tickets, and see what the ticket values
are. This helped me figure out cases where we were enospc'ing when we
shouldn't have been.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a problem in production where a box with plenty of space was
getting wedged doing ENOSPC flushing. These boxes only had 20% of the
disk allocated, but their metadata space + global reserve was right at
the size of their metadata chunk.
In this case can_overcommit should be allowing allocations without
problem, but there's logic in can_overcommit that doesn't allow us to
overcommit if there's not enough real space to satisfy the global
reserve.
This is for historical reasons. Before, there were only certain places
we could allocate chunks. We could go to commit the transaction and not
have enough space for our pending delayed refs and such, and be unable to
allocate a new chunk. This would result in an abort because of ENOSPC.
This code was added to solve this problem.
However since then we've gained the ability to always be able to
allocate a chunk. So we can easily overcommit in these cases without
risking a transaction abort because of ENOSPC.
Also prior to now the global reserve really would be used because that's
the space we relied on for delayed refs. With delayed refs being
tracked separately we no longer have to worry about running out of
delayed refs space while committing. We are much less likely to
exhaust our global reserve space during transaction commit.
Fix the can_overcommit code to simply check whether our current usage plus
what we want is less than our current free space plus whatever slack space
we have in the disk. This solves the problem we were seeing in production
and keeps us from flushing as aggressively as we approach our actual
metadata size usage.
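Conceptually the new check is (a sketch; the 'avail' calculation based on
unallocated disk space is abbreviated):

	/* 'avail' is the slack the disk can still provide for metadata. */
	if (used + bytes < space_info->total_bytes + avail)
		return true;	/* overcommit is safe */
	return false;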
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have some annoying xfstests tests that will create a very small fs,
fill it up, delete it, and repeat to make sure everything works right.
This trips btrfs up sometimes because we may commit a transaction to
free space, but most of the free metadata space was being reserved by
the global reserve. So we commit and update the global reserve, but the
space is simply added to bytes_may_use directly, instead of trying to
add it to existing tickets. This results in ENOSPC when we really did
have space. Fix this by calling btrfs_try_granting_tickets once we add
back our excess space to wake any pending tickets.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While messing with the overcommit logic I noticed that sometimes we'd
ENOSPC out when really we should have run out of space much earlier. It
turns out it's because we'll only reserve up to the free amount left in
the space info for the global reserve, but that doesn't make sense with
overcommit because we could be well above our actual size. This results
in the global reserve not carving out its entire reservation, and thus
not putting enough pressure on the rest of the infrastructure to do the
right thing and ENOSPC out at a convenient time. Fix this by always
taking our full reservation amount for the global reserve.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It made sense to have the global reserve set at 16M in the past, but
since it is used less nowadays set the minimum size to the number of
items we'll need to update the main trees we update during a transaction
commit, plus some slop area so we can do unlinks if we need to.
In practice this doesn't affect normal file systems, but for xfstests,
where we do things like fill up a fs and then rm *, it can fall over in
weird ways. This enables more sane behavior at extremely small file
system sizes.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This name doesn't really fit with how the space reservation stuff works
now; rename it to btrfs_space_info_free_bytes_may_use so it's clear what
the function is doing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we do not do partial filling of tickets, simply remove
orig_bytes; it is no longer needed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we aren't partially filling tickets we may have some slack
space left in the space_info. We need to account for this in
may_commit_transaction, otherwise we may choose to not commit the
transaction despite it actually having enough space to satisfy our
ticket.
Calculate the free space we have in the space_info, if any, and subtract
this from the ticket we have and use that amount to determine if we will
need to commit to reclaim enough space.
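Conceptually (names illustrative, not the exact btrfs code):

	u64 bytes_needed = ticket->bytes;

	/* Subtract any slack already free in the space_info. */
	if (space_info->total_bytes > used)
		bytes_needed -= min(bytes_needed,
				    space_info->total_bytes - used);
	/* Only commit if reclaimable space can cover bytes_needed. */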
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we no longer partially fill tickets we need to rework
wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
if any subsequent tickets are able to be satisfied. If our tickets_id
changes we know something happened and we can keep flushing.
Also if we find a ticket that is smaller than the first ticket in our
queue then we want to retry the flushing loop again in case
may_commit_transaction() decides we could satisfy the ticket by
committing the transaction.
Rename this to maybe_fail_all_tickets() while we're at it, to better
reflect what the function is actually doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_space_info_add_old_bytes simply checks if we can make the
reservation and updates bytes_may_use, there's no reason to have both
helpers in place.
Factor out the ticket wakeup logic into its own helper, make
btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
with the wakeup helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>