2005-09-10 00:10:27 +04:00
/*
FUSE : Filesystem in Userspace
2008-11-26 14:03:54 +03:00
Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
2005-09-10 00:10:27 +04:00
This program can be distributed under the terms of the GNU GPL.
See the file COPYING.
*/
# include "fuse_i.h"
# include <linux/init.h>
# include <linux/module.h>
# include <linux/poll.h>
2017-02-02 21:15:33 +03:00
# include <linux/sched/signal.h>
2005-09-10 00:10:27 +04:00
# include <linux/uio.h>
# include <linux/miscdevice.h>
# include <linux/pagemap.h>
# include <linux/file.h>
# include <linux/slab.h>
fuse: support splice() writing to fuse device
Allow userspace filesystem implementation to use splice() to write to
the fuse device. The semantics of using splice() are:
1) buffer the message header and data in a temporary pipe
2) with a *single* splice() call move the message from the temporary pipe
to the fuse device
The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be transferred into the
fuse device without having to go through a userspace buffer. It will
also allow zero copy moving of pages.
One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header. But the length of
the data transferred into the temporary pipe may not be known in
advance. The current library implementation works around this by
using vmsplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):
struct fuse_out_header out;
iov.iov_base = &out;
iov.iov_len = sizeof(struct fuse_out_header);
vmsplice(pip[1], &iov, 1, 0);
len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
/* retrospectively modify the header: */
out.len = len + sizeof(struct fuse_out_header);
splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);
This works since vmsplice only saves a pointer to the data; it does
not copy the data itself.
Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).
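The size limit above is simple arithmetic, sketched here as a small userspace helper. The names `PIPE_PAGES`, `PAGE_BYTES` and `max_spliced_payload` are hypothetical and exist only for this illustration; the values are the ones quoted in the commit message, and the sketch assumes the header occupies exactly one pipe buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the commit message: 16 pipe buffers, 4 KiB pages. */
enum { PIPE_PAGES = 16, PAGE_BYTES = 4096 };

/*
 * Assumption for this sketch: the message header takes one pipe buffer,
 * so the payload can use at most the remaining buffers.
 */
static size_t max_spliced_payload(size_t pipe_pages, size_t page_bytes)
{
	return pipe_pages > 0 ? (pipe_pages - 1) * page_bytes : 0;
}
```

With the quoted values this gives 15 * 4096 = 61440 bytes, matching the 60kB figure.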
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 17:06:06 +04:00
# include <linux/pipe_fs_i.h>
2010-05-25 17:06:07 +04:00
# include <linux/swap.h>
# include <linux/splice.h>
2014-07-03 01:29:19 +04:00
# include <linux/sched.h>
2005-09-10 00:10:27 +04:00
MODULE_ALIAS_MISCDEV ( FUSE_MINOR ) ;
driver core: add devname module aliases to allow module on-demand auto-loading
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seem so
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also be useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single-instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless of whether they are ever used.
This facility is to hide the mess distros are creating with overly modularized
kernels, and just to hide that these modules are not compiled-in, and not to
paper over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-20 20:07:20 +04:00
MODULE_ALIAS("devname:fuse");
2005-09-10 00:10:27 +04:00
2018-09-11 13:11:56 +03:00
/* Ordinary requests have even IDs, while interrupt IDs are odd */
# define FUSE_INT_REQ_BIT (1ULL << 0)
# define FUSE_REQ_ID_STEP (1ULL << 1)
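A minimal userspace sketch of the ID scheme these two macros describe. The struct `id_counter` and the helpers `next_unique`, `interrupt_id` and `base_id` are hypothetical stand-ins written for this illustration. Forming the interrupt ID by OR-ing in the low bit is an assumption about code that is not part of this section.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_INT_REQ_BIT (1ULL << 0)
#define FUSE_REQ_ID_STEP (1ULL << 1)

/* Hypothetical stand-in for the counter kept in the input queue. */
struct id_counter {
	uint64_t reqctr;
};

/* Ordinary request IDs advance in steps of two, so they stay even. */
static uint64_t next_unique(struct id_counter *c)
{
	c->reqctr += FUSE_REQ_ID_STEP;
	return c->reqctr;
}

/* Assumed derivation: an interrupt reuses the request ID with bit 0 set. */
static uint64_t interrupt_id(uint64_t unique)
{
	return unique | FUSE_INT_REQ_BIT;
}

/* Masking bit 0 maps both IDs back to the same request, e.g. for hashing. */
static uint64_t base_id(uint64_t unique)
{
	return unique & ~FUSE_INT_REQ_BIT;
}
```

Because the low bit is masked off before hashing, a request and its interrupt land in the same hash bucket.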
2006-12-07 07:33:20 +03:00
static struct kmem_cache * fuse_req_cachep ;
2005-09-10 00:10:27 +04:00
2015-07-01 17:26:08 +03:00
static struct fuse_dev * fuse_get_dev ( struct file * file )
2005-09-10 00:10:27 +04:00
{
2006-04-11 09:54:55 +04:00
	/*
	 * Lockless access is OK, because file->private_data is set
	 * once during mount and is valid until the file is released.
	 */
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 00:07:29 +03:00
	return READ_ONCE(file->private_data);
2005-09-10 00:10:27 +04:00
}
2020-05-06 18:44:12 +03:00
static void fuse_request_init ( struct fuse_mount * fm , struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
	INIT_LIST_HEAD(&req->list);
2006-06-25 16:48:54 +04:00
	INIT_LIST_HEAD(&req->intr_entry);
2005-09-10 00:10:27 +04:00
	init_waitqueue_head(&req->waitq);
2017-03-03 12:04:04 +03:00
	refcount_set(&req->count, 1);
2015-07-01 17:26:01 +03:00
	__set_bit(FR_PENDING, &req->flags);
2020-05-06 18:44:12 +03:00
	req->fm = fm;
2005-09-10 00:10:27 +04:00
}
2020-05-06 18:44:12 +03:00
static struct fuse_req * fuse_request_alloc ( struct fuse_mount * fm , gfp_t flags )
2005-09-10 00:10:27 +04:00
{
2018-10-01 11:07:05 +03:00
struct fuse_req * req = kmem_cache_zalloc ( fuse_req_cachep , flags ) ;
2019-09-10 16:04:11 +03:00
if ( req )
2020-05-06 18:44:12 +03:00
fuse_request_init ( fm , req ) ;
2012-10-26 19:48:07 +04:00
2005-09-10 00:10:27 +04:00
return req ;
}
2012-10-26 19:48:07 +04:00
2019-09-10 16:04:11 +03:00
static void fuse_request_free ( struct fuse_req * req )
2018-10-01 11:07:06 +03:00
{
2005-09-10 00:10:27 +04:00
kmem_cache_free ( fuse_req_cachep , req ) ;
}
2019-09-10 16:04:11 +03:00
static void __fuse_get_request ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
2017-03-03 12:04:04 +03:00
	refcount_inc(&req->count);
2005-09-10 00:10:27 +04:00
}
/* Must be called with > 1 refcount */
static void __fuse_put_request ( struct fuse_req * req )
{
2017-03-03 12:04:04 +03:00
	refcount_dec(&req->count);
2005-09-10 00:10:27 +04:00
}
2015-01-06 12:45:35 +03:00
void fuse_set_initialized ( struct fuse_conn * fc )
{
	/* Make sure stores before this are seen on another CPU */
	smp_wmb();
	fc->initialized = 1;
}
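The write barrier here pairs with a read barrier on the consumer side. A rough userspace analogue, using C11 release/acquire atomics in place of the kernel's smp_wmb()/smp_rmb(), is sketched below. The struct `conn`, its `max_background` payload field and both helpers are hypothetical and only illustrate the publish pattern; they are not the kernel's types.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection state: a payload plus a "published" flag. */
struct conn {
	int max_background;      /* written before the flag is set */
	atomic_bool initialized; /* set last, with release ordering */
};

/* Writer: store the payload first, then publish with a release store. */
static void conn_set_initialized(struct conn *c, int max_bg)
{
	c->max_background = max_bg;
	atomic_store_explicit(&c->initialized, true, memory_order_release);
}

/* Reader: the acquire load ensures the payload read sees the writer's store. */
static bool conn_read(struct conn *c, int *out)
{
	if (!atomic_load_explicit(&c->initialized, memory_order_acquire))
		return false;
	*out = c->max_background;
	return true;
}
```

A reader that sees the flag set is guaranteed to also see every store made before the flag was published, which is the property the paired barriers provide.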
2013-03-21 18:02:28 +04:00
static bool fuse_block_alloc ( struct fuse_conn * fc , bool for_background )
{
	return !fc->initialized || (for_background && fc->blocked);
}
2018-07-26 17:13:11 +03:00
static void fuse_drop_waiting ( struct fuse_conn * fc )
{
2018-11-09 17:52:16 +03:00
	/*
	 * Lockless check of fc->connected is okay, because atomic_dec_and_test()
2021-06-04 04:46:17 +03:00
	 * provides a memory barrier matched with the one in fuse_wait_aborted()
2018-11-09 17:52:16 +03:00
	 * to ensure no wake-up is missed.
	 */
	if (atomic_dec_and_test(&fc->num_waiting) &&
	    !READ_ONCE(fc->connected)) {
2018-07-26 17:13:11 +03:00
		/* wake up aborters */
		wake_up_all(&fc->blocked_waitq);
	}
}
2020-04-20 18:59:34 +03:00
static void fuse_put_request ( struct fuse_req * req ) ;
2019-09-10 16:04:11 +03:00
2020-05-06 18:44:12 +03:00
static struct fuse_req * fuse_get_req ( struct fuse_mount * fm , bool for_background )
2005-09-10 00:10:27 +04:00
{
2020-05-06 18:44:12 +03:00
	struct fuse_conn *fc = fm->fc;
2006-04-11 09:54:59 +04:00
	struct fuse_req *req;
	int err;
2006-04-11 23:16:09 +04:00
	atomic_inc(&fc->num_waiting);
2013-03-21 18:02:28 +04:00
	if (fuse_block_alloc(fc, for_background)) {
		err = -EINTR;
2016-07-19 10:08:27 +03:00
		if (wait_event_killable_exclusive(fc->blocked_waitq,
				!fuse_block_alloc(fc, for_background)))
2013-03-21 18:02:28 +04:00
			goto out;
	}
2015-01-06 12:45:35 +03:00
	/* Matches smp_wmb() in fuse_set_initialized() */
	smp_rmb();
2006-04-11 09:54:59 +04:00
2006-06-25 16:48:50 +04:00
	err = -ENOTCONN;
	if (!fc->connected)
		goto out;
2015-07-01 17:25:57 +03:00
	err = -ECONNREFUSED;
	if (fc->conn_error)
		goto out;
2020-05-06 18:44:12 +03:00
	req = fuse_request_alloc(fm, GFP_KERNEL);
2006-04-11 23:16:09 +04:00
	err = -ENOMEM;
2013-03-21 18:02:36 +04:00
	if (!req) {
		if (for_background)
			wake_up(&fc->blocked_waitq);
2006-04-11 23:16:09 +04:00
		goto out;
2013-03-21 18:02:36 +04:00
	}
2005-09-10 00:10:27 +04:00
2018-02-21 20:18:07 +03:00
	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
2018-02-21 19:52:06 +03:00
	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
2015-07-01 17:25:58 +03:00
	__set_bit(FR_WAITING, &req->flags);
	if (for_background)
		__set_bit(FR_BACKGROUND, &req->flags);
2018-02-21 19:52:06 +03:00
	if (unlikely(req->in.h.uid == ((uid_t)-1) ||
		     req->in.h.gid == ((gid_t)-1))) {
2020-04-20 18:59:34 +03:00
		fuse_put_request(req);
2018-02-21 19:52:06 +03:00
		return ERR_PTR(-EOVERFLOW);
	}
2005-09-10 00:10:27 +04:00
	return req;
2006-04-11 23:16:09 +04:00
 out:
2018-07-26 17:13:11 +03:00
	fuse_drop_waiting(fc);
2006-04-11 23:16:09 +04:00
	return ERR_PTR(err);
2005-09-10 00:10:27 +04:00
}
2013-03-21 18:02:04 +04:00
2020-04-20 18:59:34 +03:00
static void fuse_put_request ( struct fuse_req * req )
2006-02-05 10:27:40 +03:00
{
2020-05-06 18:44:12 +03:00
	struct fuse_conn *fc = req->fm->fc;
2020-04-20 18:59:34 +03:00
2017-03-03 12:04:04 +03:00
	if (refcount_dec_and_test(&req->count)) {
2015-07-01 17:25:58 +03:00
		if (test_bit(FR_BACKGROUND, &req->flags)) {
2013-03-21 18:02:36 +04:00
			/*
			 * We get here in the unlikely case that a background
			 * request was allocated but not sent
			 */
2018-08-27 18:29:46 +03:00
			spin_lock(&fc->bg_lock);
2013-03-21 18:02:36 +04:00
			if (!fc->blocked)
				wake_up(&fc->blocked_waitq);
2018-08-27 18:29:46 +03:00
			spin_unlock(&fc->bg_lock);
2013-03-21 18:02:36 +04:00
		}
2015-07-01 17:25:58 +03:00
		if (test_bit(FR_WAITING, &req->flags)) {
			__clear_bit(FR_WAITING, &req->flags);
2018-07-26 17:13:11 +03:00
			fuse_drop_waiting(fc);
2015-07-01 17:25:56 +03:00
		}
2006-06-25 16:48:52 +04:00
2019-09-10 16:04:08 +03:00
		fuse_request_free(req);
2006-02-05 10:27:40 +03:00
	}
}
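The get/put discipline used for requests can be sketched in userspace with C11 atomics. Everything below (`struct obj`, `obj_init`, `obj_get`, `obj_put`) is a hypothetical illustration of the "last put frees" pattern, not the kernel's refcount_t API. The `freed` flag stands in for the actual release so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; 'freed' records the final put. */
struct obj {
	atomic_uint count;
	bool freed;
};

/* A new object starts with one reference held by its creator. */
static void obj_init(struct obj *o)
{
	atomic_init(&o->count, 1);
	o->freed = false;
}

/* Take an extra reference; the caller must already hold one. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add_explicit(&o->count, 1, memory_order_relaxed);
}

/* Drop a reference; returns true only for the put that reached zero. */
static bool obj_put(struct obj *o)
{
	if (atomic_fetch_sub_explicit(&o->count, 1, memory_order_acq_rel) == 1) {
		o->freed = true; /* a real implementation would release here */
		return true;
	}
	return false;
}
```

Only the caller whose put drops the count to zero performs the cleanup, which is why the release work sits inside the dec-and-test branch.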
2018-06-21 11:34:25 +03:00
unsigned int fuse_len_args ( unsigned int numargs , struct fuse_arg * args )
2008-02-06 12:38:39 +03:00
{
	unsigned nbytes = 0;
	unsigned i;
	for (i = 0; i < numargs; i++)
		nbytes += args[i].size;
	return nbytes;
}
2018-06-21 11:34:25 +03:00
EXPORT_SYMBOL_GPL ( fuse_len_args ) ;
2008-02-06 12:38:39 +03:00
2018-06-22 15:48:30 +03:00
u64 fuse_get_unique ( struct fuse_iqueue * fiq )
2008-02-06 12:38:39 +03:00
{
2018-09-11 13:11:56 +03:00
	fiq->reqctr += FUSE_REQ_ID_STEP;
	return fiq->reqctr;
2008-02-06 12:38:39 +03:00
}
2018-06-22 15:48:30 +03:00
EXPORT_SYMBOL_GPL ( fuse_get_unique ) ;
2008-02-06 12:38:39 +03:00
2018-09-11 13:12:14 +03:00
static unsigned int fuse_req_hash ( u64 unique )
{
	return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
}
2018-06-18 17:53:19 +03:00
/**
 * A new request is available, wake fiq->waitq
 */
static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->lock)
{
	wake_up(&fiq->waitq);
	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
	spin_unlock(&fiq->lock);
}
const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
	.wake_forget_and_unlock		= fuse_dev_wake_and_unlock,
	.wake_interrupt_and_unlock	= fuse_dev_wake_and_unlock,
	.wake_pending_and_unlock	= fuse_dev_wake_and_unlock,
};
EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
static void queue_request_and_unlock(struct fuse_iqueue *fiq,
				     struct fuse_req *req)
__releases(fiq->lock)
2008-02-06 12:38:39 +03:00
{
	req->in.h.len = sizeof(struct fuse_in_header) +
2018-06-21 11:34:25 +03:00
		fuse_len_args(req->args->in_numargs,
			      (struct fuse_arg *) req->args->in_args);
2015-07-01 17:26:01 +03:00
	list_add_tail(&req->list, &fiq->pending);
2018-06-18 17:53:19 +03:00
	fiq->ops->wake_pending_and_unlock(fiq);
2008-02-06 12:38:39 +03:00
}
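The length written into the request header is the header size plus the sum of the argument sizes. A small userspace sketch of that computation follows. `struct arg`, `len_args` and `message_len` are hypothetical mirrors written for this illustration, and the header size is passed in as a plain number rather than taken from the real `struct fuse_in_header`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of an argument descriptor: a size and a payload pointer. */
struct arg {
	unsigned size;
	const void *value;
};

/* Sum the sizes of all arguments, as fuse_len_args() does above. */
static unsigned len_args(unsigned numargs, const struct arg *args)
{
	unsigned nbytes = 0;

	for (unsigned i = 0; i < numargs; i++)
		nbytes += args[i].size;
	return nbytes;
}

/* Total message length: fixed header followed by all argument payloads. */
static unsigned message_len(unsigned header_size, unsigned numargs,
			    const struct arg *args)
{
	return header_size + len_args(numargs, args);
}
```

For example, a 40-byte header with arguments of 8 and 5 bytes yields a total length of 53 bytes.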
2010-12-07 22:16:56 +03:00
void fuse_queue_forget ( struct fuse_conn * fc , struct fuse_forget_link * forget ,
u64 nodeid , u64 nlookup )
{
2015-07-01 17:26:01 +03:00
	struct fuse_iqueue *fiq = &fc->iq;
2010-12-07 22:16:56 +03:00
	forget->forget_one.nodeid = nodeid;
	forget->forget_one.nlookup = nlookup;
2010-12-07 22:16:56 +03:00
fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
This may have to wait for fuse_iqueue::waitq.lock to be released by one
of many places that take it with IRQs enabled. Since the IRQ handler
may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
Fix it by protecting the state of struct fuse_iqueue with a separate
spinlock, and only accessing fuse_iqueue::waitq using the versions of
the waitqueue functions which do IRQ-safe locking internally.
Reproducer:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>
int main()
{
char opts[128];
int fd = open("/dev/fuse", O_RDWR);
aio_context_t ctx = 0;
struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
struct iocb *cbp = &cb;
sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
mkdir("mnt", 0700);
mount("foo", "mnt", "fuse", 0, opts);
syscall(__NR_io_setup, 1, &ctx);
syscall(__NR_io_submit, ctx, 1, &cbp);
}
Beginning of lockdep output:
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.3.0-rc5 #9 Not tainted
-----------------------------------------------------
syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
and this task is already holding:
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
which would create a new lock dependency:
(&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&(&ctx->ctx_lock)->rlock){..-.}
[...]
Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.19+
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-09 06:15:18 +03:00
	spin_lock(&fiq->lock);
2015-07-01 17:26:01 +03:00
	if (fiq->connected) {
2015-07-01 17:26:01 +03:00
		fiq->forget_list_tail->next = forget;
		fiq->forget_list_tail = forget;
2018-06-18 17:53:19 +03:00
		fiq->ops->wake_forget_and_unlock(fiq);
2011-09-12 11:38:03 +04:00
	} else {
		kfree(forget);
2018-06-18 17:53:19 +03:00
		spin_unlock(&fiq->lock);
2011-09-12 11:38:03 +04:00
	}
2010-12-07 22:16:56 +03:00
}
2008-02-06 12:38:39 +03:00
static void flush_bg_queue ( struct fuse_conn * fc )
{
2018-07-31 13:25:25 +03:00
	struct fuse_iqueue *fiq = &fc->iq;
2009-07-02 04:28:41 +04:00
	while (fc->active_background < fc->max_background &&
2008-02-06 12:38:39 +03:00
	       !list_empty(&fc->bg_queue)) {
		struct fuse_req *req;
2018-07-31 13:25:25 +03:00
		req = list_first_entry(&fc->bg_queue, struct fuse_req, list);
2008-02-06 12:38:39 +03:00
		list_del(&req->list);
		fc->active_background++;
2019-09-09 06:15:18 +03:00
		spin_lock(&fiq->lock);
2015-07-01 17:26:01 +03:00
		req->in.h.unique = fuse_get_unique(fiq);
2018-06-18 17:53:19 +03:00
		queue_request_and_unlock(fiq, req);
2008-02-06 12:38:39 +03:00
	}
}
2005-09-10 00:10:27 +04:00
/*
 * This function is called when a request is finished.  Either a reply
2006-06-25 16:48:53 +04:00
 * has arrived or it was aborted (and not yet sent) or some error
2006-01-17 09:14:26 +03:00
 * occurred during communication with userspace, or the device file
2006-06-25 16:48:50 +04:00
 * was closed.  The requester thread is woken up (if still waiting),
 * the 'end' callback is called if given, else the reference to the
 * request is released
2005-09-10 00:10:27 +04:00
 */
2020-04-20 18:59:34 +03:00
void fuse_request_end ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
2020-05-06 18:44:12 +03:00
	struct fuse_mount *fm = req->fm;
	struct fuse_conn *fc = fm->fc;
2015-07-01 17:26:02 +03:00
	struct fuse_iqueue *fiq = &fc->iq;
2015-07-01 17:26:06 +03:00
2015-07-01 17:26:07 +03:00
	if (test_and_set_bit(FR_FINISHED, &req->flags))
2018-07-26 17:13:11 +03:00
		goto put_request;
2019-10-21 10:11:40 +03:00
2018-11-08 12:05:25 +03:00
	/*
	 * test_and_set_bit() implies smp_mb() between bit
2021-08-04 14:22:58 +03:00
	 * changing and below FR_INTERRUPTED check. Pairs with
2018-11-08 12:05:25 +03:00
	 * smp_mb() from queue_interrupt().
	 */
2021-08-04 14:22:58 +03:00
	if (test_bit(FR_INTERRUPTED, &req->flags)) {
2019-09-09 06:15:18 +03:00
		spin_lock(&fiq->lock);
2018-11-08 12:05:25 +03:00
		list_del_init(&req->intr_entry);
2019-09-09 06:15:18 +03:00
		spin_unlock(&fiq->lock);
2018-11-08 12:05:25 +03:00
	}
2015-07-01 17:26:01 +03:00
	WARN_ON(test_bit(FR_PENDING, &req->flags));
	WARN_ON(test_bit(FR_SENT, &req->flags));
2015-07-01 17:25:58 +03:00
	if (test_bit(FR_BACKGROUND, &req->flags)) {
2018-08-27 18:29:46 +03:00
		spin_lock(&fc->bg_lock);
2015-07-01 17:25:58 +03:00
		clear_bit(FR_BACKGROUND, &req->flags);
2018-09-28 17:43:22 +03:00
		if (fc->num_background == fc->max_background) {
2006-06-25 16:48:50 +04:00
			fc->blocked = 0;
2013-03-21 18:02:36 +04:00
			wake_up(&fc->blocked_waitq);
2018-09-28 17:43:22 +03:00
		} else if (!fc->blocked) {
			/*
			 * Wake up next waiter, if any.  It's okay to use
			 * waitqueue_active(), as we've already synced up
			 * fc->blocked with waiters with the wake_up() call
			 * above.
			 */
			if (waitqueue_active(&fc->blocked_waitq))
				wake_up(&fc->blocked_waitq);
		}
2013-03-21 18:02:36 +04:00
2006-06-25 16:48:50 +04:00
		fc->num_background--;
2008-02-06 12:38:39 +03:00
		fc->active_background--;
		flush_bg_queue(fc);
2018-08-27 18:29:46 +03:00
		spin_unlock(&fc->bg_lock);
2018-11-08 12:05:31 +03:00
	} else {
		/* Wake up waiter sleeping in request_wait_answer() */
		wake_up(&req->waitq);
2005-09-10 00:10:27 +04:00
	}
2018-11-08 12:05:31 +03:00
2020-02-13 11:16:07 +03:00
	if (test_bit(FR_ASYNC, &req->flags))
2020-05-06 18:44:12 +03:00
		req->args->end(fm, req->args, req->out.h.error);
2018-07-26 17:13:11 +03:00
put_request:
2020-04-20 18:59:34 +03:00
	fuse_put_request(req);
2005-09-10 00:10:27 +04:00
}
2018-06-21 11:33:40 +03:00
EXPORT_SYMBOL_GPL ( fuse_request_end ) ;
2005-09-10 00:10:27 +04:00
2020-04-20 18:59:34 +03:00
static int queue_interrupt ( struct fuse_req * req )
2006-06-25 16:48:54 +04:00
{
2020-05-06 18:44:12 +03:00
	struct fuse_iqueue *fiq = &req->fm->fc->iq;
2020-04-20 18:59:34 +03:00
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2018-11-08 12:05:42 +03:00
/* Check that we've sent a request to interrupt this req */
if ( unlikely ( ! test_bit ( FR_INTERRUPTED , & req - > flags ) ) ) {
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2018-11-08 12:05:42 +03:00
return - EINVAL ;
}
2015-07-01 17:26:03 +03:00
if ( list_empty ( & req - > intr_entry ) ) {
list_add_tail ( & req - > intr_entry , & fiq - > interrupts ) ;
2018-11-08 12:05:25 +03:00
/*
* Pairs with smp_mb() implied by test_and_set_bit()
2020-02-28 15:15:24 +03:00
* from fuse_request_end().
2018-11-08 12:05:25 +03:00
*/
smp_mb ( ) ;
if ( test_bit ( FR_FINISHED , & req - > flags ) ) {
list_del_init ( & req - > intr_entry ) ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2018-11-08 12:05:42 +03:00
return 0 ;
2018-11-08 12:05:25 +03:00
}
2018-06-18 17:53:19 +03:00
fiq - > ops - > wake_interrupt_and_unlock ( fiq ) ;
} else {
spin_unlock ( & fiq - > lock ) ;
2015-07-01 17:26:03 +03:00
}
2018-11-08 12:05:42 +03:00
return 0 ;
2006-06-25 16:48:54 +04:00
}
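The publish, barrier, re-check sequence in queue_interrupt() can be illustrated in userspace with C11 atomics. This is only a sketch under stated assumptions: `struct sketch_req`, `sketch_queue_interrupt()` and `sketch_request_end()` are hypothetical names, the two flags stand in for membership on `fiq->interrupts` and for `FR_FINISHED`, and the completion side is an assumed counterpart rather than the kernel's fuse_request_end(). The spinlock and the real list are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request; not the kernel's struct fuse_req. */
struct sketch_req {
	atomic_bool queued;   /* models "intr_entry is on fiq->interrupts" */
	atomic_bool finished; /* models FR_FINISHED */
};

/*
 * Interrupt side: publish the entry, issue a full fence, then re-check
 * whether the request finished in the meantime. Returns true if the
 * interrupt stays queued, false if it was withdrawn again.
 */
static bool sketch_queue_interrupt(struct sketch_req *req)
{
	atomic_store_explicit(&req->queued, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* plays the role of smp_mb() */
	if (atomic_load_explicit(&req->finished, memory_order_relaxed)) {
		atomic_store_explicit(&req->queued, false, memory_order_relaxed);
		return false;
	}
	return true;
}

/*
 * Assumed completion side: mark the request finished, issue a full fence,
 * then report whether an interrupt entry is visible and needs cleanup.
 */
static bool sketch_request_end(struct sketch_req *req)
{
	atomic_store_explicit(&req->finished, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&req->queued, memory_order_relaxed);
}
```

With a full fence on both sides, at least one of the two paths observes the other's store, so a queued interrupt for an already finished request is noticed by one of them.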
2020-04-20 18:59:34 +03:00
static void request_wait_answer ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
2020-05-06 18:44:12 +03:00
struct fuse_conn * fc = req - > fm - > fc ;
2015-07-01 17:26:02 +03:00
struct fuse_iqueue * fiq = & fc - > iq ;
2015-07-01 17:26:00 +03:00
int err ;
2006-06-25 16:48:54 +04:00
if ( ! fc - > no_interrupt ) {
/* Any signal may interrupt this */
2015-07-01 17:26:00 +03:00
err = wait_event_interruptible ( req - > waitq ,
2015-07-01 17:26:01 +03:00
test_bit ( FR_FINISHED , & req - > flags ) ) ;
2015-07-01 17:26:00 +03:00
if ( ! err )
2006-06-25 16:48:54 +04:00
return ;
2015-07-01 17:25:58 +03:00
set_bit ( FR_INTERRUPTED , & req - > flags ) ;
2015-07-01 17:26:03 +03:00
/* matches barrier in fuse_dev_do_read() */
smp_mb__after_atomic ( ) ;
2015-07-01 17:26:01 +03:00
if ( test_bit ( FR_SENT , & req - > flags ) )
2020-04-20 18:59:34 +03:00
queue_interrupt ( req ) ;
2006-06-25 16:48:54 +04:00
}
2015-07-01 17:25:58 +03:00
if ( ! test_bit ( FR_FORCE , & req - > flags ) ) {
2006-06-25 16:48:54 +04:00
/* Only fatal signals may interrupt this */
2016-07-19 10:08:27 +03:00
err = wait_event_killable ( req - > waitq ,
2015-07-01 17:26:01 +03:00
test_bit ( FR_FINISHED , & req - > flags ) ) ;
2015-07-01 17:26:00 +03:00
if ( ! err )
2007-10-17 10:31:04 +04:00
return ;
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2007-10-17 10:31:04 +04:00
/* Request is not yet in userspace, bail out */
2015-07-01 17:26:01 +03:00
if ( test_bit ( FR_PENDING , & req - > flags ) ) {
2007-10-17 10:31:04 +04:00
list_del ( & req - > list ) ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2007-10-17 10:31:04 +04:00
__fuse_put_request ( req ) ;
req - > out . h . error = - EINTR ;
return ;
}
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2006-06-25 16:48:50 +04:00
}
2005-09-10 00:10:27 +04:00
2007-10-17 10:31:04 +04:00
/*
* Either request is already in userspace, or it was forced.
* Wait it out.
*/
2015-07-01 17:26:01 +03:00
wait_event ( req - > waitq , test_bit ( FR_FINISHED , & req - > flags ) ) ;
2005-09-10 00:10:27 +04:00
}
2020-04-20 18:59:34 +03:00
static void __fuse_request_send ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
2020-05-06 18:44:12 +03:00
struct fuse_iqueue * fiq = & req - > fm - > fc - > iq ;
2015-07-01 17:26:01 +03:00
2015-07-01 17:25:58 +03:00
BUG_ON ( test_bit ( FR_BACKGROUND , & req - > flags ) ) ;
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2015-07-01 17:26:01 +03:00
if ( ! fiq - > connected ) {
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2005-09-10 00:10:27 +04:00
req - > out . h . error = - ENOTCONN ;
2015-07-01 17:26:00 +03:00
} else {
2015-07-01 17:26:01 +03:00
req - > in . h . unique = fuse_get_unique ( fiq ) ;
2005-09-10 00:10:27 +04:00
/* acquire extra reference, since request is still needed
2018-06-21 11:33:40 +03:00
after fuse_request_end() */
2005-09-10 00:10:27 +04:00
__fuse_get_request ( req ) ;
2018-06-18 17:53:19 +03:00
queue_request_and_unlock ( fiq , req ) ;
2005-09-10 00:10:27 +04:00
2020-04-20 18:59:34 +03:00
request_wait_answer ( req ) ;
2018-06-21 11:33:40 +03:00
/* Pairs with smp_wmb() in fuse_request_end() */
2015-07-01 17:26:00 +03:00
smp_rmb ( ) ;
2005-09-10 00:10:27 +04:00
}
}
2013-02-04 17:04:44 +04:00
2015-01-06 12:45:35 +03:00
static void fuse_adjust_compat ( struct fuse_conn * fc , struct fuse_args * args )
{
2019-09-10 16:04:08 +03:00
if ( fc - > minor < 4 & & args - > opcode = = FUSE_STATFS )
args - > out_args [ 0 ] . size = FUSE_COMPAT_STATFS_SIZE ;
2015-01-06 12:45:35 +03:00
if ( fc - > minor < 9 ) {
2019-09-10 16:04:08 +03:00
switch ( args - > opcode ) {
2015-01-06 12:45:35 +03:00
case FUSE_LOOKUP :
case FUSE_CREATE :
case FUSE_MKNOD :
case FUSE_MKDIR :
case FUSE_SYMLINK :
case FUSE_LINK :
2019-09-10 16:04:08 +03:00
args - > out_args [ 0 ] . size = FUSE_COMPAT_ENTRY_OUT_SIZE ;
2015-01-06 12:45:35 +03:00
break ;
case FUSE_GETATTR :
case FUSE_SETATTR :
2019-09-10 16:04:08 +03:00
args - > out_args [ 0 ] . size = FUSE_COMPAT_ATTR_OUT_SIZE ;
2015-01-06 12:45:35 +03:00
break ;
}
}
if ( fc - > minor < 12 ) {
2019-09-10 16:04:08 +03:00
switch ( args - > opcode ) {
2015-01-06 12:45:35 +03:00
case FUSE_CREATE :
2019-09-10 16:04:08 +03:00
args - > in_args [ 0 ] . size = sizeof ( struct fuse_open_in ) ;
2015-01-06 12:45:35 +03:00
break ;
case FUSE_MKNOD :
2019-09-10 16:04:08 +03:00
args - > in_args [ 0 ] . size = FUSE_COMPAT_MKNOD_IN_SIZE ;
2015-01-06 12:45:35 +03:00
break ;
}
}
}
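The version-gated size selection in fuse_adjust_compat() can be modelled outside the kernel as a pure function. The sketch below is illustrative only: the `SK_*` opcodes and the `SK_COMPAT_*` sizes are hypothetical placeholders, not the real `FUSE_*` opcodes or `FUSE_COMPAT_*` values, and only the reply-size half of the logic is shown.

```c
#include <stddef.h>

/* Hypothetical opcodes, loosely mirroring a few FUSE opcodes. */
enum sketch_opcode { SK_LOOKUP, SK_GETATTR, SK_STATFS, SK_READ };

/* Placeholder sizes for illustration; NOT the kernel's FUSE_COMPAT_* values. */
#define SK_COMPAT_STATFS_SIZE 10
#define SK_COMPAT_ENTRY_SIZE  20
#define SK_COMPAT_ATTR_SIZE   30

/*
 * Return the reply size to expect from a server speaking protocol minor
 * version `minor`. Older servers get the smaller legacy layout; anything
 * not listed keeps `current_size`.
 */
static size_t sketch_reply_size(unsigned minor, enum sketch_opcode op,
				size_t current_size)
{
	size_t size = current_size;

	if (minor < 4 && op == SK_STATFS)
		size = SK_COMPAT_STATFS_SIZE;

	if (minor < 9) {
		switch (op) {
		case SK_LOOKUP:
			size = SK_COMPAT_ENTRY_SIZE;
			break;
		case SK_GETATTR:
			size = SK_COMPAT_ATTR_SIZE;
			break;
		default:
			break;
		}
	}
	return size;
}
```

Keeping the thresholds in one place, as the original function does, means callers never need to know which protocol revision changed which structure.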
2020-04-20 18:59:34 +03:00
static void fuse_force_creds ( struct fuse_req * req )
2019-09-10 16:04:08 +03:00
{
2020-05-06 18:44:12 +03:00
struct fuse_conn * fc = req - > fm - > fc ;
2020-04-20 18:59:34 +03:00
2019-09-10 16:04:08 +03:00
req - > in . h . uid = from_kuid_munged ( fc - > user_ns , current_fsuid ( ) ) ;
req - > in . h . gid = from_kgid_munged ( fc - > user_ns , current_fsgid ( ) ) ;
req - > in . h . pid = pid_nr_ns ( task_pid ( current ) , fc - > pid_ns ) ;
}
2019-09-23 08:52:31 +03:00
static void fuse_args_to_req ( struct fuse_req * req , struct fuse_args * args )
2019-09-10 16:04:09 +03:00
{
req - > in . h . opcode = args - > opcode ;
req - > in . h . nodeid = args - > nodeid ;
2019-09-10 16:04:11 +03:00
req - > args = args ;
2020-02-13 11:16:07 +03:00
if ( args - > end )
__set_bit ( FR_ASYNC , & req - > flags ) ;
2019-09-10 16:04:09 +03:00
}
2020-05-06 18:44:12 +03:00
ssize_t fuse_simple_request ( struct fuse_mount * fm , struct fuse_args * args )
2014-12-12 11:49:05 +03:00
{
2020-05-06 18:44:12 +03:00
struct fuse_conn * fc = fm - > fc ;
2014-12-12 11:49:05 +03:00
struct fuse_req * req ;
ssize_t ret ;
2019-09-10 16:04:08 +03:00
if ( args - > force ) {
2019-09-10 16:04:08 +03:00
atomic_inc ( & fc - > num_waiting ) ;
2020-05-06 18:44:12 +03:00
req = fuse_request_alloc ( fm , GFP_KERNEL | __GFP_NOFAIL ) ;
2019-09-10 16:04:08 +03:00
if ( ! args - > nocreds )
2020-04-20 18:59:34 +03:00
fuse_force_creds ( req ) ;
2019-09-10 16:04:08 +03:00
__set_bit ( FR_WAITING , & req - > flags ) ;
2019-09-10 16:04:08 +03:00
__set_bit ( FR_FORCE , & req - > flags ) ;
} else {
2019-09-10 16:04:08 +03:00
WARN_ON ( args - > nocreds ) ;
2020-05-06 18:44:12 +03:00
req = fuse_get_req ( fm , false ) ;
2019-09-10 16:04:08 +03:00
if ( IS_ERR ( req ) )
return PTR_ERR ( req ) ;
}
2014-12-12 11:49:05 +03:00
2015-01-06 12:45:35 +03:00
/* Needs to be done after fuse_get_req() so that fc->minor is valid */
fuse_adjust_compat ( fc , args ) ;
2019-09-10 16:04:09 +03:00
fuse_args_to_req ( req , args ) ;
2015-01-06 12:45:35 +03:00
2019-09-10 16:04:08 +03:00
if ( ! args - > noreply )
__set_bit ( FR_ISREPLY , & req - > flags ) ;
2020-04-20 18:59:34 +03:00
__fuse_request_send ( req ) ;
2014-12-12 11:49:05 +03:00
ret = req - > out . h . error ;
2019-09-10 16:04:08 +03:00
if ( ! ret & & args - > out_argvar ) {
2019-09-10 16:04:09 +03:00
BUG_ON ( args - > out_numargs = = 0 ) ;
2019-09-10 16:04:11 +03:00
ret = args - > out_args [ args - > out_numargs - 1 ] . size ;
2014-12-12 11:49:05 +03:00
}
2020-04-20 18:59:34 +03:00
fuse_put_request ( req ) ;
2014-12-12 11:49:05 +03:00
return ret ;
}
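The return-value convention of fuse_simple_request() (the error code, or on success with a variable-length reply the size of the last output argument) can be shown in isolation. A minimal sketch, assuming hypothetical names `struct sketch_out_arg` and `sketch_request_result()`; `assert()` stands in for the kernel's `BUG_ON()`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical stand-in for one output argument descriptor. */
struct sketch_out_arg {
	size_t size;
};

/*
 * Compute what the caller gets back: the (zero or negative) error, or,
 * when the request succeeded and the reply is variable-length, the size
 * of the last output argument.
 */
static ssize_t sketch_request_result(int error, int out_argvar,
				     const struct sketch_out_arg *out_args,
				     size_t out_numargs)
{
	if (error || !out_argvar)
		return error;

	assert(out_numargs > 0); /* mirrors BUG_ON(args->out_numargs == 0) */
	return (ssize_t)out_args[out_numargs - 1].size;
}
```

A caller can therefore treat any negative result as an error and any non-negative result as a byte count, without a separate out-parameter.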
2020-04-20 18:59:34 +03:00
static bool fuse_request_queue_background ( struct fuse_req * req )
2008-02-06 12:38:39 +03:00
{
2020-05-06 18:44:12 +03:00
struct fuse_mount * fm = req - > fm ;
struct fuse_conn * fc = fm - > fc ;
2018-08-27 18:29:56 +03:00
bool queued = false ;
WARN_ON ( ! test_bit ( FR_BACKGROUND , & req - > flags ) ) ;
2015-07-01 17:25:58 +03:00
if ( ! test_bit ( FR_WAITING , & req - > flags ) ) {
__set_bit ( FR_WAITING , & req - > flags ) ;
2015-07-01 17:25:56 +03:00
atomic_inc ( & fc - > num_waiting ) ;
}
2015-07-01 17:25:58 +03:00
__set_bit ( FR_ISREPLY , & req - > flags ) ;
2018-08-27 18:29:46 +03:00
spin_lock ( & fc - > bg_lock ) ;
2018-08-27 18:29:56 +03:00
if ( likely ( fc - > connected ) ) {
fc - > num_background + + ;
if ( fc - > num_background = = fc - > max_background )
fc - > blocked = 1 ;
list_add_tail ( & req - > list , & fc - > bg_queue ) ;
flush_bg_queue ( fc ) ;
queued = true ;
2008-02-06 12:38:39 +03:00
}
2018-08-27 18:29:46 +03:00
spin_unlock ( & fc - > bg_lock ) ;
2018-08-27 18:29:56 +03:00
return queued ;
2008-02-06 12:38:39 +03:00
}
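The accounting done by fuse_request_queue_background() under `fc->bg_lock` reduces to a counter with a threshold. The following sketch is a simplification under stated assumptions: `struct sketch_conn` and `sketch_queue_background()` are hypothetical, and the lock, the request list, and the flush step are omitted.

```c
#include <stdbool.h>

/* Hypothetical subset of the connection state used for throttling. */
struct sketch_conn {
	bool connected;
	unsigned num_background;
	unsigned max_background;
	bool blocked;
};

/*
 * Account for one more background request. Returns false without touching
 * the counters when the connection is gone; otherwise bumps the counter
 * and marks the connection blocked once the limit is reached.
 */
static bool sketch_queue_background(struct sketch_conn *fc)
{
	if (!fc->connected)
		return false;

	fc->num_background++;
	if (fc->num_background == fc->max_background)
		fc->blocked = true;
	return true;
}
```

Note that, as in the original, reaching the limit does not reject the request that hits it; it only sets the flag that is meant to hold back later submitters.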
2020-05-06 18:44:12 +03:00
int fuse_simple_background ( struct fuse_mount * fm , struct fuse_args * args ,
2019-09-10 16:04:10 +03:00
gfp_t gfp_flags )
{
struct fuse_req * req ;
if ( args - > force ) {
WARN_ON ( ! args - > nocreds ) ;
2020-05-06 18:44:12 +03:00
req = fuse_request_alloc ( fm , gfp_flags ) ;
2019-09-10 16:04:10 +03:00
if ( ! req )
return - ENOMEM ;
__set_bit ( FR_BACKGROUND , & req - > flags ) ;
} else {
WARN_ON ( args - > nocreds ) ;
2020-05-06 18:44:12 +03:00
req = fuse_get_req ( fm , true ) ;
2019-09-10 16:04:10 +03:00
if ( IS_ERR ( req ) )
return PTR_ERR ( req ) ;
}
fuse_args_to_req ( req , args ) ;
2020-04-20 18:59:34 +03:00
if ( ! fuse_request_queue_background ( req ) ) {
fuse_put_request ( req ) ;
2019-09-10 16:04:10 +03:00
return - ENOTCONN ;
}
return 0 ;
}
EXPORT_SYMBOL_GPL ( fuse_simple_background ) ;
2020-05-06 18:44:12 +03:00
static int fuse_simple_notify_reply ( struct fuse_mount * fm ,
2019-09-10 16:04:11 +03:00
struct fuse_args * args , u64 unique )
2010-07-12 16:41:40 +04:00
{
2019-09-10 16:04:11 +03:00
struct fuse_req * req ;
2020-05-06 18:44:12 +03:00
struct fuse_iqueue * fiq = & fm - > fc - > iq ;
2019-09-10 16:04:11 +03:00
int err = 0 ;
2020-05-06 18:44:12 +03:00
req = fuse_get_req ( fm , false ) ;
2019-09-10 16:04:11 +03:00
if ( IS_ERR ( req ) )
return PTR_ERR ( req ) ;
2010-07-12 16:41:40 +04:00
2015-07-01 17:25:58 +03:00
__clear_bit ( FR_ISREPLY , & req - > flags ) ;
2010-07-12 16:41:40 +04:00
req - > in . h . unique = unique ;
2019-09-10 16:04:11 +03:00
fuse_args_to_req ( req , args ) ;
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2015-07-01 17:26:01 +03:00
if ( fiq - > connected ) {
2018-06-18 17:53:19 +03:00
queue_request_and_unlock ( fiq , req ) ;
2019-09-10 16:04:11 +03:00
} else {
err = - ENODEV ;
spin_unlock ( & fiq - > lock ) ;
2020-04-20 18:59:34 +03:00
fuse_put_request ( req ) ;
2010-07-12 16:41:40 +04:00
}
return err ;
}
2005-09-10 00:10:27 +04:00
/*
* Lock the request . Up to the next unlock_request ( ) there mustn ' t be
* anything that could cause a page - fault . If the request was already
2006-06-25 16:48:53 +04:00
* aborted bail out .
2005-09-10 00:10:27 +04:00
*/
2015-07-01 17:25:58 +03:00
static int lock_request ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
int err = 0 ;
if ( req ) {
2015-07-01 17:25:58 +03:00
spin_lock ( & req - > waitq . lock ) ;
2015-07-01 17:25:58 +03:00
if ( test_bit ( FR_ABORTED , & req - > flags ) )
2005-09-10 00:10:27 +04:00
err = - ENOENT ;
else
2015-07-01 17:25:58 +03:00
set_bit ( FR_LOCKED , & req - > flags ) ;
2015-07-01 17:25:58 +03:00
spin_unlock ( & req - > waitq . lock ) ;
2005-09-10 00:10:27 +04:00
}
return err ;
}
/*
2015-07-01 17:25:58 +03:00
* Unlock request . If it was aborted while locked , caller is responsible
* for unlocking and ending the request .
2005-09-10 00:10:27 +04:00
*/
2015-07-01 17:25:58 +03:00
static int unlock_request ( struct fuse_req * req )
2005-09-10 00:10:27 +04:00
{
2015-07-01 17:25:58 +03:00
int err = 0 ;
2005-09-10 00:10:27 +04:00
if ( req ) {
2015-07-01 17:25:58 +03:00
spin_lock ( & req - > waitq . lock ) ;
2015-07-01 17:25:58 +03:00
if ( test_bit ( FR_ABORTED , & req - > flags ) )
2015-07-01 17:25:58 +03:00
err = - ENOENT ;
else
2015-07-01 17:25:58 +03:00
clear_bit ( FR_LOCKED , & req - > flags ) ;
2015-07-01 17:25:58 +03:00
spin_unlock ( & req - > waitq . lock ) ;
2005-09-10 00:10:27 +04:00
}
2015-07-01 17:25:58 +03:00
return err ;
2005-09-10 00:10:27 +04:00
}
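The lock/unlock pair above can be modelled in userspace as a small state machine. The following is a minimal sketch, not the kernel implementation: `struct model_req`, `model_lock_request()`, `model_unlock_request()` and `model_abort()` are hypothetical names, an `atomic_flag` spinlock stands in for `req->waitq.lock`, and two booleans stand in for the `FR_ABORTED` and `FR_LOCKED` bits.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the relevant parts of struct fuse_req. */
struct model_req {
	atomic_flag lock;	/* stands in for req->waitq.lock */
	bool aborted;		/* stands in for FR_ABORTED */
	bool locked;		/* stands in for FR_LOCKED */
};

void model_req_init(struct model_req *req)
{
	atomic_flag_clear(&req->lock);
	req->aborted = false;
	req->locked = false;
}

void model_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the flag is released */
}

void model_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Mark the request as being copied, unless it was already aborted. */
int model_lock_request(struct model_req *req)
{
	int err = 0;

	if (req) {
		model_spin_lock(&req->lock);
		if (req->aborted)
			err = -ENOENT;
		else
			req->locked = true;
		model_spin_unlock(&req->lock);
	}
	return err;
}

/* Clear the mark; on abort it stays set and the caller must end the request. */
int model_unlock_request(struct model_req *req)
{
	int err = 0;

	if (req) {
		model_spin_lock(&req->lock);
		if (req->aborted)
			err = -ENOENT;
		else
			req->locked = false;
		model_spin_unlock(&req->lock);
	}
	return err;
}

/* What an aborting thread does in this model: it only raises the flag. */
void model_abort(struct model_req *req)
{
	model_spin_lock(&req->lock);
	req->aborted = true;
	model_spin_unlock(&req->lock);
}
```

In this model, a request that is aborted while locked keeps its locked mark after `model_unlock_request()` returns `-ENOENT`, which mirrors the comment above: the caller, not the aborting side, is responsible for unlocking and ending the request.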
struct fuse_copy_state {
int write ;
struct fuse_req * req ;
2015-04-04 05:06:08 +03:00
struct iov_iter * iter ;
fuse: support splice() writing to fuse device
Allow userspace filesystem implementation to use splice() to write to
the fuse device. The semantics of using splice() are:
1) buffer the message header and data in a temporary pipe
2) with a *single* splice() call move the message from the temporary pipe
to the fuse device
The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be transferred into the
fuse device without having to go through a userspace buffer. It will
also allow zero copy moving of pages.
One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header. But the length of
the data transferred into the temporary pipe may not be known in
advance. The current library implementation works around this by
using vmsplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):
struct fuse_out_header out;
iov.iov_base = &out;
iov.iov_len = sizeof(struct fuse_out_header);
vmsplice(pip[1], &iov, 1, 0);
len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
/* retrospectively modify the header: */
out.len = len + sizeof(struct fuse_out_header);
splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);
This works since vmsplice only saves a pointer to the data, it does
not copy the data itself.
Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 17:06:06 +04:00
struct pipe_buffer * pipebufs ;
struct pipe_buffer * currbuf ;
struct pipe_inode_info * pipe ;
2005-09-10 00:10:27 +04:00
unsigned long nr_segs ;
struct page * pg ;
unsigned len ;
2014-07-07 17:28:51 +04:00
unsigned offset ;
2010-05-25 17:06:07 +04:00
unsigned move_pages : 1 ;
2005-09-10 00:10:27 +04:00
} ;
2015-07-01 17:25:58 +03:00
static void fuse_copy_init ( struct fuse_copy_state * cs , int write ,
2015-04-04 05:06:08 +03:00
struct iov_iter * iter )
2005-09-10 00:10:27 +04:00
{
memset ( cs , 0 , sizeof ( * cs ) ) ;
cs - > write = write ;
2015-04-04 05:06:08 +03:00
cs - > iter = iter ;
2005-09-10 00:10:27 +04:00
}
/* Unmap and put previous page of userspace buffer */
2006-01-17 09:14:28 +03:00
static void fuse_copy_finish ( struct fuse_copy_state * cs )
2005-09-10 00:10:27 +04:00
{
2010-05-25 17:06:06 +04:00
if ( cs - > currbuf ) {
struct pipe_buffer * buf = cs - > currbuf ;
2014-07-07 17:28:51 +04:00
if ( cs - > write )
2010-05-25 17:06:07 +04:00
buf - > len = PAGE_SIZE - cs - > len ;
2010-05-25 17:06:06 +04:00
cs - > currbuf = NULL ;
2014-07-07 17:28:51 +04:00
} else if ( cs - > pg ) {
2005-09-10 00:10:27 +04:00
if ( cs - > write ) {
flush_dcache_page ( cs - > pg ) ;
set_page_dirty_lock ( cs - > pg ) ;
}
put_page ( cs - > pg ) ;
}
2014-07-07 17:28:51 +04:00
cs - > pg = NULL ;
2005-09-10 00:10:27 +04:00
}
/*
* Get another pageful of userspace buffer , and map it to kernel
* address space , and lock request
*/
static int fuse_copy_fill ( struct fuse_copy_state * cs )
{
2014-07-07 17:28:51 +04:00
struct page * page ;
2005-09-10 00:10:27 +04:00
int err ;
2015-07-01 17:25:58 +03:00
err = unlock_request ( cs - > req ) ;
2015-07-01 17:25:58 +03:00
if ( err )
return err ;
2005-09-10 00:10:27 +04:00
fuse_copy_finish ( cs ) ;
2010-05-25 17:06:06 +04:00
if ( cs - > pipebufs ) {
struct pipe_buffer * buf = cs - > pipebufs ;
2010-05-25 17:06:07 +04:00
if ( ! cs - > write ) {
2016-09-27 11:45:12 +03:00
err = pipe_buf_confirm ( cs - > pipe , buf ) ;
2010-05-25 17:06:07 +04:00
if ( err )
return err ;
BUG_ON ( ! cs - > nr_segs ) ;
cs - > currbuf = buf ;
2014-07-07 17:28:51 +04:00
cs - > pg = buf - > page ;
cs - > offset = buf - > offset ;
2010-05-25 17:06:07 +04:00
cs - > len = buf - > len ;
cs - > pipebufs + + ;
cs - > nr_segs - - ;
} else {
2019-10-16 18:47:32 +03:00
if ( cs - > nr_segs > = cs - > pipe - > max_usage )
2010-05-25 17:06:07 +04:00
return - EIO ;
page = alloc_page ( GFP_HIGHUSER ) ;
if ( ! page )
return - ENOMEM ;
buf - > page = page ;
buf - > offset = 0 ;
buf - > len = 0 ;
cs - > currbuf = buf ;
2014-07-07 17:28:51 +04:00
cs - > pg = page ;
cs - > offset = 0 ;
2010-05-25 17:06:07 +04:00
cs - > len = PAGE_SIZE ;
cs - > pipebufs + + ;
cs - > nr_segs + + ;
}
2010-05-25 17:06:06 +04:00
} else {
2015-04-04 05:06:08 +03:00
size_t off ;
2022-06-09 17:28:36 +03:00
err = iov_iter_get_pages2 ( cs - > iter , & page , PAGE_SIZE , 1 , & off ) ;
2010-05-25 17:06:06 +04:00
if ( err < 0 )
return err ;
2015-04-04 05:06:08 +03:00
BUG_ON ( ! err ) ;
cs - > len = err ;
cs - > offset = off ;
2014-07-07 17:28:51 +04:00
cs - > pg = page ;
2005-09-10 00:10:27 +04:00
}
2015-07-01 17:25:58 +03:00
return lock_request ( cs - > req ) ;
2005-09-10 00:10:27 +04:00
}
/* Do as much copy to/from userspace buffer as we can */
2006-01-17 09:14:28 +03:00
static int fuse_copy_do ( struct fuse_copy_state * cs , void * * val , unsigned * size )
2005-09-10 00:10:27 +04:00
{
unsigned ncpy = min ( * size , cs - > len ) ;
if ( val ) {
2021-09-08 11:38:28 +03:00
void * pgaddr = kmap_local_page ( cs - > pg ) ;
2014-07-07 17:28:51 +04:00
void * buf = pgaddr + cs - > offset ;
2005-09-10 00:10:27 +04:00
if ( cs - > write )
2014-07-07 17:28:51 +04:00
memcpy ( buf , * val , ncpy ) ;
2005-09-10 00:10:27 +04:00
else
2014-07-07 17:28:51 +04:00
memcpy ( * val , buf , ncpy ) ;
2021-09-08 11:38:28 +03:00
kunmap_local ( pgaddr ) ;
2005-09-10 00:10:27 +04:00
* val + = ncpy ;
}
* size - = ncpy ;
cs - > len - = ncpy ;
2014-07-07 17:28:51 +04:00
cs - > offset + = ncpy ;
2005-09-10 00:10:27 +04:00
return ncpy ;
}
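The bookkeeping in fuse_copy_do() (copy at most what is left in the current window, then advance every counter by the same amount) can be tried out in userspace. This is a sketch under stated assumptions: `struct model_copy_state` and `model_copy_do()` are hypothetical names, and a plain byte array replaces the page that the kernel maps with kmap_local_page().

```c
#include <string.h>

/* Hypothetical userspace stand-in for the fields fuse_copy_do() touches. */
struct model_copy_state {
	int write;		/* nonzero: copy from *val into the window */
	unsigned char *window;	/* stands in for the mapped page cs->pg */
	unsigned len;		/* bytes still available in the window */
	unsigned offset;	/* current position inside the window */
};

/*
 * Copy as much as fits into the current window and advance all cursors.
 * Returns the number of bytes consumed; *size keeps the remainder that
 * the caller must copy after refilling the window.
 */
unsigned model_copy_do(struct model_copy_state *cs, void **val, unsigned *size)
{
	unsigned ncpy = *size < cs->len ? *size : cs->len;

	if (val) {
		unsigned char *buf = cs->window + cs->offset;

		if (cs->write)
			memcpy(buf, *val, ncpy);
		else
			memcpy(*val, buf, ncpy);
		*val = (unsigned char *)*val + ncpy;
	}
	*size -= ncpy;
	cs->len -= ncpy;
	cs->offset += ncpy;
	return ncpy;
}
```

Passing a NULL `val` skips the memcpy() but still advances the counters, which is how the code above consumes bytes without a kernel-side buffer. A caller such as fuse_copy_one() loops until `*size` reaches zero and refills the window whenever `cs->len` is zero.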
2022-11-01 20:53:23 +03:00
static int fuse_check_folio ( struct folio * folio )
2010-05-25 17:06:07 +04:00
{
2022-11-01 20:53:23 +03:00
if ( folio_mapped ( folio ) | |
folio - > mapping ! = NULL | |
( folio - > flags & PAGE_FLAGS_CHECK_AT_PREP &
2010-05-25 17:06:07 +04:00
~ ( 1 < < PG_locked |
1 < < PG_referenced |
1 < < PG_uptodate |
1 < < PG_lru |
1 < < PG_active |
2021-06-18 22:16:42 +03:00
1 < < PG_workingset |
2020-05-19 15:50:37 +03:00
1 < < PG_reclaim |
mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 11:00:02 +03:00
1 < < PG_waiters |
LRU_GEN_MASK | LRU_REFS_MASK ) ) ) {
2022-11-01 20:53:23 +03:00
dump_page ( & folio - > page , " fuse: trying to steal weird page " ) ;
2010-05-25 17:06:07 +04:00
return 1 ;
}
return 0 ;
}
static int fuse_try_move_page ( struct fuse_copy_state * cs , struct page * * pagep )
{
int err ;
2022-11-01 20:53:23 +03:00
struct folio * oldfolio = page_folio ( * pagep ) ;
struct folio * newfolio ;
2010-05-25 17:06:07 +04:00
struct pipe_buffer * buf = cs - > pipebufs ;
2022-11-01 20:53:23 +03:00
folio_get ( oldfolio ) ;
2015-07-01 17:25:58 +03:00
err = unlock_request ( cs - > req ) ;
2015-07-01 17:25:58 +03:00
if ( err )
2020-09-18 11:36:50 +03:00
goto out_put_old ;
2015-07-01 17:25:58 +03:00
2010-05-25 17:06:07 +04:00
fuse_copy_finish ( cs ) ;
2016-09-27 11:45:12 +03:00
err = pipe_buf_confirm ( cs - > pipe , buf ) ;
2010-05-25 17:06:07 +04:00
if ( err )
2020-09-18 11:36:50 +03:00
goto out_put_old ;
2010-05-25 17:06:07 +04:00
BUG_ON ( ! cs - > nr_segs ) ;
cs - > currbuf = buf ;
cs - > len = buf - > len ;
cs - > pipebufs + + ;
cs - > nr_segs - - ;
if ( cs - > len ! = PAGE_SIZE )
goto out_fallback ;
2020-05-20 18:58:16 +03:00
if ( ! pipe_buf_try_steal ( cs - > pipe , buf ) )
2010-05-25 17:06:07 +04:00
goto out_fallback ;
2022-11-01 20:53:23 +03:00
newfolio = page_folio ( buf - > page ) ;
2010-05-25 17:06:07 +04:00
2022-11-01 20:53:23 +03:00
if ( ! folio_test_uptodate ( newfolio ) )
folio_mark_uptodate ( newfolio ) ;
2010-05-25 17:06:07 +04:00
2022-11-01 20:53:23 +03:00
folio_clear_mappedtodisk ( newfolio ) ;
2010-05-25 17:06:07 +04:00
2022-11-01 20:53:23 +03:00
if ( fuse_check_folio ( newfolio ) ! = 0 )
2010-05-25 17:06:07 +04:00
goto out_fallback_unlock ;
/*
* This is a new and locked page , it shouldn ' t be mapped or
* have any special flags on it
*/
2022-11-01 20:53:23 +03:00
if ( WARN_ON ( folio_mapped ( oldfolio ) ) )
2010-05-25 17:06:07 +04:00
goto out_fallback_unlock ;
2022-11-01 20:53:23 +03:00
if ( WARN_ON ( folio_has_private ( oldfolio ) ) )
2010-05-25 17:06:07 +04:00
goto out_fallback_unlock ;
2022-11-01 20:53:23 +03:00
if ( WARN_ON ( folio_test_dirty ( oldfolio ) | |
folio_test_writeback ( oldfolio ) ) )
2010-05-25 17:06:07 +04:00
goto out_fallback_unlock ;
2022-11-01 20:53:23 +03:00
if ( WARN_ON ( folio_test_mlocked ( oldfolio ) ) )
2010-05-25 17:06:07 +04:00
goto out_fallback_unlock ;
2022-11-01 20:53:23 +03:00
replace_page_cache_folio ( oldfolio , newfolio ) ;
2011-03-23 02:30:52 +03:00
2022-11-01 20:53:23 +03:00
folio_get ( newfolio ) ;
2021-11-25 16:05:18 +03:00
if ( ! ( buf - > flags & PIPE_BUF_FLAG_LRU ) )
2022-11-01 20:53:23 +03:00
folio_add_lru ( newfolio ) ;
2021-11-25 16:05:18 +03:00
2021-11-02 13:10:37 +03:00
/*
* Release while we have extra ref on stolen page . Otherwise
* anon_pipe_buf_release ( ) might think the page can be reused .
*/
pipe_buf_release ( cs - > pipe , buf ) ;
2010-05-25 17:06:07 +04:00
err = 0 ;
2015-07-01 17:25:58 +03:00
spin_lock ( & cs - > req - > waitq . lock ) ;
2015-07-01 17:25:58 +03:00
if ( test_bit ( FR_ABORTED , & cs - > req - > flags ) )
2010-05-25 17:06:07 +04:00
err = - ENOENT ;
else
2022-11-01 20:53:23 +03:00
* pagep = & newfolio - > page ;
2015-07-01 17:25:58 +03:00
spin_unlock ( & cs - > req - > waitq . lock ) ;
2010-05-25 17:06:07 +04:00
if ( err ) {
2022-11-01 20:53:23 +03:00
folio_unlock ( newfolio ) ;
folio_put ( newfolio ) ;
2020-09-18 11:36:50 +03:00
goto out_put_old ;
2010-05-25 17:06:07 +04:00
}
2022-11-01 20:53:23 +03:00
folio_unlock ( oldfolio ) ;
2020-09-18 11:36:50 +03:00
/* Drop ref for ap->pages[] array */
2022-11-01 20:53:23 +03:00
folio_put ( oldfolio ) ;
2010-05-25 17:06:07 +04:00
cs - > len = 0 ;
2020-09-18 11:36:50 +03:00
err = 0 ;
out_put_old :
/* Drop ref obtained in this function */
2022-11-01 20:53:23 +03:00
folio_put ( oldfolio ) ;
2020-09-18 11:36:50 +03:00
return err ;
2010-05-25 17:06:07 +04:00
out_fallback_unlock :
2022-11-01 20:53:23 +03:00
folio_unlock ( newfolio ) ;
2010-05-25 17:06:07 +04:00
out_fallback :
2014-07-07 17:28:51 +04:00
cs - > pg = buf - > page ;
cs - > offset = buf - > offset ;
2010-05-25 17:06:07 +04:00
2015-07-01 17:25:58 +03:00
err = lock_request ( cs - > req ) ;
2020-09-18 11:36:50 +03:00
if ( ! err )
err = 1 ;
2010-05-25 17:06:07 +04:00
2020-09-18 11:36:50 +03:00
goto out_put_old ;
2010-05-25 17:06:07 +04:00
}
2010-05-25 17:06:07 +04:00
static int fuse_ref_page ( struct fuse_copy_state * cs , struct page * page ,
unsigned offset , unsigned count )
{
struct pipe_buffer * buf ;
2015-07-01 17:25:58 +03:00
int err ;
2010-05-25 17:06:07 +04:00
2019-10-16 18:47:32 +03:00
if ( cs - > nr_segs > = cs - > pipe - > max_usage )
2010-05-25 17:06:07 +04:00
return - EIO ;
2020-09-18 11:36:50 +03:00
get_page ( page ) ;
2015-07-01 17:25:58 +03:00
err = unlock_request ( cs - > req ) ;
2020-09-18 11:36:50 +03:00
if ( err ) {
put_page ( page ) ;
2015-07-01 17:25:58 +03:00
return err ;
2020-09-18 11:36:50 +03:00
}
2015-07-01 17:25:58 +03:00
2010-05-25 17:06:07 +04:00
fuse_copy_finish ( cs ) ;
buf = cs - > pipebufs ;
buf - > page = page ;
buf - > offset = offset ;
buf - > len = count ;
cs - > pipebufs + + ;
cs - > nr_segs + + ;
cs - > len = 0 ;
return 0 ;
}
2005-09-10 00:10:27 +04:00
/*
* Copy a page in the request to / from the userspace buffer . Must be
* done atomically
*/
2010-05-25 17:06:07 +04:00
static int fuse_copy_page ( struct fuse_copy_state * cs , struct page * * pagep ,
2006-01-17 09:14:28 +03:00
unsigned offset , unsigned count , int zeroing )
2005-09-10 00:10:27 +04:00
{
2010-05-25 17:06:07 +04:00
int err ;
struct page * page = * pagep ;
2010-10-27 01:22:27 +04:00
if ( page & & zeroing & & count < PAGE_SIZE )
clear_highpage ( page ) ;
2005-09-10 00:10:27 +04:00
while ( count ) {
2010-05-25 17:06:07 +04:00
if ( cs - > write & & cs - > pipebufs & & page ) {
2022-03-07 18:30:44 +03:00
/*
* Can ' t control lifetime of pipe buffers , so always
* copy user pages .
*/
if ( cs - > req - > args - > user_pages ) {
err = fuse_copy_fill ( cs ) ;
if ( err )
return err ;
} else {
return fuse_ref_page ( cs , page , offset , count ) ;
}
2010-05-25 17:06:07 +04:00
} else if ( ! cs - > len ) {
2010-05-25 17:06:07 +04:00
if ( cs - > move_pages & & page & &
offset = = 0 & & count = = PAGE_SIZE ) {
err = fuse_try_move_page ( cs , pagep ) ;
if ( err < = 0 )
return err ;
} else {
err = fuse_copy_fill ( cs ) ;
if ( err )
return err ;
}
2008-11-26 14:03:54 +03:00
}
2005-09-10 00:10:27 +04:00
if ( page ) {
2021-09-08 11:38:28 +03:00
void * mapaddr = kmap_local_page ( page ) ;
2005-09-10 00:10:27 +04:00
void * buf = mapaddr + offset ;
offset + = fuse_copy_do ( cs , & buf , & count ) ;
2021-09-08 11:38:28 +03:00
kunmap_local ( mapaddr ) ;
2005-09-10 00:10:27 +04:00
} else
offset + = fuse_copy_do ( cs , NULL , & count ) ;
}
if ( page & & ! cs - > write )
flush_dcache_page ( page ) ;
return 0 ;
}
/* Copy pages in the request to/from userspace buffer */
static int fuse_copy_pages ( struct fuse_copy_state * cs , unsigned nbytes ,
int zeroing )
{
unsigned i ;
struct fuse_req * req = cs - > req ;
2019-09-10 16:04:11 +03:00
struct fuse_args_pages * ap = container_of ( req - > args , typeof ( * ap ) , args ) ;
2005-09-10 00:10:27 +04:00
2019-09-10 16:04:11 +03:00
for ( i = 0 ; i < ap - > num_pages & & ( nbytes | | zeroing ) ; i + + ) {
2010-05-25 17:06:07 +04:00
int err ;
2019-09-10 16:04:11 +03:00
unsigned int offset = ap - > descs [ i ] . offset ;
unsigned int count = min ( nbytes , ap - > descs [ i ] . length ) ;
2010-05-25 17:06:07 +04:00
2019-09-10 16:04:11 +03:00
err = fuse_copy_page ( cs , & ap - > pages [ i ] , offset , count , zeroing ) ;
2005-09-10 00:10:27 +04:00
if ( err )
return err ;
nbytes - = count ;
}
return 0 ;
}
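The loop condition in fuse_copy_pages() is easy to misread: with zeroing set, the loop keeps visiting pages after nbytes has dropped to zero, passing a count of 0 for each. The sketch below only reproduces that arithmetic. `model_page_counts()` is a hypothetical helper, and the `lengths` array stands in for `ap->descs[i].length`.

```c
/*
 * Hypothetical model of how fuse_copy_pages() splits nbytes across pages.
 * Writes the per-page byte count into counts[] and returns how many pages
 * the loop visited.
 */
unsigned model_page_counts(unsigned nbytes, const unsigned *lengths,
			   unsigned num_pages, int zeroing, unsigned *counts)
{
	unsigned i;

	for (i = 0; i < num_pages && (nbytes || zeroing); i++) {
		unsigned count = nbytes < lengths[i] ? nbytes : lengths[i];

		counts[i] = count;	/* the kernel passes this to fuse_copy_page() */
		nbytes -= count;
	}
	return i;
}
```

For three 4096-byte pages and nbytes of 5000, the model visits two pages (4096 and 904 bytes) without zeroing and all three pages (4096, 904 and 0 bytes) with zeroing. In the kernel code above, a page visited with a count below PAGE_SIZE while zeroing is set is cleared by clear_highpage() in fuse_copy_page().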
/* Copy a single argument in the request to/from userspace buffer */
static int fuse_copy_one ( struct fuse_copy_state * cs , void * val , unsigned size )
{
while ( size ) {
2008-11-26 14:03:54 +03:00
if ( ! cs - > len ) {
int err = fuse_copy_fill ( cs ) ;
if ( err )
return err ;
}
2005-09-10 00:10:27 +04:00
fuse_copy_do ( cs , & val , & size ) ;
}
return 0 ;
}
/* Copy request arguments to/from userspace buffer */
static int fuse_copy_args ( struct fuse_copy_state * cs , unsigned numargs ,
unsigned argpages , struct fuse_arg * args ,
int zeroing )
{
int err = 0 ;
unsigned i ;
for ( i = 0 ; ! err & & i < numargs ; i + + ) {
struct fuse_arg * arg = & args [ i ] ;
if ( i = = numargs - 1 & & argpages )
err = fuse_copy_pages ( cs , arg - > size , zeroing ) ;
else
err = fuse_copy_one ( cs , arg - > value , arg - > size ) ;
}
return err ;
}
2015-07-01 17:26:01 +03:00
static int forget_pending ( struct fuse_iqueue * fiq )
2010-12-07 22:16:56 +03:00
{
2015-07-01 17:26:01 +03:00
return fiq - > forget_list_head . next ! = NULL ;
2010-12-07 22:16:56 +03:00
}
2015-07-01 17:26:01 +03:00
static int request_pending ( struct fuse_iqueue * fiq )
2006-06-25 16:48:54 +04:00
{
2015-07-01 17:26:01 +03:00
return ! list_empty ( & fiq - > pending ) | | ! list_empty ( & fiq - > interrupts ) | |
forget_pending ( fiq ) ;
2006-06-25 16:48:54 +04:00
}
/*
* Transfer an interrupt request to userspace
*
* Unlike other requests this is assembled on demand , without a need
* to allocate a separate fuse_req structure .
*
2019-09-09 06:15:18 +03:00
* Called with fiq - > lock held , releases it
2006-06-25 16:48:54 +04:00
*/
2015-07-01 17:26:03 +03:00
static int fuse_read_interrupt ( struct fuse_iqueue * fiq ,
struct fuse_copy_state * cs ,
2010-05-25 17:06:07 +04:00
size_t nbytes , struct fuse_req * req )
2019-09-09 06:15:18 +03:00
__releases ( fiq - > lock )
2006-06-25 16:48:54 +04:00
{
struct fuse_in_header ih ;
struct fuse_interrupt_in arg ;
unsigned reqsize = sizeof ( ih ) + sizeof ( arg ) ;
int err ;
list_del_init ( & req - > intr_entry ) ;
memset ( & ih , 0 , sizeof ( ih ) ) ;
memset ( & arg , 0 , sizeof ( arg ) ) ;
ih . len = reqsize ;
ih . opcode = FUSE_INTERRUPT ;
2018-09-11 13:12:05 +03:00
ih . unique = ( req - > in . h . unique | FUSE_INT_REQ_BIT ) ;
2006-06-25 16:48:54 +04:00
arg . unique = req - > in . h . unique ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2010-05-25 17:06:07 +04:00
if ( nbytes < reqsize )
2006-06-25 16:48:54 +04:00
return - EINVAL ;
2010-05-25 17:06:07 +04:00
err = fuse_copy_one ( cs , & ih , sizeof ( ih ) ) ;
2006-06-25 16:48:54 +04:00
if ( ! err )
2010-05-25 17:06:07 +04:00
err = fuse_copy_one ( cs , & arg , sizeof ( arg ) ) ;
fuse_copy_finish ( cs ) ;
2006-06-25 16:48:54 +04:00
return err ? err : reqsize ;
}
2019-06-05 22:50:43 +03:00
struct fuse_forget_link * fuse_dequeue_forget ( struct fuse_iqueue * fiq ,
unsigned int max ,
unsigned int * countp )
2010-12-07 22:16:56 +03:00
{
2015-07-01 17:26:01 +03:00
struct fuse_forget_link * head = fiq - > forget_list_head . next ;
2010-12-07 22:16:56 +03:00
struct fuse_forget_link * * newhead = & head ;
unsigned count ;
2010-12-07 22:16:56 +03:00
2010-12-07 22:16:56 +03:00
for ( count = 0 ; * newhead ! = NULL & & count < max ; count + + )
newhead = & ( * newhead ) - > next ;
2015-07-01 17:26:01 +03:00
fiq - > forget_list_head . next = * newhead ;
2010-12-07 22:16:56 +03:00
* newhead = NULL ;
2015-07-01 17:26:01 +03:00
if ( fiq - > forget_list_head . next = = NULL )
fiq - > forget_list_tail = & fiq - > forget_list_head ;
2010-12-07 22:16:56 +03:00
2010-12-07 22:16:56 +03:00
if ( countp ! = NULL )
* countp = count ;
return head ;
2010-12-07 22:16:56 +03:00
}
2019-06-05 22:50:43 +03:00
EXPORT_SYMBOL ( fuse_dequeue_forget ) ;
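The pointer-to-pointer walk in fuse_dequeue_forget() can be tried in isolation. Below is a minimal userspace sketch, assuming a singly linked list with a dummy head and a tail pointer. The types link and queue and the helpers queue_init, queue_push and queue_dequeue are hypothetical stand-ins for fuse_forget_link and the forget fields of fuse_iqueue, and no locking is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for fuse_forget_link. */
struct link {
	struct link *next;
	int id;
};

/* Hypothetical stand-in for the forget list kept in fuse_iqueue. */
struct queue {
	struct link head;   /* dummy head; head.next is the first real entry */
	struct link *tail;  /* last entry, or &head when the queue is empty */
};

static void queue_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void queue_push(struct queue *q, struct link *l)
{
	assert(q->tail != NULL);
	l->next = NULL;
	q->tail->next = l;
	q->tail = l;
}

/* Detach up to max entries from the front as a NULL-terminated chain. */
static struct link *queue_dequeue(struct queue *q, unsigned int max,
				  unsigned int *countp)
{
	struct link *head = q->head.next;
	struct link **newhead = &head;
	unsigned int count;

	/* Advance newhead to the link field that follows the last taken entry. */
	for (count = 0; *newhead != NULL && count < max; count++)
		newhead = &(*newhead)->next;

	q->head.next = *newhead;  /* the queue keeps whatever was not taken */
	*newhead = NULL;          /* terminate the detached chain */

	if (q->head.next == NULL)
		q->tail = &q->head;   /* queue is empty again: reset the tail */

	if (countp != NULL)
		*countp = count;
	return head;
}
```

Because newhead always points at a link field rather than at an entry, the same two assignments split the list correctly whether zero, some, or all entries are taken.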
2010-12-07 22:16:56 +03:00
2015-07-01 17:26:03 +03:00
static int fuse_read_single_forget ( struct fuse_iqueue * fiq ,
2010-12-07 22:16:56 +03:00
struct fuse_copy_state * cs ,
size_t nbytes )
2019-09-09 06:15:18 +03:00
__releases ( fiq - > lock )
2010-12-07 22:16:56 +03:00
{
int err ;
2019-06-05 22:50:43 +03:00
struct fuse_forget_link * forget = fuse_dequeue_forget ( fiq , 1 , NULL ) ;
2010-12-07 22:16:56 +03:00
struct fuse_forget_in arg = {
2010-12-07 22:16:56 +03:00
. nlookup = forget - > forget_one . nlookup ,
2010-12-07 22:16:56 +03:00
} ;
struct fuse_in_header ih = {
. opcode = FUSE_FORGET ,
2010-12-07 22:16:56 +03:00
. nodeid = forget - > forget_one . nodeid ,
2015-07-01 17:26:01 +03:00
. unique = fuse_get_unique ( fiq ) ,
2010-12-07 22:16:56 +03:00
. len = sizeof ( ih ) + sizeof ( arg ) ,
} ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2010-12-07 22:16:56 +03:00
kfree ( forget ) ;
if ( nbytes < ih . len )
return - EINVAL ;
err = fuse_copy_one ( cs , & ih , sizeof ( ih ) ) ;
if ( ! err )
err = fuse_copy_one ( cs , & arg , sizeof ( arg ) ) ;
fuse_copy_finish ( cs ) ;
if ( err )
return err ;
return ih . len ;
}
2015-07-01 17:26:03 +03:00
static int fuse_read_batch_forget ( struct fuse_iqueue * fiq ,
2010-12-07 22:16:56 +03:00
struct fuse_copy_state * cs , size_t nbytes )
2019-09-09 06:15:18 +03:00
__releases ( fiq - > lock )
2010-12-07 22:16:56 +03:00
{
int err ;
unsigned max_forgets ;
unsigned count ;
struct fuse_forget_link * head ;
struct fuse_batch_forget_in arg = { . count = 0 } ;
struct fuse_in_header ih = {
. opcode = FUSE_BATCH_FORGET ,
2015-07-01 17:26:01 +03:00
. unique = fuse_get_unique ( fiq ) ,
2010-12-07 22:16:56 +03:00
. len = sizeof ( ih ) + sizeof ( arg ) ,
} ;
if ( nbytes < ih . len ) {
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2010-12-07 22:16:56 +03:00
return - EINVAL ;
}
max_forgets = ( nbytes - ih . len ) / sizeof ( struct fuse_forget_one ) ;
2019-06-05 22:50:43 +03:00
head = fuse_dequeue_forget ( fiq , max_forgets , & count ) ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2010-12-07 22:16:56 +03:00
arg . count = count ;
ih . len + = count * sizeof ( struct fuse_forget_one ) ;
err = fuse_copy_one ( cs , & ih , sizeof ( ih ) ) ;
if ( ! err )
err = fuse_copy_one ( cs , & arg , sizeof ( arg ) ) ;
while ( head ) {
struct fuse_forget_link * forget = head ;
if ( ! err ) {
err = fuse_copy_one ( cs , & forget - > forget_one ,
sizeof ( forget - > forget_one ) ) ;
}
head = forget - > next ;
kfree ( forget ) ;
}
fuse_copy_finish ( cs ) ;
if ( err )
return err ;
return ih . len ;
}
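The sizing arithmetic in fuse_read_batch_forget() is easy to check by hand. The following is a minimal sketch, assuming the sizes are passed in as plain numbers instead of being taken from the real FUSE structures. The helpers batch_capacity and batch_message_len are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * How many fixed-size entries fit in a read buffer of nbytes after the
 * fixed part (header plus batch argument) has been accounted for.
 * Returns -1 when the buffer cannot hold even the fixed part, which is
 * the case the kernel function rejects with -EINVAL.
 */
static long batch_capacity(size_t nbytes, size_t fixed_len, size_t entry_size)
{
	assert(entry_size > 0);
	if (nbytes < fixed_len)
		return -1;
	return (long)((nbytes - fixed_len) / entry_size);
}

/* Total message length once count entries follow the fixed part. */
static size_t batch_message_len(size_t fixed_len, size_t entry_size,
				size_t count)
{
	return fixed_len + count * entry_size;
}
```

With made-up sizes of 48 bytes for the fixed part and 16 bytes per entry, a 100-byte buffer leaves 52 bytes, so three entries fit and the resulting message is 96 bytes long. The number actually sent is the smaller of this capacity and the number of entries queued.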
2015-07-01 17:26:03 +03:00
static int fuse_read_forget ( struct fuse_conn * fc , struct fuse_iqueue * fiq ,
struct fuse_copy_state * cs ,
2010-12-07 22:16:56 +03:00
size_t nbytes )
2019-09-09 06:15:18 +03:00
__releases ( fiq - > lock )
2010-12-07 22:16:56 +03:00
{
2015-07-01 17:26:01 +03:00
if ( fc - > minor < 16 | | fiq - > forget_list_head . next - > next = = NULL )
2015-07-01 17:26:03 +03:00
return fuse_read_single_forget ( fiq , cs , nbytes ) ;
2010-12-07 22:16:56 +03:00
else
2015-07-01 17:26:03 +03:00
return fuse_read_batch_forget ( fiq , cs , nbytes ) ;
2010-12-07 22:16:56 +03:00
}
2005-09-10 00:10:27 +04:00
/*
* Read a single request into the userspace filesystem ' s buffer . This
* function waits until a request is available , then removes it from
* the pending list and copies request data to userspace buffer . If
2006-06-25 16:48:53 +04:00
* no reply is needed ( FORGET ) or request has been aborted or there
* was an error during the copying then it ' s finished by calling
2018-06-21 11:33:40 +03:00
* fuse_request_end ( ) . Otherwise add it to the processing list , and set
2005-09-10 00:10:27 +04:00
* the ' sent ' flag .
*/
2015-07-01 17:26:09 +03:00
static ssize_t fuse_dev_do_read ( struct fuse_dev * fud , struct file * file ,
2010-05-25 17:06:07 +04:00
struct fuse_copy_state * cs , size_t nbytes )
2005-09-10 00:10:27 +04:00
{
2015-07-01 17:26:05 +03:00
ssize_t err ;
2015-07-01 17:26:09 +03:00
struct fuse_conn * fc = fud - > fc ;
2015-07-01 17:26:01 +03:00
struct fuse_iqueue * fiq = & fc - > iq ;
2015-07-01 17:26:09 +03:00
struct fuse_pqueue * fpq = & fud - > pq ;
2005-09-10 00:10:27 +04:00
struct fuse_req * req ;
2019-09-10 16:04:11 +03:00
struct fuse_args * args ;
2005-09-10 00:10:27 +04:00
unsigned reqsize ;
2018-09-11 13:12:14 +03:00
unsigned int hash ;
2005-09-10 00:10:27 +04:00
fuse: require /dev/fuse reads to have enough buffer capacity (take 2)
[ This retries commit d4b13963f217 ("fuse: require /dev/fuse reads to have
enough buffer capacity"), which was reverted. In this version we require
only `sizeof(fuse_in_header) + sizeof(fuse_write_in)` instead of 4K for
FUSE request header room, because, contrary to libfuse and kernel client
behaviour, GlusterFS actually provides only so much room for request
header. ]
A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
requests to handle. It does not know in advance what would be that request
as it can be anything that client issues - LOOKUP, READ, WRITE, ... Many
requests are short and retrieve data from the filesystem. However WRITE and
NOTIFY_REPLY write data into filesystem.
Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will ever
issue. After negotiation the contract in between server/client is that the
filesystem server then should queue /dev/fuse sys_read calls with enough
buffer capacity to receive any client request - WRITE in particular, while
FUSE client should not, in particular, send WRITE requests with >
negotiated max_write payload. FUSE client in kernel and libfuse
historically reserve 4K for request header. However an existing filesystem
server - GlusterFS - was found which reserves only 80 bytes for header room
(= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).
Since
`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
`sizeof(fuse_in_header) + sizeof(fuse_read_in)` ==
`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`
is the absolute minimum any sane filesystem should be using for header
room, the contract is that filesystem server should queue sys_reads with
`sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.
If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.
We can make the problem easy to diagnose by returning an error to the
filesystem server when it violates the contract. This should not cause
problems in practice: if a filesystem server is using a shorter buffer,
writes to it were already very likely to cause EIO, and a read-only
filesystem should likewise already be following the FUSE_MIN_READ_BUFFER
minimum buffer size.
Please see [1] for the context in which the stuck-filesystem problem was
hit for real (because the kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced, and for a more involved patch that did not make it
into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
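The buffer-size contract described in the message above can be sketched from the server side. This is a minimal illustration, not libfuse or kernel code: `HEADER_ROOM` uses the 80-byte figure quoted in the message, `MIN_READ_BUFFER` assumes the value 8192 for FUSE_MIN_READ_BUFFER, and both function names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed: 80 = sizeof(fuse_in_header) + sizeof(fuse_write_in), per the text. */
#define HEADER_ROOM 80u
/* Assumed value of FUSE_MIN_READ_BUFFER; check <linux/fuse.h> on your system. */
#define MIN_READ_BUFFER 8192u

/* Smallest read buffer that satisfies the contract for a given max_write. */
static size_t min_read_buffer(size_t max_write)
{
	size_t need = HEADER_ROOM + max_write;

	return need > MIN_READ_BUFFER ? need : MIN_READ_BUFFER;
}

/* Mirrors the kernel-side check: 0 if acceptable, -EINVAL if too short. */
static int check_read_buffer(size_t nbytes, size_t max_write)
{
	return nbytes < min_read_buffer(max_write) ? -EINVAL : 0;
}
```

Under these assumptions, a server that negotiated max_write = 131072 but queues reads of exactly 131072 bytes would be rejected, because the 80 bytes of header room are missing.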
2019-07-08 20:03:31 +03:00
/*
* Require sane minimum read buffer - that has capacity for fixed part
* of any request header + negotiated max_write room for data .
*
* Historically libfuse reserves 4 K for fixed header room , but e . g .
* GlusterFS reserves only 80 bytes
*
* = ` sizeof ( fuse_in_header ) + sizeof ( fuse_write_in ) `
*
* which is the absolute minimum any sane filesystem should be using
* for header room .
*/
if ( nbytes < max_t ( size_t , FUSE_MIN_READ_BUFFER ,
sizeof ( struct fuse_in_header ) +
sizeof ( struct fuse_write_in ) +
fc - > max_write ) )
return - EINVAL ;
2006-01-06 11:19:40 +03:00
restart :
fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
This may have to wait for fuse_iqueue::waitq.lock to be released by one
of many places that take it with IRQs enabled. Since the IRQ handler
may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
Fix it by protecting the state of struct fuse_iqueue with a separate
spinlock, and only accessing fuse_iqueue::waitq using the versions of
the waitqueue functions which do IRQ-safe locking internally.
Reproducer:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>
int main()
{
char opts[128];
int fd = open("/dev/fuse", O_RDWR);
aio_context_t ctx = 0;
struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
struct iocb *cbp = &cb;
sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
mkdir("mnt", 0700);
mount("foo", "mnt", "fuse", 0, opts);
syscall(__NR_io_setup, 1, &ctx);
syscall(__NR_io_submit, ctx, 1, &cbp);
}
Beginning of lockdep output:
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.3.0-rc5 #9 Not tainted
-----------------------------------------------------
syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
and this task is already holding:
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
which would create a new lock dependency:
(&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&(&ctx->ctx_lock)->rlock){..-.}
[...]
Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.19+
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-09 06:15:18 +03:00
for ( ; ; ) {
spin_lock ( & fiq - > lock ) ;
if ( ! fiq - > connected | | request_pending ( fiq ) )
break ;
spin_unlock ( & fiq - > lock ) ;
2006-04-11 09:54:53 +04:00
2019-09-09 06:15:18 +03:00
if ( file - > f_flags & O_NONBLOCK )
return - EAGAIN ;
err = wait_event_interruptible_exclusive ( fiq - > waitq ,
2015-07-01 17:26:03 +03:00
! fiq - > connected | | request_pending ( fiq ) ) ;
2019-09-09 06:15:18 +03:00
if ( err )
return err ;
}
2015-07-01 17:26:03 +03:00
2017-11-09 23:23:35 +03:00
if ( ! fiq - > connected ) {
2019-01-24 12:40:16 +03:00
err = fc - > aborted ? - ECONNABORTED : - ENODEV ;
2005-09-10 00:10:27 +04:00
goto err_unlock ;
2017-11-09 23:23:35 +03:00
}
2005-09-10 00:10:27 +04:00
2015-07-01 17:26:01 +03:00
if ( ! list_empty ( & fiq - > interrupts ) ) {
req = list_entry ( fiq - > interrupts . next , struct fuse_req ,
2006-06-25 16:48:54 +04:00
intr_entry ) ;
2015-07-01 17:26:03 +03:00
return fuse_read_interrupt ( fiq , cs , nbytes , req ) ;
2006-06-25 16:48:54 +04:00
}
2015-07-01 17:26:01 +03:00
if ( forget_pending ( fiq ) ) {
if ( list_empty ( & fiq - > pending ) | | fiq - > forget_batch - - > 0 )
2015-07-01 17:26:03 +03:00
return fuse_read_forget ( fc , fiq , cs , nbytes ) ;
2010-12-07 22:16:56 +03:00
2015-07-01 17:26:01 +03:00
if ( fiq - > forget_batch < = - 8 )
fiq - > forget_batch = 16 ;
2010-12-07 22:16:56 +03:00
}
2015-07-01 17:26:01 +03:00
req = list_entry ( fiq - > pending . next , struct fuse_req , list ) ;
2015-07-01 17:26:01 +03:00
clear_bit ( FR_PENDING , & req - > flags ) ;
2015-07-01 17:26:02 +03:00
list_del_init ( & req - > list ) ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2015-07-01 17:26:02 +03:00
2019-09-10 16:04:11 +03:00
args = req - > args ;
reqsize = req - > in . h . len ;
2017-09-12 17:57:53 +03:00
2006-01-06 11:19:40 +03:00
/* If request is too large, reply with an error and restart the read */
2010-05-25 17:06:07 +04:00
if ( nbytes < reqsize ) {
2006-01-06 11:19:40 +03:00
req - > out . h . error = - EIO ;
/* SETXATTR is special, since it may contain too large data */
2019-09-10 16:04:11 +03:00
if ( args - > opcode = = FUSE_SETXATTR )
2006-01-06 11:19:40 +03:00
req - > out . h . error = - E2BIG ;
2020-04-20 18:59:34 +03:00
fuse_request_end ( req ) ;
2006-01-06 11:19:40 +03:00
goto restart ;
2005-09-10 00:10:27 +04:00
}
2015-07-01 17:26:06 +03:00
spin_lock ( & fpq - > lock ) ;
2021-06-22 10:15:35 +03:00
/*
* Must not put request on fpq - > io queue after having been shut down by
* fuse_abort_conn ( )
*/
if ( ! fpq - > connected ) {
req - > out . h . error = err = - ECONNABORTED ;
goto out_end ;
}
2015-07-01 17:26:05 +03:00
list_add ( & req - > list , & fpq - > io ) ;
2015-07-01 17:26:06 +03:00
spin_unlock ( & fpq - > lock ) ;
2010-05-25 17:06:07 +04:00
cs - > req = req ;
2019-09-10 16:04:11 +03:00
err = fuse_copy_one ( cs , & req - > in . h , sizeof ( req - > in . h ) ) ;
2006-01-06 11:19:40 +03:00
if ( ! err )
2019-09-10 16:04:11 +03:00
err = fuse_copy_args ( cs , args - > in_numargs , args - > in_pages ,
( struct fuse_arg * ) args - > in_args , 0 ) ;
2010-05-25 17:06:07 +04:00
fuse_copy_finish ( cs ) ;
2015-07-01 17:26:06 +03:00
spin_lock ( & fpq - > lock ) ;
2015-07-01 17:25:58 +03:00
clear_bit ( FR_LOCKED , & req - > flags ) ;
2015-07-01 17:26:04 +03:00
if ( ! fpq - > connected ) {
2019-01-24 12:40:16 +03:00
err = fc - > aborted ? - ECONNABORTED : - ENODEV ;
2015-07-01 17:26:05 +03:00
goto out_end ;
2007-10-17 10:31:05 +04:00
}
2005-09-10 00:10:27 +04:00
if ( err ) {
2007-10-17 10:31:05 +04:00
req - > out . h . error = - EIO ;
2015-07-01 17:26:05 +03:00
goto out_end ;
2005-09-10 00:10:27 +04:00
}
2015-07-01 17:25:58 +03:00
if ( ! test_bit ( FR_ISREPLY , & req - > flags ) ) {
2015-07-01 17:26:05 +03:00
err = reqsize ;
goto out_end ;
2005-09-10 00:10:27 +04:00
}
2018-09-11 13:12:14 +03:00
hash = fuse_req_hash ( req - > in . h . unique ) ;
list_move_tail ( & req - > list , & fpq - > processing [ hash ] ) ;
2018-09-25 12:28:55 +03:00
__fuse_get_request ( req ) ;
2015-07-01 17:26:05 +03:00
set_bit ( FR_SENT , & req - > flags ) ;
2018-09-28 17:43:22 +03:00
spin_unlock ( & fpq - > lock ) ;
2015-07-01 17:26:05 +03:00
/* matches barrier in request_wait_answer() */
smp_mb__after_atomic ( ) ;
if ( test_bit ( FR_INTERRUPTED , & req - > flags ) )
2020-04-20 18:59:34 +03:00
queue_interrupt ( req ) ;
fuse_put_request ( req ) ;
2015-07-01 17:26:05 +03:00
2005-09-10 00:10:27 +04:00
return reqsize ;
2015-07-01 17:26:05 +03:00
out_end :
2015-07-01 17:26:06 +03:00
if ( ! test_bit ( FR_PRIVATE , & req - > flags ) )
list_del_init ( & req - > list ) ;
2015-07-01 17:26:06 +03:00
spin_unlock ( & fpq - > lock ) ;
2020-04-20 18:59:34 +03:00
fuse_request_end ( req ) ;
2015-07-01 17:26:05 +03:00
return err ;
2005-09-10 00:10:27 +04:00
err_unlock :
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2005-09-10 00:10:27 +04:00
return err ;
}
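The forget batching in the function above (the `forget_batch` counter with its thresholds of 16 and -8) interleaves FORGET messages with regular requests when both are queued. The following userspace model is only a sketch of that arithmetic: `struct queue_model`, `next_is_forget` and `count_forgets` are hypothetical names, and the model assumes a FORGET is always pending.

```c
#include <stdbool.h>

/* Hypothetical model of the input queue: only the batching counter matters. */
struct queue_model {
	int forget_batch;
};

/*
 * Decide what the next read returns while forgets are queued.
 * Returns true for a FORGET, false for a regular request.
 * have_pending says whether regular requests are queued as well.
 */
static bool next_is_forget(struct queue_model *q, bool have_pending)
{
	/* The counter is only decremented when regular requests compete. */
	if (!have_pending || q->forget_batch-- > 0)
		return true;
	/* After 8 regular requests in a row, allow another 16 forgets. */
	if (q->forget_batch <= -8)
		q->forget_batch = 16;
	return false;
}

/* Count FORGETs among the next `reads` reads with both queues non-empty. */
static int count_forgets(struct queue_model *q, int reads)
{
	int forgets = 0;

	for (int i = 0; i < reads; i++)
		if (next_is_forget(q, true))
			forgets++;
	return forgets;
}
```

In this model, a counter starting at 16 yields 16 forgets followed by 8 regular requests, after which the cycle repeats, so neither queue starves the other.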
2015-01-12 07:22:16 +03:00
static int fuse_dev_open ( struct inode * inode , struct file * file )
{
/*
* The fuse device ' s file ' s private_data is used to hold
* the fuse_conn ( ection ) when it is mounted , and is used to
* keep track of whether the file has been mounted already .
*/
file - > private_data = NULL ;
return 0 ;
}
2015-04-04 04:53:39 +03:00
static ssize_t fuse_dev_read ( struct kiocb * iocb , struct iov_iter * to )
2010-05-25 17:06:07 +04:00
{
struct fuse_copy_state cs ;
struct file * file = iocb - > ki_filp ;
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud = fuse_get_dev ( file ) ;
if ( ! fud )
2010-05-25 17:06:07 +04:00
return - EPERM ;
2022-05-22 21:59:25 +03:00
if ( ! user_backed_iter ( to ) )
2015-04-04 04:53:39 +03:00
return - EINVAL ;
2015-07-01 17:25:58 +03:00
fuse_copy_init ( & cs , 1 , to ) ;
2010-05-25 17:06:07 +04:00
2015-07-01 17:26:09 +03:00
return fuse_dev_do_read ( fud , file , & cs , iov_iter_count ( to ) ) ;
2010-05-25 17:06:07 +04:00
}
static ssize_t fuse_dev_splice_read ( struct file * in , loff_t * ppos ,
struct pipe_inode_info * pipe ,
size_t len , unsigned int flags )
{
2016-09-18 05:56:25 +03:00
int total , ret ;
2010-05-25 17:06:07 +04:00
int page_nr = 0 ;
struct pipe_buffer * bufs ;
struct fuse_copy_state cs ;
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud = fuse_get_dev ( in ) ;
if ( ! fud )
2010-05-25 17:06:07 +04:00
return - EPERM ;
2019-10-16 18:47:32 +03:00
bufs = kvmalloc_array ( pipe - > max_usage , sizeof ( struct pipe_buffer ) ,
2018-07-17 19:00:34 +03:00
GFP_KERNEL ) ;
2010-05-25 17:06:07 +04:00
if ( ! bufs )
return - ENOMEM ;
2015-07-01 17:25:58 +03:00
fuse_copy_init ( & cs , 1 , NULL ) ;
2010-05-25 17:06:07 +04:00
cs . pipebufs = bufs ;
cs . pipe = pipe ;
2015-07-01 17:26:09 +03:00
ret = fuse_dev_do_read ( fud , in , & cs , len ) ;
2010-05-25 17:06:07 +04:00
if ( ret < 0 )
goto out ;
2019-10-16 18:47:32 +03:00
if ( pipe_occupancy ( pipe - > head , pipe - > tail ) + cs . nr_segs > pipe - > max_usage ) {
2010-05-25 17:06:07 +04:00
ret = - EIO ;
2016-09-18 05:56:25 +03:00
goto out ;
2010-05-25 17:06:07 +04:00
}
2016-09-18 05:56:25 +03:00
for ( ret = total = 0 ; page_nr < cs . nr_segs ; total + = ret ) {
2014-01-22 22:36:57 +04:00
/*
* Need to be careful about this . Having buf - > ops in module
* code can Oops if the buffer persists after module unload .
*/
2016-09-18 05:56:25 +03:00
bufs [ page_nr ] . ops = & nosteal_pipe_buf_ops ;
2017-02-16 17:08:20 +03:00
bufs [ page_nr ] . flags = 0 ;
2016-09-18 05:56:25 +03:00
ret = add_to_pipe ( pipe , & bufs [ page_nr + + ] ) ;
if ( unlikely ( ret < 0 ) )
break ;
2010-05-25 17:06:07 +04:00
}
2016-09-18 05:56:25 +03:00
if ( total )
ret = total ;
2010-05-25 17:06:07 +04:00
out :
for ( ; page_nr < cs . nr_segs ; page_nr + + )
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
put_page ( bufs [ page_nr ] . page ) ;
2010-05-25 17:06:07 +04:00
2018-07-17 19:00:34 +03:00
kvfree ( bufs ) ;
2010-05-25 17:06:07 +04:00
return ret ;
}
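The room check in the splice read path above compares the pipe's current occupancy plus the number of new segments against `max_usage`. A minimal model of that check, assuming head and tail are free-running unsigned counters (the function names here are hypothetical, not the kernel helpers):

```c
#include <errno.h>

/* Assumed model: head and tail only ever increase and may wrap around. */
static unsigned int occupancy(unsigned int head, unsigned int tail)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	return head - tail;
}

/* 0 if nr_segs more buffers fit into the pipe, -EIO otherwise. */
static int check_pipe_room(unsigned int head, unsigned int tail,
			   unsigned int nr_segs, unsigned int max_usage)
{
	return occupancy(head, tail) + nr_segs > max_usage ? -EIO : 0;
}
```

With 6 slots in use and a limit of 16, ten more segments still fit while eleven do not.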
2008-11-26 14:03:55 +03:00
static int fuse_notify_poll ( struct fuse_conn * fc , unsigned int size ,
struct fuse_copy_state * cs )
{
struct fuse_notify_poll_wakeup_out outarg ;
2009-01-26 17:00:59 +03:00
int err = - EINVAL ;
2008-11-26 14:03:55 +03:00
if ( size ! = sizeof ( outarg ) )
2009-01-26 17:00:59 +03:00
goto err ;
2008-11-26 14:03:55 +03:00
err = fuse_copy_one ( cs , & outarg , sizeof ( outarg ) ) ;
if ( err )
2009-01-26 17:00:59 +03:00
goto err ;
2008-11-26 14:03:55 +03:00
2009-01-26 17:00:59 +03:00
fuse_copy_finish ( cs ) ;
2008-11-26 14:03:55 +03:00
return fuse_notify_poll_wakeup ( fc , & outarg ) ;
2009-01-26 17:00:59 +03:00
err :
fuse_copy_finish ( cs ) ;
return err ;
2008-11-26 14:03:55 +03:00
}
2009-05-31 19:13:57 +04:00
static int fuse_notify_inval_inode ( struct fuse_conn * fc , unsigned int size ,
struct fuse_copy_state * cs )
{
struct fuse_notify_inval_inode_out outarg ;
int err = - EINVAL ;
if ( size ! = sizeof ( outarg ) )
goto err ;
err = fuse_copy_one ( cs , & outarg , sizeof ( outarg ) ) ;
if ( err )
goto err ;
fuse_copy_finish ( cs ) ;
down_read ( & fc - > killsb ) ;
2020-05-06 18:44:12 +03:00
err = fuse_reverse_inval_inode ( fc , outarg . ino ,
outarg . off , outarg . len ) ;
2009-05-31 19:13:57 +04:00
up_read ( & fc - > killsb ) ;
return err ;
err :
fuse_copy_finish ( cs ) ;
return err ;
}
static int fuse_notify_inval_entry ( struct fuse_conn * fc , unsigned int size ,
struct fuse_copy_state * cs )
{
struct fuse_notify_inval_entry_out outarg ;
2009-12-30 13:37:13 +03:00
int err = - ENOMEM ;
char * buf ;
2009-05-31 19:13:57 +04:00
struct qstr name ;
2009-12-30 13:37:13 +03:00
buf = kzalloc ( FUSE_NAME_MAX + 1 , GFP_KERNEL ) ;
if ( ! buf )
goto err ;
err = - EINVAL ;
2009-05-31 19:13:57 +04:00
if ( size < sizeof ( outarg ) )
goto err ;
err = fuse_copy_one ( cs , & outarg , sizeof ( outarg ) ) ;
if ( err )
goto err ;
err = - ENAMETOOLONG ;
if ( outarg . namelen > FUSE_NAME_MAX )
goto err ;
2011-08-24 12:20:17 +04:00
err = - EINVAL ;
if ( size ! = sizeof ( outarg ) + outarg . namelen + 1 )
goto err ;
2009-05-31 19:13:57 +04:00
name . name = buf ;
name . len = outarg . namelen ;
err = fuse_copy_one ( cs , buf , outarg . namelen + 1 ) ;
if ( err )
goto err ;
fuse_copy_finish ( cs ) ;
buf [ outarg . namelen ] = 0 ;
down_read ( & fc - > killsb ) ;
2022-10-28 15:25:21 +03:00
err = fuse_reverse_inval_entry ( fc , outarg . parent , 0 , & name , outarg . flags ) ;
2011-12-07 00:50:06 +04:00
up_read ( & fc - > killsb ) ;
kfree ( buf ) ;
return err ;
err :
kfree ( buf ) ;
fuse_copy_finish ( cs ) ;
return err ;
}
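The size validation in fuse_notify_inval_entry above expects the fixed part of the message, then the name, then one terminating NUL byte. The checks can be sketched in isolation as follows; `struct inval_entry_fixed` is a hypothetical 16-byte stand-in for the real outarg structure, and `NAME_MAX_ASSUMED` assumes FUSE_NAME_MAX is 1024.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of FUSE_NAME_MAX; the real definition lives in fuse_i.h. */
#define NAME_MAX_ASSUMED 1024u

/* Hypothetical stand-in for the fixed part of the notification (16 bytes). */
struct inval_entry_fixed {
	uint64_t parent;
	uint32_t namelen;
	uint32_t flags;
};

/* Apply the size checks in the same order as the function above. */
static int check_name_payload(size_t size, uint32_t namelen)
{
	if (size < sizeof(struct inval_entry_fixed))
		return -EINVAL;
	if (namelen > NAME_MAX_ASSUMED)
		return -ENAMETOOLONG;
	/* Fixed part, then the name, then one terminating NUL byte. */
	if (size != sizeof(struct inval_entry_fixed) + namelen + 1)
		return -EINVAL;
	return 0;
}
```

A server that forgets the trailing NUL byte, sending 19 bytes for a 3-character name, gets -EINVAL in this model, while an over-long name is reported as -ENAMETOOLONG before the exact-size check runs.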
static int fuse_notify_delete ( struct fuse_conn * fc , unsigned int size ,
struct fuse_copy_state * cs )
{
struct fuse_notify_delete_out outarg ;
int err = - ENOMEM ;
char * buf ;
struct qstr name ;
buf = kzalloc ( FUSE_NAME_MAX + 1 , GFP_KERNEL ) ;
if ( ! buf )
goto err ;
err = - EINVAL ;
if ( size < sizeof ( outarg ) )
goto err ;
err = fuse_copy_one ( cs , & outarg , sizeof ( outarg ) ) ;
if ( err )
goto err ;
err = - ENAMETOOLONG ;
if ( outarg . namelen > FUSE_NAME_MAX )
goto err ;
err = - EINVAL ;
if ( size ! = sizeof ( outarg ) + outarg . namelen + 1 )
goto err ;
name . name = buf ;
name . len = outarg . namelen ;
err = fuse_copy_one ( cs , buf , outarg . namelen + 1 ) ;
if ( err )
goto err ;
fuse_copy_finish ( cs ) ;
buf [ outarg . namelen ] = 0 ;
down_read ( & fc - > killsb ) ;
2022-10-28 15:25:21 +03:00
err = fuse_reverse_inval_entry ( fc , outarg . parent , outarg . child , & name , 0 ) ;
2009-05-31 19:13:57 +04:00
up_read ( & fc - > killsb ) ;
2009-12-30 13:37:13 +03:00
kfree ( buf ) ;
2009-05-31 19:13:57 +04:00
return err ;
err :
2009-12-30 13:37:13 +03:00
kfree ( buf ) ;
2009-05-31 19:13:57 +04:00
fuse_copy_finish ( cs ) ;
return err ;
}
2010-07-12 16:41:40 +04:00
static int fuse_notify_store ( struct fuse_conn * fc , unsigned int size ,
struct fuse_copy_state * cs )
{
struct fuse_notify_store_out outarg ;
struct inode * inode ;
struct address_space * mapping ;
u64 nodeid ;
int err ;
pgoff_t index ;
unsigned int offset ;
unsigned int num ;
loff_t file_size ;
loff_t end ;
err = - EINVAL ;
if ( size < sizeof ( outarg ) )
goto out_finish ;
err = fuse_copy_one ( cs , & outarg , sizeof ( outarg ) ) ;
if ( err )
goto out_finish ;
err = - EINVAL ;
if ( size - sizeof ( outarg ) ! = outarg . size )
goto out_finish ;
nodeid = outarg . nodeid ;
down_read ( & fc - > killsb ) ;
err = - ENOENT ;
2020-05-06 18:44:12 +03:00
inode = fuse_ilookup ( fc , nodeid , NULL ) ;
2010-07-12 16:41:40 +04:00
if ( ! inode )
goto out_up_killsb ;
mapping = inode - > i_mapping ;
2016-04-01 15:29:47 +03:00
index = outarg . offset > > PAGE_SHIFT ;
offset = outarg . offset & ~ PAGE_MASK ;
2010-07-12 16:41:40 +04:00
file_size = i_size_read ( inode ) ;
end = outarg . offset + outarg . size ;
if ( end > file_size ) {
file_size = end ;
2021-10-22 18:03:02 +03:00
fuse_write_update_attr ( inode , file_size , outarg . size ) ;
2010-07-12 16:41:40 +04:00
}
num = outarg . size ;
while ( num ) {
struct page * page ;
unsigned int this_num ;
err = - ENOMEM ;
page = find_or_create_page ( mapping , index ,
mapping_gfp_mask ( mapping ) ) ;
if ( ! page )
goto out_iput ;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
		this_num = min_t(unsigned, num, PAGE_SIZE - offset);
2010-07-12 16:41:40 +04:00
		err = fuse_copy_page(cs, &page, offset, this_num, 0);
2014-01-22 22:36:58 +04:00
		if (!err && offset == 0 &&
2016-04-01 15:29:47 +03:00
		    (this_num == PAGE_SIZE || file_size == end))
2010-07-12 16:41:40 +04:00
			SetPageUptodate(page);
		unlock_page(page);
2016-04-01 15:29:47 +03:00
		put_page(page);
2010-07-12 16:41:40 +04:00
		if (err)
			goto out_iput;
		num -= this_num;
		offset = 0;
		index++;
	}
	err = 0;
out_iput:
	iput(inode);
out_up_killsb:
	up_read(&fc->killsb);
out_finish:
	fuse_copy_finish(cs);
	return err;
}
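The store loop above walks a byte range one page at a time: the first page may start mid-page, and every later page starts at offset 0. The following is a minimal userspace sketch of that arithmetic only, assuming 4 KiB pages. `SKETCH_PAGE_SHIFT`, `struct chunk` and `split_range` are hypothetical names introduced for illustration, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel's PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Hypothetical record: the part of one page touched by a byte range. */
struct chunk {
	uint64_t index;      /* page index within the file */
	unsigned int offset; /* byte offset inside that page */
	unsigned int length; /* bytes covered in that page */
};

/*
 * Split [pos, pos + size) into per-page chunks, mirroring the
 * index / offset / this_num bookkeeping of the loop above.
 * Returns the number of chunks written (at most max_chunks).
 */
static size_t split_range(uint64_t pos, unsigned int size,
			  struct chunk *out, size_t max_chunks)
{
	uint64_t index = pos >> SKETCH_PAGE_SHIFT;
	/* Same value as the kernel's "pos & ~PAGE_MASK". */
	unsigned int offset = (unsigned int)(pos & (SKETCH_PAGE_SIZE - 1));
	size_t n = 0;

	while (size && n < max_chunks) {
		unsigned int room = SKETCH_PAGE_SIZE - offset;
		unsigned int this_num = size < room ? size : room;

		out[n].index = index;
		out[n].offset = offset;
		out[n].length = this_num;
		n++;

		size -= this_num;
		offset = 0;	/* only the first page can start mid-page */
		index++;
	}
	return n;
}
```

For example, a 5000-byte range starting at byte 4000 would be split into three chunks: the last 96 bytes of page 0, all of page 1, and the first 808 bytes of page 2.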
2019-09-10 16:04:11 +03:00
struct fuse_retrieve_args {
	struct fuse_args_pages ap;
	struct fuse_notify_retrieve_in inarg;
};
2020-05-06 18:44:12 +03:00
static void fuse_retrieve_end(struct fuse_mount *fm, struct fuse_args *args,
2019-09-10 16:04:11 +03:00
			      int error)
2010-07-12 16:41:40 +04:00
{
2019-09-10 16:04:11 +03:00
	struct fuse_retrieve_args *ra =
		container_of(args, typeof(*ra), ap.args);
	release_pages(ra->ap.pages, ra->ap.num_pages);
	kfree(ra);
2010-07-12 16:41:40 +04:00
}
2020-05-06 18:44:12 +03:00
static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
2010-07-12 16:41:40 +04:00
			 struct fuse_notify_retrieve_out *outarg)
{
	int err;
	struct address_space *mapping = inode->i_mapping;
	pgoff_t index;
	loff_t file_size;
	unsigned int num;
	unsigned int offset;
2010-10-01 00:06:21 +04:00
	size_t total_len = 0;
fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided later.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 15:37:06 +03:00
	unsigned int num_pages;
2020-05-06 18:44:12 +03:00
	struct fuse_conn *fc = fm->fc;
2019-09-10 16:04:11 +03:00
	struct fuse_retrieve_args *ra;
	size_t args_size = sizeof(*ra);
	struct fuse_args_pages *ap;
	struct fuse_args *args;
2010-07-12 16:41:40 +04:00
2016-04-01 15:29:47 +03:00
	offset = outarg->offset & ~PAGE_MASK;
2012-10-26 19:48:42 +04:00
	file_size = i_size_read(inode);
fuse: retrieve: cap requested size to negotiated max_write
FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header + that
max_write bytes. A filesystem server is free to set its max_write
anywhere in the range [1*page, fc->max_pages*page]. In particular
go-fuse[2] sets max_write to 64K by default, whereas the default fc->max_pages
corresponds to 128K. Libfuse also allows users to configure max_write, but
by default presets it to the possible maximum.
If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract, will
be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server request
buffer. In turn the filesystem server could get stuck waiting indefinitely
for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is
understood by clients as that NOTIFY_REPLY was queued and will be sent
back.
Cap requested size to negotiated max_write to avoid the problem. This
aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages. This
way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than
was originally requested.
Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for a more involved patch that
did not make it into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-03-27 13:15:19 +03:00
	num = min(outarg->size, fc->max_write);
2012-10-26 19:48:42 +04:00
	if (outarg->offset > file_size)
		num = 0;
	else if (outarg->offset + num > file_size)
		num = file_size - outarg->offset;
	num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
2018-09-06 15:37:06 +03:00
	num_pages = min(num_pages, fc->max_pages);
2012-10-26 19:48:42 +04:00
2019-09-10 16:04:11 +03:00
	args_size += num_pages * (sizeof(ap->pages[0]) + sizeof(ap->descs[0]));
2010-07-12 16:41:40 +04:00
2019-09-10 16:04:11 +03:00
	ra = kzalloc(args_size, GFP_KERNEL);
	if (!ra)
		return -ENOMEM;
	ap = &ra->ap;
	ap->pages = (void *)(ra + 1);
	ap->descs = (void *)(ap->pages + num_pages);
	args = &ap->args;
	args->nodeid = outarg->nodeid;
	args->opcode = FUSE_NOTIFY_REPLY;
	args->in_numargs = 2;
	args->in_pages = true;
	args->end = fuse_retrieve_end;
2010-07-12 16:41:40 +04:00
2016-04-01 15:29:47 +03:00
	index = outarg->offset >> PAGE_SHIFT;
2010-07-12 16:41:40 +04:00
2019-09-10 16:04:11 +03:00
	while (num && ap->num_pages < num_pages) {
2010-07-12 16:41:40 +04:00
		struct page *page;
		unsigned int this_num;
		page = find_get_page(mapping, index);
		if (!page)
			break;
2016-04-01 15:29:47 +03:00
		this_num = min_t(unsigned, num, PAGE_SIZE - offset);
2019-09-10 16:04:11 +03:00
		ap->pages[ap->num_pages] = page;
		ap->descs[ap->num_pages].offset = offset;
		ap->descs[ap->num_pages].length = this_num;
		ap->num_pages++;
2010-07-12 16:41:40 +04:00
2012-09-04 20:45:54 +04:00
		offset = 0;
2010-07-12 16:41:40 +04:00
		num -= this_num;
		total_len += this_num;
2011-12-13 13:36:59 +04:00
		index++;
2010-07-12 16:41:40 +04:00
	}
2019-09-10 16:04:11 +03:00
	ra->inarg.offset = outarg->offset;
	ra->inarg.size = total_len;
	args->in_args[0].size = sizeof(ra->inarg);
	args->in_args[0].value = &ra->inarg;
	args->in_args[1].size = total_len;
2010-07-12 16:41:40 +04:00
2020-05-06 18:44:12 +03:00
	err = fuse_simple_notify_reply(fm, args, outarg->notify_unique);
2019-09-10 16:04:11 +03:00
	if (err)
2020-05-06 18:44:12 +03:00
		fuse_retrieve_end(fm, args, err);
2010-07-12 16:41:40 +04:00
	return err;
}
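The sizing logic of fuse_retrieve() above can be read as two steps: cap the requested byte count (first to max_write, then to the end of file), and convert the result into a page count capped by max_pages. The following is a rough userspace sketch of just that arithmetic, assuming 4 KiB pages. `struct limits`, `retrieve_len` and `retrieve_pages` are hypothetical names standing in for the `fc->max_write` / `fc->max_pages` handling, not kernel functions.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel's PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the limits negotiated on the connection. */
struct limits {
	unsigned int max_write; /* bytes */
	unsigned int max_pages; /* pages per request */
};

/* Bytes to retrieve: capped to max_write, then clamped to end of file. */
static unsigned int retrieve_len(uint64_t pos, unsigned int size,
				 uint64_t file_size, const struct limits *lim)
{
	unsigned int num = size < lim->max_write ? size : lim->max_write;

	if (pos > file_size)
		num = 0;
	else if (pos + num > file_size)
		num = (unsigned int)(file_size - pos);
	return num;
}

/* Pages spanned by num bytes starting at pos, capped to max_pages. */
static unsigned int retrieve_pages(uint64_t pos, unsigned int num,
				   const struct limits *lim)
{
	unsigned int offset = (unsigned int)(pos & (SKETCH_PAGE_SIZE - 1));
	unsigned int pages =
		(num + offset + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;

	return pages < lim->max_pages ? pages : lim->max_pages;
}
```

With max_write set to 64K, a 128K request at offset 100 would be capped to 65536 bytes. Those bytes span 17 pages, because the range does not start on a page boundary.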
static int fuse_notify_retrieve(struct fuse_conn *fc, unsigned int size,
				struct fuse_copy_state *cs)
{
	struct fuse_notify_retrieve_out outarg;
2020-05-06 18:44:12 +03:00
	struct fuse_mount *fm;
2010-07-12 16:41:40 +04:00
	struct inode *inode;
2020-05-06 18:44:12 +03:00
	u64 nodeid;
2010-07-12 16:41:40 +04:00
	int err;
	err = -EINVAL;
	if (size != sizeof(outarg))
		goto copy_finish;
	err = fuse_copy_one(cs, &outarg, sizeof(outarg));
	if (err)
		goto copy_finish;
	fuse_copy_finish(cs);
	down_read(&fc->killsb);
	err = -ENOENT;
2020-05-06 18:44:12 +03:00
	nodeid = outarg.nodeid;
2010-07-12 16:41:40 +04:00
2020-05-06 18:44:12 +03:00
	inode = fuse_ilookup(fc, nodeid, &fm);
	if (inode) {
		err = fuse_retrieve(fm, inode, &outarg);
		iput(inode);
2010-07-12 16:41:40 +04:00
	}
	up_read(&fc->killsb);
	return err;
copy_finish:
	fuse_copy_finish(cs);
	return err;
}
2008-11-26 14:03:55 +03:00
static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
		       unsigned int size, struct fuse_copy_state *cs)
{
2015-02-26 13:45:47 +03:00
	/* Don't try to move pages (yet) */
	cs->move_pages = 0;
2008-11-26 14:03:55 +03:00
	switch (code) {
2008-11-26 14:03:55 +03:00
	case FUSE_NOTIFY_POLL:
		return fuse_notify_poll(fc, size, cs);
2009-05-31 19:13:57 +04:00
	case FUSE_NOTIFY_INVAL_INODE:
		return fuse_notify_inval_inode(fc, size, cs);
	case FUSE_NOTIFY_INVAL_ENTRY:
		return fuse_notify_inval_entry(fc, size, cs);
2010-07-12 16:41:40 +04:00
	case FUSE_NOTIFY_STORE:
		return fuse_notify_store(fc, size, cs);
2010-07-12 16:41:40 +04:00
	case FUSE_NOTIFY_RETRIEVE:
		return fuse_notify_retrieve(fc, size, cs);
2011-12-07 00:50:06 +04:00
	case FUSE_NOTIFY_DELETE:
		return fuse_notify_delete(fc, size, cs);
2008-11-26 14:03:55 +03:00
	default:
2009-01-26 17:00:59 +03:00
		fuse_copy_finish(cs);
2008-11-26 14:03:55 +03:00
		return -EINVAL;
	}
}
2005-09-10 00:10:27 +04:00
/* Look up request on processing list by unique ID */
2015-07-01 17:26:04 +03:00
static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
2005-09-10 00:10:27 +04:00
{
2018-09-11 13:12:14 +03:00
	unsigned int hash = fuse_req_hash(unique);
2013-07-31 06:50:01 +04:00
	struct fuse_req *req;
2005-09-10 00:10:27 +04:00
2018-09-11 13:12:14 +03:00
	list_for_each_entry(req, &fpq->processing[hash], list) {
2018-09-11 13:12:05 +03:00
		if (req->in.h.unique == unique)
2005-09-10 00:10:27 +04:00
			return req;
	}
	return NULL;
}
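request_find() above looks a request up by hashing its unique ID into a bucket and then scanning only that bucket. The same pattern can be sketched in plain userspace C as below. All `sketch_*` names are hypothetical, the buckets are simple singly linked lists rather than the kernel's `list_head`, and the bucket function (low bits of the ID) is an assumption that may differ from what fuse_req_hash() does.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_BITS 8
#define SKETCH_HASH_SIZE (1u << SKETCH_HASH_BITS)

/* Hypothetical request: only the fields needed for the lookup. */
struct sketch_req {
	uint64_t unique;
	struct sketch_req *next;
};

/* Hypothetical processing queue: one list head per hash bucket. */
struct sketch_pqueue {
	struct sketch_req *processing[SKETCH_HASH_SIZE];
};

/* Assumption: the low bits pick the bucket; the kernel hash may differ. */
static unsigned int sketch_hash(uint64_t unique)
{
	return (unsigned int)(unique & (SKETCH_HASH_SIZE - 1));
}

/* Put a request at the head of its bucket. */
static void sketch_add(struct sketch_pqueue *pq, struct sketch_req *req)
{
	unsigned int h = sketch_hash(req->unique);

	req->next = pq->processing[h];
	pq->processing[h] = req;
}

/* Scan a single bucket; IDs that collide are told apart by full compare. */
static struct sketch_req *sketch_find(struct sketch_pqueue *pq, uint64_t unique)
{
	struct sketch_req *req;

	for (req = pq->processing[sketch_hash(unique)]; req; req = req->next)
		if (req->unique == unique)
			return req;
	return NULL;
}
```

The design point is that a reply only needs a walk over one short bucket instead of every in-flight request, while the full 64-bit comparison keeps colliding IDs distinct.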
2019-09-10 16:04:11 +03:00
static int copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
2005-09-10 00:10:27 +04:00
			 unsigned nbytes)
{
	unsigned reqsize = sizeof(struct fuse_out_header);
2018-06-21 11:34:25 +03:00
	reqsize += fuse_len_args(args->out_numargs, args->out_args);
2005-09-10 00:10:27 +04:00
2019-09-10 16:04:11 +03:00
	if (reqsize < nbytes || (reqsize > nbytes && !args->out_argvar))
2005-09-10 00:10:27 +04:00
		return -EINVAL;
	else if (reqsize > nbytes) {
2019-09-10 16:04:11 +03:00
		struct fuse_arg *lastarg = &args->out_args[args->out_numargs - 1];
2005-09-10 00:10:27 +04:00
		unsigned diffsize = reqsize - nbytes;
2019-09-10 16:04:11 +03:00
2005-09-10 00:10:27 +04:00
		if (diffsize > lastarg->size)
			return -EINVAL;
		lastarg->size -= diffsize;
	}
2019-09-10 16:04:11 +03:00
	return fuse_copy_args(cs, args->out_numargs, args->out_pages,
			      args->out_args, args->page_zeroing);
2005-09-10 00:10:27 +04:00
}
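The size check in copy_out_args() above accepts a reply that is exactly the expected size, or a shorter one when the last argument is variable-length, in which case the last argument is shrunk by the shortfall. The following is a simplified userspace sketch of that rule. `struct sketch_arg` and `fit_reply` are hypothetical, the header size is passed in as a plain number, and -1 stands in for -EINVAL. The explicit `numargs == 0` guard is an addition of this sketch, not something the code above contains.

```c
#include <stddef.h>

/* Hypothetical, simplified argument descriptor: only the size matters here. */
struct sketch_arg {
	unsigned int size;
};

/*
 * Reconcile the expected reply size with the nbytes actually written.
 * argvar is non-zero when the last argument may be shorter than expected.
 * Returns 0 on success (possibly shrinking the last argument), -1 otherwise.
 */
static int fit_reply(unsigned int hdr_size, struct sketch_arg *args,
		     size_t numargs, int argvar, unsigned int nbytes)
{
	unsigned int reqsize = hdr_size;
	size_t i;

	for (i = 0; i < numargs; i++)
		reqsize += args[i].size;

	/* Too long is never allowed; too short only with a variable last arg. */
	if (reqsize < nbytes || (reqsize > nbytes && !argvar))
		return -1;

	if (reqsize > nbytes) {
		unsigned int diffsize = reqsize - nbytes;

		/* Guard added in this sketch: there must be a last argument. */
		if (numargs == 0)
			return -1;
		if (diffsize > args[numargs - 1].size)
			return -1;
		args[numargs - 1].size -= diffsize;
	}
	return 0;
}
```

For instance, with a 16-byte header and arguments of 8 and 100 bytes, a 64-byte reply would be accepted when the last argument is variable, and that argument's size would drop from 100 to 40.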
/*
 * Write a single reply to a request.  First the header is copied from
 * the write buffer.  The request is then searched on the processing
 * list by the unique ID found in the header.  If found, then remove
 * it from the list and copy the rest of the buffer to the request.
2018-06-21 11:33:40 +03:00
 * The request is finished by calling fuse_request_end().
2005-09-10 00:10:27 +04:00
 */
2015-07-01 17:26:09 +03:00
static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
fuse: support splice() writing to fuse device
Allow userspace filesystem implementation to use splice() to write to
the fuse device. The semantics of using splice() are:
1) buffer the message header and data in a temporary pipe
2) with a *single* splice() call move the message from the temporary pipe
to the fuse device
The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be transferred into the
fuse device without having to go through a userspace buffer. It will
also allow zero copy moving of pages.
One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header. But the length of
the data transferred into the temporary pipe may not be known in
advance. The current library implementation works around this by
using vmsplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):
struct fuse_out_header out;
iov.iov_base = &out;
iov.iov_len = sizeof(struct fuse_out_header);
vmsplice(pip[1], &iov, 1, 0);
len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
/* retrospectively modify the header: */
out.len = len + sizeof(struct fuse_out_header);
splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);
This works since vmsplice only saves a pointer to the data, it does
not copy the data itself.
Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 17:06:06 +04:00
				 struct fuse_copy_state *cs, size_t nbytes)
2005-09-10 00:10:27 +04:00
{
	int err;
2015-07-01 17:26:09 +03:00
	struct fuse_conn *fc = fud->fc;
	struct fuse_pqueue *fpq = &fud->pq;
2005-09-10 00:10:27 +04:00
	struct fuse_req *req;
	struct fuse_out_header oh;
2018-11-08 12:05:36 +03:00
	err = -EINVAL;
2005-09-10 00:10:27 +04:00
	if (nbytes < sizeof(struct fuse_out_header))
2018-11-08 12:05:36 +03:00
		goto out;
2005-09-10 00:10:27 +04:00
2010-05-25 17:06:06 +04:00
	err = fuse_copy_one(cs, &oh, sizeof(oh));
2005-09-10 00:10:27 +04:00
	if (err)
2018-11-08 12:05:36 +03:00
		goto copy_finish;
2008-11-26 14:03:55 +03:00
	err = -EINVAL;
	if (oh.len != nbytes)
2018-11-08 12:05:36 +03:00
		goto copy_finish;
2008-11-26 14:03:55 +03:00
	/*
	 * Zero oh.unique indicates unsolicited notification message
	 * and error contains notification code.
	 */
	if (!oh.unique) {
2010-05-25 17:06:06 +04:00
		err = fuse_notify(fc, oh.error, nbytes - sizeof(oh), cs);
2018-11-08 12:05:36 +03:00
		goto out;
2008-11-26 14:03:55 +03:00
	}
2005-09-10 00:10:27 +04:00
	err = -EINVAL;
2021-06-22 10:15:35 +03:00
	if (oh.error <= -512 || oh.error > 0)
2018-11-08 12:05:36 +03:00
		goto copy_finish;
2005-09-10 00:10:27 +04:00
2015-07-01 17:26:06 +03:00
	spin_lock(&fpq->lock);
2018-11-08 12:05:36 +03:00
	req = NULL;
	if (fpq->connected)
		req = request_find(fpq, oh.unique & ~FUSE_INT_REQ_BIT);
2006-01-17 09:14:41 +03:00
2018-11-08 12:05:36 +03:00
	err = -ENOENT;
	if (!req) {
		spin_unlock(&fpq->lock);
		goto copy_finish;
	}
2005-09-10 00:10:27 +04:00
2018-09-11 13:12:05 +03:00
/* Is it an interrupt reply ID? */
if ( oh . unique & FUSE_INT_REQ_BIT ) {
2018-09-25 12:52:42 +03:00
__fuse_get_request ( req ) ;
2015-07-01 17:26:06 +03:00
spin_unlock ( & fpq - > lock ) ;
2018-11-08 12:05:36 +03:00
err = 0 ;
if ( nbytes ! = sizeof ( struct fuse_out_header ) )
err = - EINVAL ;
else if ( oh . error = = - ENOSYS )
2006-06-25 16:48:54 +04:00
fc - > no_interrupt = 1 ;
else if ( oh . error = = - EAGAIN )
2020-04-20 18:59:34 +03:00
err = queue_interrupt ( req ) ;
2018-11-08 12:05:36 +03:00
2020-04-20 18:59:34 +03:00
fuse_put_request ( req ) ;
2006-06-25 16:48:54 +04:00
2018-11-08 12:05:36 +03:00
goto copy_finish ;
2006-06-25 16:48:54 +04:00
}
2015-07-01 17:26:01 +03:00
clear_bit ( FR_SENT , & req - > flags ) ;
2015-07-01 17:26:04 +03:00
list_move ( & req - > list , & fpq - > io ) ;
2005-09-10 00:10:27 +04:00
req - > out . h = oh ;
2015-07-01 17:25:58 +03:00
set_bit ( FR_LOCKED , & req - > flags ) ;
2015-07-01 17:26:06 +03:00
spin_unlock ( & fpq - > lock ) ;
2010-05-25 17:06:06 +04:00
	cs->req = req;
2019-09-10 16:04:11 +03:00
	if (!req->args->page_replace)
2010-05-25 17:06:07 +04:00
		cs->move_pages = 0;
2005-09-10 00:10:27 +04:00
2019-09-10 16:04:11 +03:00
	if (oh.error)
		err = nbytes != sizeof(oh) ? -EINVAL : 0;
	else
		err = copy_out_args(cs, req->args, nbytes);
2010-05-25 17:06:06 +04:00
	fuse_copy_finish(cs);
2005-09-10 00:10:27 +04:00
2015-07-01 17:26:06 +03:00
	spin_lock(&fpq->lock);
2015-07-01 17:25:58 +03:00
	clear_bit(FR_LOCKED, &req->flags);
2015-07-01 17:26:04 +03:00
	if (!fpq->connected)
2015-07-01 17:25:58 +03:00
		err = -ENOENT;
	else if (err)
2005-09-10 00:10:27 +04:00
		req->out.h.error = -EIO;
2015-07-01 17:26:06 +03:00
	if (!test_bit(FR_PRIVATE, &req->flags))
		list_del_init(&req->list);
2015-07-01 17:26:06 +03:00
	spin_unlock(&fpq->lock);
2015-07-01 17:26:07 +03:00
2020-04-20 18:59:34 +03:00
	fuse_request_end(req);
2018-11-08 12:05:36 +03:00
out:
2005-09-10 00:10:27 +04:00
	return err ? err : nbytes;
2018-11-08 12:05:36 +03:00
copy_finish:
2010-05-25 17:06:06 +04:00
	fuse_copy_finish(cs);
2018-11-08 12:05:36 +03:00
	goto out;
2005-09-10 00:10:27 +04:00
}
2015-04-04 04:53:39 +03:00
static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
2010-05-25 17:06:06 +04:00
{
	struct fuse_copy_state cs;
2015-07-01 17:26:08 +03:00
	struct fuse_dev *fud = fuse_get_dev(iocb->ki_filp);
	if (!fud)
2010-05-25 17:06:06 +04:00
		return -EPERM;
2022-05-22 21:59:25 +03:00
	if (!user_backed_iter(from))
2015-04-04 04:53:39 +03:00
		return -EINVAL;
2015-07-01 17:25:58 +03:00
	fuse_copy_init(&cs, 0, from);
2010-05-25 17:06:06 +04:00
2015-07-01 17:26:09 +03:00
	return fuse_dev_do_write(fud, &cs, iov_iter_count(from));
2010-05-25 17:06:06 +04:00
}
static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
				     struct file *out, loff_t *ppos,
				     size_t len, unsigned int flags)
{
2019-11-15 16:30:32 +03:00
	unsigned int head, tail, mask, count;
2010-05-25 17:06:06 +04:00
	unsigned nbuf;
	unsigned idx;
	struct pipe_buffer *bufs;
	struct fuse_copy_state cs;
2015-07-01 17:26:08 +03:00
	struct fuse_dev *fud;
2010-05-25 17:06:06 +04:00
	size_t rem;
	ssize_t ret;
2015-07-01 17:26:08 +03:00
	fud = fuse_get_dev(out);
	if (!fud)
2010-05-25 17:06:06 +04:00
		return -EPERM;
2018-07-17 19:00:33 +03:00
	pipe_lock(pipe);
2019-11-15 16:30:32 +03:00
	head = pipe->head;
	tail = pipe->tail;
	mask = pipe->ring_size - 1;
	count = head - tail;
	bufs = kvmalloc_array(count, sizeof(struct pipe_buffer), GFP_KERNEL);
2018-07-17 19:00:33 +03:00
	if (!bufs) {
		pipe_unlock(pipe);
2010-05-25 17:06:06 +04:00
		return -ENOMEM;
2018-07-17 19:00:33 +03:00
	}
2010-05-25 17:06:06 +04:00
	nbuf = 0;
	rem = 0;
2019-12-07 00:34:51 +03:00
	for (idx = tail; idx != head && rem < len; idx++)
2019-11-15 16:30:32 +03:00
		rem += pipe->bufs[idx & mask].len;
2010-05-25 17:06:06 +04:00
	ret = -EINVAL;
2019-04-06 00:02:10 +03:00
	if (rem < len)
		goto out_free;
2010-05-25 17:06:06 +04:00
	rem = len;
	while (rem) {
		struct pipe_buffer *ibuf;
		struct pipe_buffer *obuf;
2019-08-19 09:53:50 +03:00
		if (WARN_ON(nbuf >= count || tail == head))
			goto out_free;
2019-11-15 16:30:32 +03:00
		ibuf = &pipe->bufs[tail & mask];
2010-05-25 17:06:06 +04:00
		obuf = &bufs[nbuf];
		if (rem >= ibuf->len) {
			*obuf = *ibuf;
			ibuf->ops = NULL;
2019-11-15 16:30:32 +03:00
			tail++;
			pipe->tail = tail;
2010-05-25 17:06:06 +04:00
		} else {
2019-04-06 00:02:10 +03:00
			if (!pipe_buf_get(pipe, ibuf))
				goto out_free;
2010-05-25 17:06:06 +04:00
* obuf = * ibuf ;
obuf - > flags & = ~ PIPE_BUF_FLAG_GIFT ;
obuf - > len = rem ;
ibuf - > offset + = obuf - > len ;
ibuf - > len - = obuf - > len ;
}
nbuf + + ;
rem - = obuf - > len ;
}
pipe_unlock ( pipe ) ;
2015-07-01 17:25:58 +03:00
fuse_copy_init ( & cs , 0 , NULL ) ;
2010-05-25 17:06:06 +04:00
cs . pipebufs = bufs ;
2015-04-04 05:06:08 +03:00
cs . nr_segs = nbuf ;
2010-05-25 17:06:06 +04:00
cs . pipe = pipe ;
2010-05-25 17:06:07 +04:00
if ( flags & SPLICE_F_MOVE )
cs . move_pages = 1 ;
2015-07-01 17:26:09 +03:00
ret = fuse_dev_do_write ( fud , & cs , len ) ;
2010-05-25 17:06:06 +04:00
2019-01-12 04:39:05 +03:00
pipe_lock ( pipe ) ;
2019-04-06 00:02:10 +03:00
out_free :
2021-11-02 13:10:37 +03:00
for ( idx = 0 ; idx < nbuf ; idx + + ) {
struct pipe_buffer * buf = & bufs [ idx ] ;
if ( buf - > ops )
pipe_buf_release ( pipe , buf ) ;
}
2019-01-12 04:39:05 +03:00
pipe_unlock ( pipe ) ;
2016-09-27 11:45:12 +03:00
2018-07-17 19:00:34 +03:00
kvfree ( bufs ) ;
2010-05-25 17:06:06 +04:00
return ret ;
}
2017-07-03 08:02:18 +03:00
static __poll_t fuse_dev_poll ( struct file * file , poll_table * wait )
2005-09-10 00:10:27 +04:00
{
2018-02-12 01:34:03 +03:00
__poll_t mask = EPOLLOUT | EPOLLWRNORM ;
2015-07-01 17:26:01 +03:00
struct fuse_iqueue * fiq ;
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud = fuse_get_dev ( file ) ;
if ( ! fud )
2018-02-12 01:34:03 +03:00
return EPOLLERR ;
2005-09-10 00:10:27 +04:00
2015-07-01 17:26:08 +03:00
fiq = & fud - > fc - > iq ;
2015-07-01 17:26:01 +03:00
poll_wait ( file , & fiq - > waitq , wait ) ;
2005-09-10 00:10:27 +04:00
fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
This may have to wait for fuse_iqueue::waitq.lock to be released by one
of many places that take it with IRQs enabled. Since the IRQ handler
may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
Fix it by protecting the state of struct fuse_iqueue with a separate
spinlock, and only accessing fuse_iqueue::waitq using the versions of
the waitqueue functions which do IRQ-safe locking internally.
Reproducer:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>
int main()
{
char opts[128];
int fd = open("/dev/fuse", O_RDWR);
aio_context_t ctx = 0;
struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
struct iocb *cbp = &cb;
sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
mkdir("mnt", 0700);
mount("foo", "mnt", "fuse", 0, opts);
syscall(__NR_io_setup, 1, &ctx);
syscall(__NR_io_submit, ctx, 1, &cbp);
}
Beginning of lockdep output:
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.3.0-rc5 #9 Not tainted
-----------------------------------------------------
syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
and this task is already holding:
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
which would create a new lock dependency:
(&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&(&ctx->ctx_lock)->rlock){..-.}
[...]
Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.19+
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2015-07-01 17:26:01 +03:00
if ( ! fiq - > connected )
2018-02-12 01:34:03 +03:00
mask = EPOLLERR ;
2015-07-01 17:26:01 +03:00
else if ( request_pending ( fiq ) )
2018-02-12 01:34:03 +03:00
mask | = EPOLLIN | EPOLLRDNORM ;
2019-09-09 06:15:18 +03:00
spin_unlock ( & fiq - > lock ) ;
2005-09-10 00:10:27 +04:00
return mask ;
}
2018-11-06 12:15:20 +03:00
/* Abort all requests on the given list (pending or processing) */
2020-04-20 18:59:34 +03:00
static void end_requests ( struct list_head * head )
2005-09-10 00:10:27 +04:00
{
while ( ! list_empty ( head ) ) {
struct fuse_req * req ;
req = list_entry ( head - > next , struct fuse_req , list ) ;
req - > out . h . error = - ECONNABORTED ;
2015-07-01 17:26:01 +03:00
clear_bit ( FR_SENT , & req - > flags ) ;
2015-07-01 17:26:04 +03:00
list_del_init ( & req - > list ) ;
2020-04-20 18:59:34 +03:00
fuse_request_end ( req ) ;
2005-09-10 00:10:27 +04:00
}
}
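The drain loop in end_requests() can be modelled in userspace roughly as below. This is a simplified sketch, not the kernel implementation: `struct model_req` and `model_end_requests()` are hypothetical, a singly linked list replaces `struct list_head`, and setting `ended` stands in for the call to fuse_request_end().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request holding only the fields the drain loop touches. */
struct model_req {
	int error;		/* models req->out.h.error */
	int sent;		/* models the FR_SENT flag bit */
	int ended;		/* set where fuse_request_end() would run */
	struct model_req *next;
};

/* Pop requests off the head until the list is empty; return the count. */
size_t model_end_requests(struct model_req **head)
{
	size_t n = 0;

	assert(head != NULL);
	while (*head != NULL) {
		struct model_req *req = *head;

		req->error = -ECONNABORTED;
		req->sent = 0;
		/* Unlink before completing, as list_del_init() does above. */
		*head = req->next;
		req->next = NULL;
		req->ended = 1;
		n++;
	}
	return n;
}
```

Unlinking each request before completing it means the completion step never sees a request that is still reachable from the list being drained.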
2011-03-02 03:43:52 +03:00
static void end_polls ( struct fuse_conn * fc )
{
struct rb_node * p ;
p = rb_first ( & fc - > polled_files ) ;
while ( p ) {
struct fuse_file * ff ;
ff = rb_entry ( p , struct fuse_file , polled_node ) ;
wake_up_interruptible_all ( & ff - > poll_wait ) ;
p = rb_next ( p ) ;
}
}
2006-01-17 09:14:41 +03:00
/*
* Abort all requests.
*
2015-07-01 17:25:59 +03:00
* Emergency exit in case of a malicious or accidental deadlock, or just a hung
* filesystem.
*
* The same effect is usually achievable through killing the filesystem daemon
* and all users of the filesystem. The exception is the combination of an
* asynchronous request and the tricky deadlock (see
2020-04-14 19:48:35 +03:00
* Documentation/filesystems/fuse.rst).
2006-01-17 09:14:41 +03:00
*
2015-07-01 17:25:59 +03:00
* Aborting requests under I/O goes as follows: 1: Separate out unlocked
* requests, they should be finished off immediately. Locked requests will be
* finished after unlock; see unlock_request(). 2: Finish off the unlocked
* requests. It is possible that some request will finish before we can. This
* is OK, the request will in that case be removed from the list before we touch
* it.
2006-01-17 09:14:41 +03:00
*/
2019-01-24 12:40:16 +03:00
void fuse_abort_conn ( struct fuse_conn * fc )
2006-01-17 09:14:41 +03:00
{
2015-07-01 17:26:01 +03:00
struct fuse_iqueue * fiq = & fc - > iq ;
2006-04-11 09:54:55 +04:00
spin_lock ( & fc - > lock ) ;
2006-01-17 09:14:41 +03:00
if ( fc - > connected ) {
2015-07-01 17:26:09 +03:00
struct fuse_dev * fud ;
2015-07-01 17:25:59 +03:00
struct fuse_req * req , * next ;
2018-07-26 17:13:12 +03:00
LIST_HEAD ( to_end ) ;
2018-09-11 13:12:14 +03:00
unsigned int i ;
2015-07-01 17:25:59 +03:00
2018-08-27 18:29:56 +03:00
/* Background queuing checks fc->connected under bg_lock */
spin_lock ( & fc - > bg_lock ) ;
2006-01-17 09:14:41 +03:00
fc - > connected = 0 ;
2018-08-27 18:29:56 +03:00
spin_unlock ( & fc - > bg_lock ) ;
2015-01-06 12:45:35 +03:00
fuse_set_initialized ( fc ) ;
2015-07-01 17:26:09 +03:00
list_for_each_entry ( fud , & fc - > devices , entry ) {
struct fuse_pqueue * fpq = & fud - > pq ;
spin_lock ( & fpq - > lock ) ;
fpq - > connected = 0 ;
list_for_each_entry_safe ( req , next , & fpq - > io , list ) {
req - > out . h . error = - ECONNABORTED ;
spin_lock ( & req - > waitq . lock ) ;
set_bit ( FR_ABORTED , & req - > flags ) ;
if ( ! test_bit ( FR_LOCKED , & req - > flags ) ) {
set_bit ( FR_PRIVATE , & req - > flags ) ;
2018-07-26 17:13:11 +03:00
__fuse_get_request ( req ) ;
2018-07-26 17:13:12 +03:00
list_move ( & req - > list , & to_end ) ;
2015-07-01 17:26:09 +03:00
}
spin_unlock ( & req - > waitq . lock ) ;
2015-07-01 17:26:06 +03:00
}
2018-09-11 13:12:14 +03:00
for ( i = 0 ; i < FUSE_PQ_HASH_SIZE ; i + + )
list_splice_tail_init ( & fpq - > processing [ i ] ,
& to_end ) ;
2015-07-01 17:26:09 +03:00
spin_unlock ( & fpq - > lock ) ;
2015-07-01 17:25:59 +03:00
}
2018-08-27 18:29:46 +03:00
spin_lock ( & fc - > bg_lock ) ;
fc - > blocked = 0 ;
2015-07-01 17:25:59 +03:00
fc - > max_background = UINT_MAX ;
flush_bg_queue ( fc ) ;
2018-08-27 18:29:46 +03:00
spin_unlock ( & fc - > bg_lock ) ;
2015-07-01 17:26:02 +03:00
2019-09-09 06:15:18 +03:00
spin_lock ( & fiq - > lock ) ;
2015-07-01 17:26:02 +03:00
fiq - > connected = 0 ;
2018-07-26 17:13:12 +03:00
list_for_each_entry ( req , & fiq - > pending , list )
2017-01-12 23:04:04 +03:00
clear_bit ( FR_PENDING , & req - > flags ) ;
2018-07-26 17:13:12 +03:00
list_splice_tail_init ( & fiq - > pending , & to_end ) ;
2015-07-01 17:26:02 +03:00
while ( forget_pending ( fiq ) )
2019-06-05 22:50:43 +03:00
kfree ( fuse_dequeue_forget ( fiq , 1 , NULL ) ) ;
2019-09-09 06:15:18 +03:00
wake_up_all ( & fiq - > waitq ) ;
spin_unlock ( & fiq - > lock ) ;
2015-07-01 17:26:02 +03:00
kill_fasync ( & fiq - > fasync , SIGIO , POLL_IN ) ;
2015-07-01 17:26:08 +03:00
end_polls ( fc ) ;
wake_up_all ( & fc - > blocked_waitq ) ;
spin_unlock ( & fc - > lock ) ;
2015-07-01 17:26:02 +03:00
2020-04-20 18:59:34 +03:00
end_requests ( & to_end ) ;
2015-07-01 17:26:08 +03:00
} else {
spin_unlock ( & fc - > lock ) ;
2006-01-17 09:14:41 +03:00
}
}
2009-04-14 05:54:53 +04:00
EXPORT_SYMBOL_GPL ( fuse_abort_conn ) ;
2006-01-17 09:14:41 +03:00
2018-07-26 17:13:11 +03:00
void fuse_wait_aborted ( struct fuse_conn * fc )
{
2018-11-09 17:52:16 +03:00
/* matches implicit memory barrier in fuse_drop_waiting() */
smp_mb ( ) ;
2018-07-26 17:13:11 +03:00
wait_event ( fc - > blocked_waitq , atomic_read ( & fc - > num_waiting ) = = 0 ) ;
}
2009-04-14 05:54:53 +04:00
int fuse_dev_release ( struct inode * inode , struct file * file )
2005-09-10 00:10:27 +04:00
{
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud = fuse_get_dev ( file ) ;
if ( fud ) {
struct fuse_conn * fc = fud - > fc ;
2015-07-01 17:26:09 +03:00
struct fuse_pqueue * fpq = & fud - > pq ;
2018-07-26 17:13:11 +03:00
LIST_HEAD ( to_end ) ;
2018-09-11 13:12:14 +03:00
unsigned int i ;
2015-07-01 17:26:09 +03:00
2018-07-26 17:13:11 +03:00
spin_lock ( & fpq - > lock ) ;
2015-07-01 17:26:09 +03:00
WARN_ON ( ! list_empty ( & fpq - > io ) ) ;
2018-09-11 13:12:14 +03:00
for ( i = 0 ; i < FUSE_PQ_HASH_SIZE ; i + + )
list_splice_init ( & fpq - > processing [ i ] , & to_end ) ;
2018-07-26 17:13:11 +03:00
spin_unlock ( & fpq - > lock ) ;
2020-04-20 18:59:34 +03:00
end_requests ( & to_end ) ;
2018-07-26 17:13:11 +03:00
2015-07-01 17:26:09 +03:00
/* Are we the last open device? */
if ( atomic_dec_and_test ( & fc - > dev_count ) ) {
WARN_ON ( fc - > iq . fasync ! = NULL ) ;
2019-01-24 12:40:16 +03:00
fuse_abort_conn ( fc ) ;
2015-07-01 17:26:09 +03:00
}
2015-07-01 17:26:08 +03:00
fuse_dev_free ( fud ) ;
2006-04-11 09:54:52 +04:00
}
2005-09-10 00:10:27 +04:00
return 0 ;
}
2009-04-14 05:54:53 +04:00
EXPORT_SYMBOL_GPL ( fuse_dev_release ) ;
2005-09-10 00:10:27 +04:00
2006-04-11 09:54:52 +04:00
static int fuse_dev_fasync ( int fd , struct file * file , int on )
{
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud = fuse_get_dev ( file ) ;
if ( ! fud )
2006-04-11 09:54:56 +04:00
return - EPERM ;
2006-04-11 09:54:52 +04:00
/* No locking - fasync_helper does its own locking */
2015-07-01 17:26:08 +03:00
return fasync_helper ( fd , file , on , & fud - > fc - > iq . fasync ) ;
2006-04-11 09:54:52 +04:00
}
2015-07-01 17:26:08 +03:00
static int fuse_device_clone ( struct fuse_conn * fc , struct file * new )
{
2015-07-01 17:26:08 +03:00
struct fuse_dev * fud ;
2015-07-01 17:26:08 +03:00
if ( new - > private_data )
return - EINVAL ;
2019-03-07 00:51:40 +03:00
fud = fuse_dev_alloc_install ( fc ) ;
2015-07-01 17:26:08 +03:00
if ( ! fud )
return - ENOMEM ;
new - > private_data = fud ;
2015-07-01 17:26:09 +03:00
atomic_inc ( & fc - > dev_count ) ;
2015-07-01 17:26:08 +03:00
return 0 ;
}
static long fuse_dev_ioctl ( struct file * file , unsigned int cmd ,
unsigned long arg )
{
2021-01-25 18:30:51 +03:00
int res ;
int oldfd ;
struct fuse_dev * fud = NULL ;
2015-07-01 17:26:08 +03:00
2021-03-19 18:05:14 +03:00
switch ( cmd ) {
case FUSE_DEV_IOC_CLONE :
2021-01-25 18:30:51 +03:00
res = - EFAULT ;
if ( ! get_user ( oldfd , ( __u32 __user * ) arg ) ) {
2015-07-01 17:26:08 +03:00
struct file * old = fget ( oldfd ) ;
2021-01-25 18:30:51 +03:00
res = - EINVAL ;
2015-07-01 17:26:08 +03:00
if ( old ) {
2015-08-16 21:27:01 +03:00
/*
* Check against file->f_op because CUSE
* uses the same ioctl handler.
*/
fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE
Commit 8ed1f0e22f49e ("fs/fuse: fix ioctl type confusion") fixed a type
confusion bug by adding an ->f_op comparison.
Based on some off-list discussion back then, another check was added to
compare the f_cred->user_ns. This is not for security reasons, but was
based on the idea that a FUSE device FD should be using the UID/GID
mappings of its f_cred->user_ns, and those translations are done using
fc->user_ns, which matches the f_cred->user_ns of the initial FUSE device
FD thanks to the check in fuse_fill_super(). See also commit 8cb08329b0809
("fuse: Support fuse filesystems outside of init_user_ns").
But FUSE_DEV_IOC_CLONE is, at a higher level, a *cloning* operation that
copies an existing context (with a weird API that involves first opening
/dev/fuse, then tying the resulting new FUSE device FD to an existing FUSE
instance). So if an application is already passing FUSE FDs across userns
boundaries and dealing with the resulting ID mapping complications somehow,
it doesn't make much sense to block this cloning operation.
I've heard that this check is an obstacle for some folks, and I don't see a
good reason to keep it, so remove it.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-14 17:26:32 +03:00
if ( old - > f_op = = file - > f_op )
2015-08-16 21:27:01 +03:00
fud = fuse_get_dev ( old ) ;
2015-07-01 17:26:08 +03:00
2015-07-01 17:26:08 +03:00
if ( fud ) {
2015-07-01 17:26:08 +03:00
mutex_lock ( & fuse_mutex ) ;
2021-01-25 18:30:51 +03:00
res = fuse_device_clone ( fud - > fc , file ) ;
2015-07-01 17:26:08 +03:00
mutex_unlock ( & fuse_mutex ) ;
}
fput ( old ) ;
}
}
2021-01-25 18:30:51 +03:00
break ;
default :
res = - ENOTTY ;
break ;
2015-07-01 17:26:08 +03:00
}
2021-01-25 18:30:51 +03:00
return res ;
2015-07-01 17:26:08 +03:00
}
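From userspace, the clone ioctl handled above would be driven roughly as follows. This is a hedged sketch: `clone_fuse_dev()` and `MODEL_FUSE_DEV_IOC_CLONE` are hypothetical names, and the request number is an assumption meant to mirror the FUSE_DEV_IOC_CLONE definition in <linux/fuse.h>; real code should include that header instead. The handler reads the existing session descriptor through a pointer to a 32-bit value, so the argument is passed by address.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumption: mirrors FUSE_DEV_IOC_CLONE from <linux/fuse.h>. */
#define MODEL_FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)

/*
 * Open a fresh /dev/fuse descriptor and attach it to the connection
 * behind session_fd. Returns the new descriptor, or -1 with errno set.
 */
int clone_fuse_dev(int session_fd)
{
	uint32_t oldfd;
	int newfd;

	if (session_fd < 0) {
		errno = EBADF;
		return -1;
	}
	oldfd = (uint32_t)session_fd;

	newfd = open("/dev/fuse", O_RDWR);
	if (newfd < 0)
		return -1;

	/* The ioctl is issued on the NEW descriptor and names the OLD one. */
	if (ioctl(newfd, MODEL_FUSE_DEV_IOC_CLONE, &oldfd) < 0) {
		int saved = errno;

		close(newfd);
		errno = saved;
		return -1;
	}
	return newfd;
}
```

A multi-threaded filesystem daemon might call this once per worker thread so that each worker reads requests from its own descriptor.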
2006-03-28 13:56:42 +04:00
const struct file_operations fuse_dev_operations = {
2005-09-10 00:10:27 +04:00
. owner = THIS_MODULE ,
2015-01-12 07:22:16 +03:00
. open = fuse_dev_open ,
2005-09-10 00:10:27 +04:00
. llseek = no_llseek ,
2015-04-04 04:53:39 +03:00
. read_iter = fuse_dev_read ,
2010-05-25 17:06:07 +04:00
. splice_read = fuse_dev_splice_read ,
2015-04-04 04:53:39 +03:00
. write_iter = fuse_dev_write ,
2010-05-25 17:06:06 +04:00
. splice_write = fuse_dev_splice_write ,
2005-09-10 00:10:27 +04:00
. poll = fuse_dev_poll ,
. release = fuse_dev_release ,
2006-04-11 09:54:52 +04:00
. fasync = fuse_dev_fasync ,
2015-07-01 17:26:08 +03:00
. unlocked_ioctl = fuse_dev_ioctl ,
2018-09-11 22:59:08 +03:00
. compat_ioctl = compat_ptr_ioctl ,
2005-09-10 00:10:27 +04:00
} ;
2009-04-14 05:54:53 +04:00
EXPORT_SYMBOL_GPL ( fuse_dev_operations ) ;
2005-09-10 00:10:27 +04:00
static struct miscdevice fuse_miscdevice = {
	.minor = FUSE_MINOR,
	.name = "fuse",
	.fops = &fuse_dev_operations,
};
int __init fuse_dev_init(void)
{
	int err = -ENOMEM;
	fuse_req_cachep = kmem_cache_create("fuse_request",
					    sizeof(struct fuse_req),
2007-07-20 05:11:58 +04:00
					    0, 0, NULL);
2005-09-10 00:10:27 +04:00
	if (!fuse_req_cachep)
		goto out;
	err = misc_register(&fuse_miscdevice);
	if (err)
		goto out_cache_clean;
	return 0;
out_cache_clean:
	kmem_cache_destroy(fuse_req_cachep);
out:
	return err;
}
void fuse_dev_cleanup(void)
{
	misc_deregister(&fuse_miscdevice);
	kmem_cache_destroy(fuse_req_cachep);
}