2005-09-09 13:10:30 -07:00
/*
  FUSE: Filesystem in Userspace
2008-11-26 12:03:54 +01:00
  Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
2005-09-09 13:10:30 -07:00
  This program can be distributed under the terms of the GNU GPL.
  See the file COPYING.
*/
#include "fuse_i.h"
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/kernel.h>
Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 01:22:52 +04:00
#include <linux/sched.h>
2017-09-26 12:45:33 -05:00
#include <linux/sched/signal.h>
2009-04-14 10:54:53 +09:00
#include <linux/module.h>
2011-07-25 22:35:35 +02:00
#include <linux/swap.h>
2013-05-17 09:30:32 -04:00
#include <linux/falloc.h>
2015-02-22 08:58:50 -08:00
#include <linux/uio.h>
2020-07-14 19:26:39 +09:00
#include <linux/fs.h>
2022-11-20 09:15:34 -05:00
#include <linux/filelock.h>
fuse: in fuse_flush only wait if someone wants the return code
If a fuse filesystem is mounted inside a container, there is a problem
during pid namespace destruction. The scenario is:
1. task (a thread in the fuse server, with a fuse file open) starts
exiting, does exit_signals(), goes into fuse_flush() -> wait
2. fuse daemon gets killed, tries to wake everyone up
3. task from 1 is stuck because complete_signal() doesn't wake it up, since
it has PF_EXITING.
The result is that the thread will never be woken up, and pid namespace
destruction will block indefinitely.
To add insult to injury, nobody is waiting for these return codes, since
the pid namespace is being destroyed.
To fix this, let's not block on flush operations when the current task has
PF_EXITING.
This does change the semantics slightly: the wait here is for posix locks
to be unlocked, so the task will exit before things are unlocked. To quote
Miklos:
"remote" posix locks are almost never used due to problems like this, so
I think it's safe to do this.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
Tested-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:10:58 +01:00
#include <linux/file.h>
2005-09-09 13:10:30 -07:00
2021-04-07 14:36:45 +02:00
static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
			  unsigned int open_flags, int opcode,
			  struct fuse_open_out *outargp)
2005-09-09 13:10:30 -07:00
{
	struct fuse_open_in inarg;
2014-12-12 09:49:05 +01:00
	FUSE_ARGS(args);
2005-11-07 00:59:51 -08:00
	memset(&inarg, 0, sizeof(inarg));
2021-04-07 14:36:45 +02:00
	inarg.flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
2020-05-06 17:44:12 +02:00
	if (!fm->fc->atomic_o_trunc)
2007-10-18 03:07:02 -07:00
		inarg.flags &= ~O_TRUNC;
2020-10-09 14:15:11 -04:00
	if (fm->fc->handle_killpriv_v2 &&
	    (inarg.flags & O_TRUNC) && !capable(CAP_FSETID)) {
		inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID;
	}
2019-09-10 15:04:08 +02:00
	args.opcode = opcode;
	args.nodeid = nodeid;
	args.in_numargs = 1;
	args.in_args[0].size = sizeof(inarg);
	args.in_args[0].value = &inarg;
	args.out_numargs = 1;
	args.out_args[0].size = sizeof(*outargp);
	args.out_args[0].value = outargp;
2005-11-07 00:59:51 -08:00
2020-05-06 17:44:12 +02:00
	return fuse_simple_request(fm, &args);
2005-11-07 00:59:51 -08:00
}
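The flag filtering in fuse_send_open() can be illustrated outside the kernel. The following is a minimal userspace sketch, not kernel code: sketch_open_request_flags() is a hypothetical name, and the atomic_o_trunc parameter stands in for the connection field of the same name. It only models the two masking steps, not the FUSE_OPEN_KILL_SUIDGID handling.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the flag filtering done before an
 * OPEN request is sent. Not a kernel function.
 */
static unsigned int sketch_open_request_flags(unsigned int open_flags,
					      bool atomic_o_trunc)
{
	/* Creation-time flags are stripped from the request. */
	unsigned int flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);

	/* O_TRUNC is only forwarded when the server truncates atomically. */
	if (!atomic_o_trunc)
		flags &= ~O_TRUNC;

	return flags;
}
```

Under these assumptions, passing O_WRONLY | O_CREAT | O_TRUNC with atomic_o_trunc unset leaves only O_WRONLY in the request, while with atomic_o_trunc set the O_TRUNC bit survives.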
2019-09-10 15:04:10 +02:00
struct fuse_release_args {
	struct fuse_args args;
	struct fuse_release_in inarg;
	struct inode *inode;
};
2020-05-06 17:44:12 +02:00
struct fuse_file *fuse_file_alloc(struct fuse_mount *fm)
2005-11-07 00:59:51 -08:00
{
	struct fuse_file *ff;
2009-04-14 10:54:49 +09:00
2019-09-17 12:35:33 -07:00
	ff = kzalloc(sizeof(struct fuse_file), GFP_KERNEL_ACCOUNT);
2009-04-14 10:54:49 +09:00
	if (unlikely(!ff))
		return NULL;
2020-05-06 17:44:12 +02:00
	ff->fm = fm;
2019-09-17 12:35:33 -07:00
	ff->release_args = kzalloc(sizeof(*ff->release_args),
				   GFP_KERNEL_ACCOUNT);
2019-09-10 15:04:10 +02:00
	if (!ff->release_args) {
2009-04-14 10:54:49 +09:00
		kfree(ff);
		return NULL;
2005-11-07 00:59:51 -08:00
	}
2009-04-14 10:54:49 +09:00
	INIT_LIST_HEAD(&ff->write_entry);
2018-10-01 10:07:04 +02:00
	mutex_init(&ff->readdir.lock);
2017-03-03 11:04:03 +02:00
	refcount_set(&ff->count, 1);
2009-04-14 10:54:49 +09:00
	RB_CLEAR_NODE(&ff->polled_node);
	init_waitqueue_head(&ff->poll_wait);
2020-05-06 17:44:12 +02:00
	ff->kh = atomic64_inc_return(&fm->fc->khctr);
2009-04-14 10:54:49 +09:00
2005-11-07 00:59:51 -08:00
	return ff;
}
void fuse_file_free(struct fuse_file *ff)
{
2019-09-10 15:04:10 +02:00
	kfree(ff->release_args);
2018-10-01 10:07:04 +02:00
	mutex_destroy(&ff->readdir.lock);
2005-11-07 00:59:51 -08:00
	kfree(ff);
}
2017-02-22 20:08:25 +01:00
static struct fuse_file *fuse_file_get(struct fuse_file *ff)
2007-10-16 23:31:00 -07:00
{
2017-03-03 11:04:03 +02:00
	refcount_inc(&ff->count);
2007-10-16 23:31:00 -07:00
	return ff;
}
2020-05-06 17:44:12 +02:00
static void fuse_release_end(struct fuse_mount *fm, struct fuse_args *args,
2019-09-10 15:04:10 +02:00
			     int error)
2007-10-16 23:31:04 -07:00
{
2019-09-10 15:04:10 +02:00
	struct fuse_release_args *ra = container_of(args, typeof(*ra), args);
	iput(ra->inode);
	kfree(ra);
2007-10-16 23:31:04 -07:00
}
2018-12-10 10:54:52 -08:00
static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
2007-10-16 23:31:00 -07:00
{
2017-03-03 11:04:03 +02:00
	if (refcount_dec_and_test(&ff->count)) {
2019-09-10 15:04:10 +02:00
		struct fuse_args *args = &ff->release_args->args;
2009-04-28 16:56:39 +02:00
2020-05-06 17:44:12 +02:00
		if (isdir ? ff->fm->fc->no_opendir : ff->fm->fc->no_open) {
2019-09-10 15:04:10 +02:00
			/* Do nothing when client does not implement 'open' */
2020-05-06 17:44:12 +02:00
			fuse_release_end(ff->fm, args, 0);
2013-11-05 16:05:52 +01:00
		} else if (sync) {
2020-05-06 17:44:12 +02:00
			fuse_simple_request(ff->fm, args);
			fuse_release_end(ff->fm, args, 0);
2011-02-25 14:44:58 +01:00
		} else {
2019-09-10 15:04:10 +02:00
			args->end = fuse_release_end;
2020-05-06 17:44:12 +02:00
			if (fuse_simple_background(ff->fm, args,
2019-09-10 15:04:10 +02:00
						   GFP_KERNEL | __GFP_NOFAIL))
2020-05-06 17:44:12 +02:00
				fuse_release_end(ff->fm, args, -ENOTCONN);
2011-02-25 14:44:58 +01:00
		}
2007-10-16 23:31:00 -07:00
		kfree(ff);
	}
}
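The get/put pairing used by fuse_file_get() and fuse_file_put() follows the usual reference counting pattern: the object starts with one reference, and whoever drops the last one performs the release. The sketch below models only that pattern in userspace with C11 atomics. It is an illustration under stated assumptions, not the kernel's refcount_t: every sketch_* name is hypothetical, and the release step is reduced to free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted file object. */
struct sketch_file {
	atomic_int count;
};

/* Allocate an object holding one initial reference. */
static struct sketch_file *sketch_file_alloc(void)
{
	struct sketch_file *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	atomic_init(&f->count, 1);
	return f;
}

/* Take an additional reference. */
static struct sketch_file *sketch_file_get(struct sketch_file *f)
{
	atomic_fetch_add(&f->count, 1);
	return f;
}

/* Drop a reference; returns true if this call freed the object. */
static bool sketch_file_put(struct sketch_file *f)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&f->count, 1) != 1)
		return false;
	free(f);
	return true;
}

/* Take extra_refs references, then count the puts needed to free. */
static int sketch_puts_until_release(int extra_refs)
{
	struct sketch_file *f = sketch_file_alloc();
	int puts = 0;
	int i;

	if (!f)
		return -1;
	for (i = 0; i < extra_refs; i++)
		sketch_file_get(f);
	for (;;) {
		puts++;
		if (sketch_file_put(f))
			return puts;
	}
}
```

In this model an object with two extra references needs three puts before it is freed, which mirrors why only the final fuse_file_put() sends the release request.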
2021-04-07 14:36:45 +02:00
struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
				 unsigned int open_flags, bool isdir)
2009-04-28 16:56:37 +02:00
{
2020-05-06 17:44:12 +02:00
	struct fuse_conn *fc = fm->fc;
2009-04-28 16:56:37 +02:00
	struct fuse_file *ff;
	int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN;
2020-05-06 17:44:12 +02:00
	ff = fuse_file_alloc(fm);
2009-04-28 16:56:37 +02:00
	if (!ff)
2021-04-07 14:36:45 +02:00
		return ERR_PTR(-ENOMEM);
2009-04-28 16:56:37 +02:00
2013-11-05 16:05:52 +01:00
	ff->fh = 0;
2019-01-28 16:34:34 -08:00
	/* Default for no-open */
	ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
2019-01-07 16:53:17 -08:00
	if (isdir ? !fc->no_opendir : !fc->no_open) {
2013-11-05 16:05:52 +01:00
		struct fuse_open_out outarg;
		int err;
2021-04-07 14:36:45 +02:00
		err = fuse_send_open(fm, nodeid, open_flags, opcode, &outarg);
2013-11-05 16:05:52 +01:00
		if (!err) {
			ff->fh = outarg.fh;
			ff->open_flags = outarg.open_flags;
2019-01-07 16:53:17 -08:00
		} else if (err != -ENOSYS) {
2013-11-05 16:05:52 +01:00
			fuse_file_free(ff);
2021-04-07 14:36:45 +02:00
			return ERR_PTR(err);
2013-11-05 16:05:52 +01:00
		} else {
2019-01-07 16:53:17 -08:00
			if (isdir)
				fc->no_opendir = 1;
			else
				fc->no_open = 1;
2013-11-05 16:05:52 +01:00
		}
2009-04-28 16:56:37 +02:00
	}
	if (isdir)
2013-11-05 16:05:52 +01:00
		ff->open_flags &= ~FOPEN_DIRECT_IO;
2009-04-28 16:56:37 +02:00
	ff->nodeid = nodeid;
2021-04-07 14:36:45 +02:00
	return ff;
}
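fuse_file_open() remembers a -ENOSYS reply in fc->no_open (or fc->no_opendir) so that later opens skip the request entirely. The following userspace sketch models just that caching step. It is an assumption-laden illustration: struct sketch_conn and all sketch_* functions are hypothetical, and the simulated server either always implements OPEN or never does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a connection; not struct fuse_conn. */
struct sketch_conn {
	bool server_has_open;	/* does the simulated server implement OPEN? */
	bool no_open;		/* cached "server answered -ENOSYS" */
	int requests_sent;	/* OPEN requests that reached the server */
};

/* Simulated round trip to the server. */
static int sketch_send_open(struct sketch_conn *c)
{
	c->requests_sent++;
	return c->server_has_open ? 0 : -ENOSYS;
}

/* Open with the ENOSYS result cached on the connection. */
static int sketch_open(struct sketch_conn *c)
{
	int err;

	if (c->no_open)
		return 0;	/* fall back to defaults, no request sent */

	err = sketch_send_open(c);
	if (err == -ENOSYS) {
		c->no_open = true;
		return 0;
	}
	return err;
}

/* Perform a number of opens and report how many requests were sent. */
static int sketch_requests_for_opens(bool server_has_open, int opens)
{
	struct sketch_conn c = { .server_has_open = server_has_open };
	int i;

	for (i = 0; i < opens; i++) {
		if (sketch_open(&c) != 0)
			return -1;
	}
	return c.requests_sent;
}
```

With a server that lacks OPEN, five opens cost a single request in this model; with a server that implements it, each open costs one.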
int fuse_do_open(struct fuse_mount *fm, u64 nodeid, struct file *file,
		 bool isdir)
{
	struct fuse_file *ff = fuse_file_open(fm, nodeid, file->f_flags, isdir);
	if (!IS_ERR(ff))
		file->private_data = ff;
	return PTR_ERR_OR_ZERO(ff);
2009-04-28 16:56:37 +02:00
}
2009-04-14 10:54:53 +09:00
EXPORT_SYMBOL_GPL(fuse_do_open);
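fuse_do_open() relies on the kernel convention of encoding a negative errno in a pointer value (ERR_PTR, IS_ERR, PTR_ERR_OR_ZERO). A rough userspace model of that convention is sketched below. The sketch_* helpers are hypothetical re-implementations for illustration, not the kernel's definitions, and they assume a platform where small negative integers round-trip through pointer casts and no valid object lives in the top 4095 bytes of the address space.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr falls in the range reserved for encoded errors. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Error code for an error pointer, 0 for a valid pointer. */
static inline long sketch_ptr_err_or_zero(const void *ptr)
{
	return sketch_is_err(ptr) ? sketch_ptr_err(ptr) : 0;
}

static int sketch_object;

/* Hypothetical opener: returns an object or an encoded -ENOMEM. */
static void *sketch_open(bool fail)
{
	return fail ? sketch_err_ptr(-ENOMEM) : (void *)&sketch_object;
}
```

A caller can then return sketch_ptr_err_or_zero(p) directly, which is the same shape as the last line of fuse_do_open().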
2009-04-28 16:56:37 +02:00
2013-10-10 17:10:04 +04:00
static void fuse_link_write_file(struct file *file)
{
	struct inode *inode = file_inode(file);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_file *ff = file->private_data;
	/*
	 * file may be written through mmap, so chain it onto the
	 * inode's write_file list
	 */
2018-11-09 13:33:22 +03:00
	spin_lock(&fi->lock);
2013-10-10 17:10:04 +04:00
	if (list_empty(&ff->write_entry))
		list_add(&ff->write_entry, &fi->write_files);
2018-11-09 13:33:22 +03:00
	spin_unlock(&fi->lock);
2013-10-10 17:10:04 +04:00
}
2009-04-28 16:56:37 +02:00
void fuse_finish_open(struct inode *inode, struct file *file)
2005-11-07 00:59:51 -08:00
{
2009-04-28 16:56:37 +02:00
	struct fuse_file *ff = file->private_data;
2010-11-24 12:57:00 -08:00
	struct fuse_conn *fc = get_fuse_conn(inode);
2009-04-28 16:56:37 +02:00
2019-04-24 07:13:57 +00:00
	if (ff->open_flags & FOPEN_STREAM)
		stream_open(inode, file);
	else if (ff->open_flags & FOPEN_NONSEEKABLE)
2008-10-16 16:08:57 +02:00
		nonseekable_open(inode, file);
2021-08-17 21:05:16 +02:00
2010-11-24 12:57:00 -08:00
	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
		struct fuse_inode *fi = get_fuse_inode(inode);
2018-11-09 13:33:22 +03:00
		spin_lock(&fi->lock);
2018-11-09 13:33:17 +03:00
		fi->attr_version = atomic64_inc_return(&fc->attr_version);
2010-11-24 12:57:00 -08:00
		i_size_write(inode, 0);
2018-11-09 13:33:22 +03:00
		spin_unlock(&fi->lock);
2021-10-22 17:03:03 +02:00
		file_update_time(file);
2021-10-22 17:03:02 +02:00
		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
2010-11-24 12:57:00 -08:00
	}
2013-10-10 17:12:18 +04:00
	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
		fuse_link_write_file(file);
2005-11-07 00:59:51 -08:00
}
2009-04-28 16:56:37 +02:00
int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
2005-11-07 00:59:51 -08:00
{
2020-05-06 17:44:12 +02:00
	struct fuse_mount *fm = get_fuse_mount(inode);
	struct fuse_conn *fc = fm->fc;
2005-09-09 13:10:30 -07:00
	int err;
2019-10-23 14:26:37 +02:00
	bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
2014-04-28 14:19:22 +02:00
			      fc->atomic_o_trunc &&
			      fc->writeback_cache;
virtiofs: serialize truncate/punch_hole and dax fault path
Currently in fuse we don't seem to have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest later (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exclusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-19 18:19:54 -04:00
	bool dax_truncate = (file->f_flags & O_TRUNC) &&
			    fc->atomic_o_trunc && FUSE_IS_DAX(inode);
2005-09-09 13:10:30 -07:00
2020-12-10 15:33:14 +01:00
	if (fuse_is_bad(inode))
		return -EIO;
2005-09-09 13:10:30 -07:00
	err = generic_file_open(inode, file);
	if (err)
		return err;
fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which will cause the xfstests
generic/464 testcase hung in our virtio-fs test environment.
For example, consider two processes concurrently open one same file, one
with O_TRUNC and another without O_TRUNC. The deadlock case is described
below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying
to lock a page (acquiring B), open() could have held the page lock
(acquired B), and waiting on the page writeback (acquiring A). This would
lead to deadlocks.
open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
inode_lock [C acquire]
fuse_set_nowrite [A acquire]
fuse_finish_open
truncate_pagecache
lock_page [B acquire]
truncate_inode_page
unlock_page [B release]
fuse_release_nowrite [A release]
inode_unlock [C release]
----------------------------------------------------------------
open()
----------------------------------------------------------------
fuse_open_common
fuse_finish_open
invalidate_inode_pages2
lock_page [B acquire]
fuse_launder_page
fuse_wait_on_page_writeback [A acquire & release]
unlock_page [B release]
----------------------------------------------------------------
Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).
Fix by moving the truncate_pagecache() call outside the nowrite protected
region. The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server. Write syscalls racing with
page cache truncation still get the inode lock protection.
This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-22 15:48:53 +02:00
	if (is_wb_truncate || dax_truncate)
2016-01-22 15:40:57 -05:00
		inode_lock(inode);
2014-04-28 14:19:22 +02:00
2020-08-19 18:19:54 -04:00
	if (dax_truncate) {
2021-04-21 17:18:39 +02:00
		filemap_invalidate_lock(inode->i_mapping);
2020-08-19 18:19:54 -04:00
		err = fuse_dax_break_layouts(inode, 0, 0);
		if (err)
2022-04-22 15:48:53 +02:00
			goto out_inode_unlock;
2020-08-19 18:19:54 -04:00
	}
2005-09-09 13:10:30 -07:00
2022-04-22 15:48:53 +02:00
	if (is_wb_truncate || dax_truncate)
		fuse_set_nowrite(inode);
2020-05-06 17:44:12 +02:00
	err = fuse_do_open(fm, get_node_id(inode), file, isdir);
2014-04-28 14:19:22 +02:00
	if (!err)
		fuse_finish_open(inode, file);
2009-04-28 16:56:37 +02:00
fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which causes the xfstests
generic/464 test case to hang in our virtio-fs test environment.
For example, consider two processes concurrently opening the same file, one
with O_TRUNC and the other without. The deadlock arises as follows: if
open(O_TRUNC) has already called fuse_set_nowrite() (acquired A) and is trying
to lock a page (acquiring B), open() may already hold the page lock
(acquired B) while waiting on the page writeback (acquiring A). This
leads to a deadlock.
open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
  inode_lock [C acquire]
  fuse_set_nowrite [A acquire]
  fuse_finish_open
    truncate_pagecache
      lock_page [B acquire]
      truncate_inode_page
      unlock_page [B release]
  fuse_release_nowrite [A release]
  inode_unlock [C release]
----------------------------------------------------------------
open()
----------------------------------------------------------------
fuse_open_common
  fuse_finish_open
    invalidate_inode_pages2
      lock_page [B acquire]
        fuse_launder_page
          fuse_wait_on_page_writeback [A acquire & release]
      unlock_page [B release]
----------------------------------------------------------------
Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).
Fix by moving the truncate_pagecache() call outside the nowrite protected
region. The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server. Write syscalls racing with
page cache truncation still get the inode lock protection.
This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-22 15:48:53 +02:00
if ( is_wb_truncate | | dax_truncate )
fuse_release_nowrite ( inode ) ;
if ( ! err ) {
struct fuse_file * ff = file - > private_data ;
if ( fc - > atomic_o_trunc & & ( file - > f_flags & O_TRUNC ) )
truncate_pagecache ( inode , 0 ) ;
else if ( ! ( ff - > open_flags & FOPEN_KEEP_CACHE ) )
invalidate_inode_pages2 ( inode - > i_mapping ) ;
}
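The lock-ordering argument in the commit message above can be checked mechanically. The following is a minimal userspace sketch, not kernel code: it models each call chain as a set of "acquired while held" edges and looks for a pairwise inversion. The names lock_edge, has_abba_inversion, trunc_before, trunc_after and plain_open are hypothetical, and the edges are transcribed from the call chains in the commit message (A = nowrite, B = page lock, C = inode lock).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One nesting edge: `inner` is acquired while `outer` is still held. */
struct lock_edge {
	char outer;
	char inner;
};

/*
 * Return true if the two paths contain a pairwise inversion, i.e. one
 * path takes X then Y while the other takes Y then X. Only two-lock
 * cycles are detected; longer cycles would need a real graph search.
 */
static bool has_abba_inversion(const struct lock_edge *p1, size_t n1,
			       const struct lock_edge *p2, size_t n2)
{
	assert(p1 != NULL && p2 != NULL);
	for (size_t i = 0; i < n1; i++)
		for (size_t j = 0; j < n2; j++)
			if (p1[i].outer == p2[j].inner &&
			    p1[i].inner == p2[j].outer)
				return true;
	return false;
}

/* open(O_TRUNC) before the fix: page lock B is taken under nowrite A. */
static const struct lock_edge trunc_before[] = {
	{ 'C', 'A' }, { 'C', 'B' }, { 'A', 'B' },
};

/* open(O_TRUNC) after the fix: A is released before B is taken. */
static const struct lock_edge trunc_after[] = {
	{ 'C', 'A' }, { 'C', 'B' },
};

/* Plain open(): waits on writeback (A) while holding page lock B. */
static const struct lock_edge plain_open[] = {
	{ 'B', 'A' },
};
```

Under this model, the pre-fix O_TRUNC path conflicts with the plain open() path, while the post-fix path does not, because the A to B edge disappears once truncate_pagecache() runs after fuse_release_nowrite().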
virtiofs: serialize truncate/punch_hole and dax fault path
Currently in fuse we don't seem to have any lock which can serialize the fault
path with the truncate/punch_hole path. With dax support I need one for the
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of the
fault and wants to be serialized with truncate/punch_hole; the DAX code
comments say so explicitly.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest later (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am taking this semaphore only in the dax fault path and
not the regular fault path because existing code does not have one. Maybe
existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am focusing
only on the DAX path, which is a new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exclusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-19 18:19:54 -04:00
if ( dax_truncate )
2021-04-21 17:18:39 +02:00
filemap_invalidate_unlock ( inode - > i_mapping ) ;
2022-04-22 15:48:53 +02:00
out_inode_unlock :
if ( is_wb_truncate | | dax_truncate )
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2014-04-28 14:19:22 +02:00
return err ;
2005-09-09 13:10:30 -07:00
}
2018-11-09 13:33:11 +03:00
static void fuse_prepare_release ( struct fuse_inode * fi , struct fuse_file * ff ,
2021-04-07 14:36:45 +02:00
unsigned int flags , int opcode )
2006-01-16 22:14:42 -08:00
{
2020-05-06 17:44:12 +02:00
struct fuse_conn * fc = ff - > fm - > fc ;
2019-09-10 15:04:10 +02:00
struct fuse_release_args * ra = ff - > release_args ;
2005-09-09 13:10:30 -07:00
2018-11-09 13:33:22 +03:00
/* Inode is NULL on error path of fuse_create_open() */
if ( likely ( fi ) ) {
spin_lock ( & fi - > lock ) ;
list_del ( & ff - > write_entry ) ;
spin_unlock ( & fi - > lock ) ;
}
2009-04-28 16:56:39 +02:00
spin_lock ( & fc - > lock ) ;
if ( ! RB_EMPTY_NODE ( & ff - > polled_node ) )
rb_erase ( & ff - > polled_node , & fc - > polled_files ) ;
spin_unlock ( & fc - > lock ) ;
2011-03-01 16:43:52 -08:00
wake_up_interruptible_all ( & ff - > poll_wait ) ;
2009-04-28 16:56:39 +02:00
2019-09-10 15:04:10 +02:00
ra - > inarg . fh = ff - > fh ;
ra - > inarg . flags = flags ;
ra - > args . in_numargs = 1 ;
ra - > args . in_args [ 0 ] . size = sizeof ( struct fuse_release_in ) ;
ra - > args . in_args [ 0 ] . value = & ra - > inarg ;
ra - > args . opcode = opcode ;
ra - > args . nodeid = ff - > nodeid ;
ra - > args . force = true ;
ra - > args . nocreds = true ;
2005-11-07 00:59:51 -08:00
}
2021-04-07 14:36:45 +02:00
void fuse_file_release ( struct inode * inode , struct fuse_file * ff ,
unsigned int open_flags , fl_owner_t id , bool isdir )
2005-11-07 00:59:51 -08:00
{
2021-04-07 14:36:45 +02:00
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2019-09-10 15:04:10 +02:00
struct fuse_release_args * ra = ff - > release_args ;
2018-12-10 10:54:52 -08:00
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE ;
2009-04-14 10:54:49 +09:00
2021-04-07 14:36:45 +02:00
fuse_prepare_release ( fi , ff , open_flags , opcode ) ;
2009-04-14 10:54:49 +09:00
2011-08-08 16:08:08 +02:00
if ( ff - > flock ) {
2019-09-10 15:04:10 +02:00
ra - > inarg . release_flags | = FUSE_RELEASE_FLOCK_UNLOCK ;
2021-04-07 14:36:45 +02:00
ra - > inarg . lock_owner = fuse_lock_owner_id ( ff - > fm - > fc , id ) ;
2011-08-08 16:08:08 +02:00
}
2014-12-12 09:49:04 +01:00
/* Hold inode until release is finished */
2021-04-07 14:36:45 +02:00
ra - > inode = igrab ( inode ) ;
2009-04-14 10:54:49 +09:00
/*
* Normally this will send the RELEASE request , however if
* some asynchronous READ or WRITE requests are outstanding ,
* the sending will be delayed .
2011-02-25 14:44:58 +01:00
*
* Make the release synchronous if this is a fuseblk mount ,
* synchronous RELEASE is allowed ( and desirable ) in this case
* because the server can be trusted not to screw up .
2009-04-14 10:54:49 +09:00
*/
2020-05-06 17:44:12 +02:00
fuse_file_put ( ff , ff - > fm - > fc - > destroy , isdir ) ;
2005-09-09 13:10:30 -07:00
}
2021-04-07 14:36:45 +02:00
void fuse_release_common ( struct file * file , bool isdir )
{
fuse_file_release ( file_inode ( file ) , file - > private_data , file - > f_flags ,
( fl_owner_t ) file , isdir ) ;
}
2005-09-09 13:10:36 -07:00
static int fuse_open ( struct inode * inode , struct file * file )
{
2009-04-28 16:56:37 +02:00
return fuse_open_common ( inode , file , false ) ;
2005-09-09 13:10:36 -07:00
}
static int fuse_release ( struct inode * inode , struct file * file )
{
2022-04-20 16:05:41 +02:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
/*
* Dirty pages might remain despite write_inode_now ( ) call from
* fuse_flush ( ) due to writes racing with the close .
*/
if ( fc - > writeback_cache )
write_inode_now ( inode , 1 ) ;
2018-12-10 10:54:52 -08:00
fuse_release_common ( file , false ) ;
2009-04-28 16:56:39 +02:00
/* return value is ignored by VFS */
return 0 ;
}
2021-04-07 14:36:45 +02:00
void fuse_sync_release ( struct fuse_inode * fi , struct fuse_file * ff ,
unsigned int flags )
2009-04-28 16:56:39 +02:00
{
2017-03-03 11:04:03 +02:00
WARN_ON ( refcount_read ( & ff - > count ) > 1 ) ;
2018-11-09 13:33:11 +03:00
fuse_prepare_release ( fi , ff , flags , FUSE_RELEASE ) ;
2017-02-22 20:08:25 +01:00
/*
* iput ( NULL ) is a no - op and since the refcount is 1 and everything ' s
* synchronous , we are fine with not doing igrab ( ) here
*/
2018-12-10 10:54:52 -08:00
fuse_file_put ( ff , true , false ) ;
2005-09-09 13:10:36 -07:00
}
2009-04-14 10:54:53 +09:00
EXPORT_SYMBOL_GPL ( fuse_sync_release ) ;
2005-09-09 13:10:36 -07:00
2006-06-25 05:48:52 -07:00
/*
2006-06-25 05:48:55 -07:00
* Scramble the ID space with XTEA , so that the value of the files_struct
* pointer is not exposed to userspace .
2006-06-25 05:48:52 -07:00
*/
2007-10-18 03:07:04 -07:00
u64 fuse_lock_owner_id ( struct fuse_conn * fc , fl_owner_t id )
2006-06-25 05:48:52 -07:00
{
2006-06-25 05:48:55 -07:00
u32 * k = fc - > scramble_key ;
u64 v = ( unsigned long ) id ;
u32 v0 = v ;
u32 v1 = v > > 32 ;
u32 sum = 0 ;
int i ;
for ( i = 0 ; i < 32 ; i + + ) {
v0 + = ( ( v1 < < 4 ^ v1 > > 5 ) + v1 ) ^ ( sum + k [ sum & 3 ] ) ;
sum + = 0x9E3779B9 ;
v1 + = ( ( v0 < < 4 ^ v0 > > 5 ) + v0 ) ^ ( sum + k [ sum > > 11 & 3 ] ) ;
}
return ( u64 ) v0 + ( ( u64 ) v1 < < 32 ) ;
2006-06-25 05:48:52 -07:00
}
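A minimal userspace sketch of the scrambling above, assuming a 64-bit ID and an arbitrary demo key. The names scramble_id, unscramble_id and demo_key are hypothetical; the inverse routine does not exist in the kernel source shown here and is included only to illustrate that the XTEA rounds form a reversible mapping, so distinct owners map to distinct scrambled IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XTEA_DELTA  0x9E3779B9u
#define XTEA_ROUNDS 32

/* Arbitrary key for illustration only; the kernel uses fc->scramble_key. */
static const uint32_t demo_key[4] = { 1, 2, 3, 4 };

/* Same round structure as fuse_lock_owner_id(), on a plain 64-bit value. */
static uint64_t scramble_id(const uint32_t key[4], uint64_t id)
{
	uint32_t v0 = (uint32_t)id;
	uint32_t v1 = (uint32_t)(id >> 32);
	uint32_t sum = 0;

	assert(key != NULL);
	for (int i = 0; i < XTEA_ROUNDS; i++) {
		v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
		sum += XTEA_DELTA;
		v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^
		      (sum + key[(sum >> 11) & 3]);
	}
	return (uint64_t)v0 + ((uint64_t)v1 << 32);
}

/* Inverse of scramble_id(): undo the rounds in reverse order. */
static uint64_t unscramble_id(const uint32_t key[4], uint64_t scrambled)
{
	uint32_t v0 = (uint32_t)scrambled;
	uint32_t v1 = (uint32_t)(scrambled >> 32);
	uint32_t sum = XTEA_DELTA * XTEA_ROUNDS;	/* wraps modulo 2^32 */

	assert(key != NULL);
	for (int i = 0; i < XTEA_ROUNDS; i++) {
		v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^
		      (sum + key[(sum >> 11) & 3]);
		sum -= XTEA_DELTA;
		v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
	}
	return (uint64_t)v0 + ((uint64_t)v1 << 32);
}
```

Without the key, userspace sees only the scrambled value, which is the point of the comment above: the files_struct pointer itself is never exposed.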
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args {
struct fuse_io_args ia ;
2019-09-19 17:11:20 +03:00
struct rb_node writepages_entry ;
2019-09-10 15:04:10 +02:00
struct list_head queue_entry ;
struct fuse_writepage_args * next ;
struct inode * inode ;
2021-09-01 12:39:02 +02:00
struct fuse_sync_bucket * bucket ;
2019-09-10 15:04:10 +02:00
} ;
static struct fuse_writepage_args * fuse_find_writeback ( struct fuse_inode * fi ,
2019-01-16 10:27:59 +01:00
pgoff_t idx_from , pgoff_t idx_to )
{
2019-09-19 17:11:20 +03:00
struct rb_node * n ;
n = fi - > writepages . rb_node ;
2019-01-16 10:27:59 +01:00
2019-09-19 17:11:20 +03:00
while ( n ) {
struct fuse_writepage_args * wpa ;
2019-01-16 10:27:59 +01:00
pgoff_t curr_index ;
2019-09-19 17:11:20 +03:00
wpa = rb_entry ( n , struct fuse_writepage_args , writepages_entry ) ;
2019-09-10 15:04:10 +02:00
WARN_ON ( get_fuse_inode ( wpa - > inode ) ! = fi ) ;
curr_index = wpa - > ia . write . in . offset > > PAGE_SHIFT ;
2019-09-19 17:11:20 +03:00
if ( idx_from > = curr_index + wpa - > ia . ap . num_pages )
n = n - > rb_right ;
else if ( idx_to < curr_index )
n = n - > rb_left ;
else
2019-09-10 15:04:10 +02:00
return wpa ;
2019-01-16 10:27:59 +01:00
}
return NULL ;
}
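The three-way comparison in fuse_find_writeback() is an interval-overlap search. As a minimal sketch, the same test can be run over a sorted array instead of an rbtree; struct wb_range, find_overlap and demo_ranges are hypothetical stand-ins, and the array is assumed to be sorted by start page with no overlapping entries.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one in-flight write covering pages [start, start + num_pages). */
struct wb_range {
	unsigned long start;
	unsigned long num_pages;
};

/* Example data: three disjoint ranges sorted by start page. */
static const struct wb_range demo_ranges[] = {
	{ 0, 4 }, { 8, 2 }, { 16, 8 },
};

/*
 * Find a range intersecting the inclusive query [idx_from, idx_to].
 * Returns NULL if no range overlaps. If several overlap, any one of
 * them may be returned, as in the tree walk.
 */
static const struct wb_range *find_overlap(const struct wb_range *r, size_t n,
					   unsigned long idx_from,
					   unsigned long idx_to)
{
	size_t lo = 0, hi = n;

	assert(idx_from <= idx_to);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx_from >= r[mid].start + r[mid].num_pages)
			lo = mid + 1;	/* query lies entirely to the right */
		else if (idx_to < r[mid].start)
			hi = mid;	/* query lies entirely to the left */
		else
			return &r[mid];	/* the two ranges intersect */
	}
	return NULL;
}
```

Note the asymmetry in the bounds: the stored range is half-open (its end is start + num_pages), while the query is inclusive on both ends, which is why the first comparison uses >= and the second uses <.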
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for each cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
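The temporary-page scheme described in this commit message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct model_page, the model_* functions and writeback_temp_count are hypothetical names, malloc() stands in for the temporary page allocation, and a dirtying attempt during an in-flight write is refused where the kernel would wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	unsigned char *temp;	/* copy owned by the in-flight write, or NULL */
};

/* Plays the role of the NR_WRITEBACK_TEMP counter in this model. */
static int writeback_temp_count;

/* Snapshot the page and mark it clean immediately. Returns 0 or -1. */
static int model_start_writeback(struct model_page *pg)
{
	assert(pg != NULL);
	if (!pg->dirty || pg->temp)
		return -1;
	pg->temp = malloc(MODEL_PAGE_SIZE);
	if (!pg->temp)
		return -1;
	memcpy(pg->temp, pg->data, MODEL_PAGE_SIZE);
	pg->dirty = false;	/* writeback is "finished" for the cached page */
	writeback_temp_count++;
	return 0;
}

/* The server completed the WRITE: drop the temporary copy. */
static void model_end_writeback(struct model_page *pg)
{
	assert(pg != NULL && pg->temp != NULL);
	free(pg->temp);
	pg->temp = NULL;
	writeback_temp_count--;
}

/* Dirty the page; refused while a previous write is still in flight. */
static bool model_dirty_page(struct model_page *pg, unsigned char byte)
{
	assert(pg != NULL);
	if (pg->temp)
		return false;
	memset(pg->data, byte, MODEL_PAGE_SIZE);
	pg->dirty = true;
	return true;
}
```

The design choice this illustrates is that the cached page never stays in a writeback state while an untrusted server holds the data, at the cost of one extra page and one copy per write.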
2008-04-30 00:54:41 -07:00
/*
2013-10-10 17:12:05 +04:00
* Check if any page in a range is under writeback
2008-04-30 00:54:41 -07:00
*
* This is currently done by walking the rbtree of writepage requests
* for the inode ( see fuse_find_writeback ( ) ) .
*/
2013-10-10 17:12:05 +04:00
static bool fuse_range_is_writeback ( struct inode * inode , pgoff_t idx_from ,
pgoff_t idx_to )
2008-04-30 00:54:41 -07:00
{
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2019-01-16 10:27:59 +01:00
bool found ;
2008-04-30 00:54:41 -07:00
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2019-01-16 10:27:59 +01:00
found = fuse_find_writeback ( fi , idx_from , idx_to ) ;
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2008-04-30 00:54:41 -07:00
return found ;
}
2013-10-10 17:12:05 +04:00
static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
{
	return fuse_range_is_writeback(inode, index, index);
}
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure there can only be one temporary page used at a
time for one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
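The temporary-page scheme described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct cached_page, sketch_writepage(), sketch_write_done(), sketch_try_dirty() and the nr_writeback_temp variable are hypothetical names invented here, not kernel symbols, and where the real code would sleep until the previous write finishes, the model simply refuses the dirtying attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a cached page; not a kernel type. */
struct cached_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	bool dirty;
	unsigned char *temp;	/* in-flight temporary copy, or NULL */
};

/* Counts temporary copies in flight, loosely modelled on NR_WRITEBACK_TEMP. */
static int nr_writeback_temp;

/*
 * Start writeback: copy the page into a temporary buffer and clear the
 * dirty flag at once, so the cached page no longer looks "under writeback".
 * Returns 0 on success, -1 if a copy is already in flight or allocation fails.
 */
static int sketch_writepage(struct cached_page *pg)
{
	if (pg->temp)
		return -1;
	pg->temp = malloc(SKETCH_PAGE_SIZE);
	if (!pg->temp)
		return -1;
	memcpy(pg->temp, pg->data, SKETCH_PAGE_SIZE);
	pg->dirty = false;
	nr_writeback_temp++;
	return 0;
}

/* Called when the (possibly slow) userspace server completes the WRITE. */
static void sketch_write_done(struct cached_page *pg)
{
	assert(pg->temp != NULL);
	free(pg->temp);
	pg->temp = NULL;
	nr_writeback_temp--;
}

/*
 * Dirty the page again. At most one temporary copy may exist per cached
 * page, so this fails while a write is in flight (the real code waits).
 */
static bool sketch_try_dirty(struct cached_page *pg, unsigned char byte)
{
	if (pg->temp)
		return false;
	pg->data[0] = byte;
	pg->dirty = true;
	return true;
}
```

In this model the cached page stops being dirty as soon as sketch_writepage() returns, while the counter stays raised until sketch_write_done() runs, which mirrors the split between "finished from the MM's point of view" and "actual write completes" described above.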
2008-04-30 00:54:41 -07:00
/*
 * Wait for page writeback to be completed.
 *
 * Since fuse doesn't rely on the VM writeback tracking, this has to
 * use some other means.
 */
2019-07-22 10:17:17 +03:00
static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
2008-04-30 00:54:41 -07:00
{
	struct fuse_inode *fi = get_fuse_inode(inode);
	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
}
2013-10-10 17:11:54 +04:00
/*
 * Wait for all pending writepages on the inode to finish.
 *
 * This is currently done by blocking further writes with FUSE_NOWRITE
 * and waiting for all sent writes to complete.
 *
 * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
 * could conflict with truncation.
 */
static void fuse_sync_writes(struct inode *inode)
{
	fuse_set_nowrite(inode);
	fuse_release_nowrite(inode);
}
fuse: in fuse_flush only wait if someone wants the return code
If a fuse filesystem is mounted inside a container, there is a problem
during pid namespace destruction. The scenario is:
1. task (a thread in the fuse server, with a fuse file open) starts
exiting, does exit_signals(), goes into fuse_flush() -> wait
2. fuse daemon gets killed, tries to wake everyone up
3. task from 1 is stuck because complete_signal() doesn't wake it up, since
it has PF_EXITING.
The result is that the thread will never be woken up, and pid namespace
destruction will block indefinitely.
To add insult to injury, nobody is waiting for these return codes, since
the pid namespace is being destroyed.
To fix this, let's not block on flush operations when the current task has
PF_EXITING.
This does change the semantics slightly: the wait here is for posix locks
to be unlocked, so the task will exit before things are unlocked. To quote
Miklos:
"remote" posix locks are almost never used due to problems like this, so
I think it's safe to do this.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
Tested-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
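The "only wait if someone wants the return code" idea from this commit message can be sketched in plain userspace C. This is a simplified model, not the kernel implementation: struct flush_job, sketch_flush(), sketch_do_flush(), sketch_run_deferred() and the fixed-size deferred array are hypothetical names made up for illustration, and the array stands in for a real work queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flush request; 'result' is what the server would answer. */
struct flush_job {
	int result;
	bool done;
};

#define SKETCH_MAX_DEFERRED 8

/* Stand-in for a work queue: jobs whose result nobody will read. */
static struct flush_job *deferred[SKETCH_MAX_DEFERRED];
static size_t nr_deferred;

/* Perform the flush and report the server's answer. */
static int sketch_do_flush(struct flush_job *job)
{
	job->done = true;
	return job->result;
}

/*
 * If the caller is exiting, nobody can consume the return code, so queue
 * the job and return 0 immediately instead of blocking on the server.
 * Otherwise flush synchronously and hand back the result.
 */
static int sketch_flush(struct flush_job *job, bool exiting)
{
	if (exiting) {
		assert(nr_deferred < SKETCH_MAX_DEFERRED);
		deferred[nr_deferred++] = job;
		return 0;
	}
	return sketch_do_flush(job);
}

/* Later, in some other context, run the queued jobs and drop the results. */
static void sketch_run_deferred(void)
{
	for (size_t i = 0; i < nr_deferred; i++)
		(void)sketch_do_flush(deferred[i]);
	nr_deferred = 0;
}
```

The trade-off is the one the message names: an exiting caller returns before the flush has actually happened, so anything that depended on the flush completing first (here, remote posix locks being released) is no longer ordered with the exit.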
2023-01-26 17:10:58 +01:00
struct fuse_flush_args {
	struct fuse_args args;
2005-09-09 13:10:30 -07:00
	struct fuse_flush_in inarg;
2023-01-26 17:10:58 +01:00
	struct work_struct work;
	struct file *file;
};
2006-01-06 00:19:39 -08:00
2023-01-26 17:10:58 +01:00
static int fuse_do_flush(struct fuse_flush_args *fa)
{
	int err;
	struct inode *inode = file_inode(fa->file);
	struct fuse_mount *fm = get_fuse_mount(inode);
2021-10-24 16:26:07 +03:00
2014-04-28 14:19:23 +02:00
	err = write_inode_now(inode, 1);
2013-10-10 17:11:54 +04:00
	if (err)
2023-01-26 17:10:58 +01:00
		goto out;
2013-10-10 17:11:54 +04:00
2016-01-22 15:40:57 -05:00
	inode_lock(inode);
2013-10-10 17:11:54 +04:00
	fuse_sync_writes(inode);
2016-01-22 15:40:57 -05:00
	inode_unlock(inode);
2013-10-10 17:11:54 +04:00
2023-01-26 17:10:58 +01:00
	err = filemap_check_errors(fa->file->f_mapping);
2016-07-19 18:12:26 -07:00
	if (err)
2023-01-26 17:10:58 +01:00
		goto out;
2016-07-19 18:12:26 -07:00
2020-05-19 14:50:37 +02:00
	err = 0;
2020-05-06 17:44:12 +02:00
	if (fm->fc->no_flush)
2020-05-19 14:50:37 +02:00
		goto inval_attr_out;
2023-01-26 17:10:58 +01:00
	err = fuse_simple_request(fm, &fa->args);
2005-09-09 13:10:30 -07:00
	if (err == -ENOSYS) {
2020-05-06 17:44:12 +02:00
		fm->fc->no_flush = 1;
2005-09-09 13:10:30 -07:00
		err = 0;
	}
fuse: invalidate inode attr in writeback cache mode
In writeback cache mode, inode->i_blocks is not updated, so utilities such as
du read st_blocks as 0.
For example, when using virtiofs (cache=always and non-DAX mode) with
writeback_cache enabled, writing a new file and then checking its disk usage
with du reports 0 usage.
# uname -r
5.6.0-rc6+
# mount -t virtiofs virtiofs /mnt/virtiofs
# rm -f /mnt/virtiofs/testfile
# create new file and do extend write
# xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
# du -k /mnt/virtiofs/testfile
0 <==== disk usage is 0
# stat -c %s,%b /mnt/virtiofs/testfile
4096,0 <==== i_size is correct, but st_blocks is 0
Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
from server on next getattr.
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
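The fix described in this commit message amounts to clearing one "valid" bit in the attribute cache so that the next lookup refetches the block count from the server. A minimal userspace sketch of that idea follows; struct cached_attr, struct server_attr, the SKETCH_ATTR_* bits and all function names are hypothetical and are not the kernel's STATX_* values or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity bits for individual cached attributes. */
#define SKETCH_ATTR_SIZE   (1u << 0)
#define SKETCH_ATTR_BLOCKS (1u << 1)

/* Client-side attribute cache and the server's authoritative values. */
struct cached_attr {
	uint64_t size;
	uint64_t blocks;
	uint32_t valid;		/* which fields may be served from cache */
};

struct server_attr {
	uint64_t size;
	uint64_t blocks;
};

/* Mark only the attributes named in 'mask' as stale. */
static void sketch_invalidate_attr_mask(struct cached_attr *c, uint32_t mask)
{
	c->valid &= ~mask;
}

/* Return the block count, refetching from the server if the cache is stale. */
static uint64_t sketch_get_blocks(struct cached_attr *c,
				  const struct server_attr *srv, int *fetches)
{
	if (!(c->valid & SKETCH_ATTR_BLOCKS)) {
		c->blocks = srv->blocks;
		c->valid |= SKETCH_ATTR_BLOCKS;
		(*fetches)++;
	}
	return c->blocks;
}

/* After a successful flush in writeback-cache mode, distrust cached blocks. */
static int sketch_flush_done(struct cached_attr *c, int err, bool writeback_cache)
{
	if (!err && writeback_cache)
		sketch_invalidate_attr_mask(c, SKETCH_ATTR_BLOCKS);
	return err;
}
```

Invalidating only the blocks bit, rather than the whole attribute set, keeps the locally maintained size usable while still forcing a fresh block count on the next getattr.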
2020-05-12 10:29:04 +08:00
inval_attr_out:
	/*
	 * In memory i_blocks is not maintained by fuse, if writeback cache is
	 * enabled, i_blocks from cached attr may not be accurate.
	 */
2020-05-06 17:44:12 +02:00
	if (!err && fm->fc->writeback_cache)
2021-10-22 17:03:02 +02:00
		fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
2023-01-26 17:10:58 +01:00
out:
	fput(fa->file);
	kfree(fa);
2005-09-09 13:10:30 -07:00
	return err;
}
2023-01-26 17:10:58 +01:00
static void fuse_flush_async(struct work_struct *work)
{
	struct fuse_flush_args *fa = container_of(work, typeof(*fa), work);

	fuse_do_flush(fa);
}

static int fuse_flush(struct file *file, fl_owner_t id)
{
	struct fuse_flush_args *fa;
	struct inode *inode = file_inode(file);
	struct fuse_mount *fm = get_fuse_mount(inode);
	struct fuse_file *ff = file->private_data;

	if (fuse_is_bad(inode))
		return -EIO;

	if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
		return 0;

	fa = kzalloc(sizeof(*fa), GFP_KERNEL);
	if (!fa)
		return -ENOMEM;

	fa->inarg.fh = ff->fh;
	fa->inarg.lock_owner = fuse_lock_owner_id(fm->fc, id);
	fa->args.opcode = FUSE_FLUSH;
	fa->args.nodeid = get_node_id(inode);
	fa->args.in_numargs = 1;
	fa->args.in_args[0].size = sizeof(fa->inarg);
	fa->args.in_args[0].value = &fa->inarg;
	fa->args.force = true;
	fa->file = get_file(file);

	/* Don't wait if the task is exiting */
	if (current->flags & PF_EXITING) {
		INIT_WORK(&fa->work, fuse_flush_async);
		schedule_work(&fa->work);
		return 0;
	}

	return fuse_do_flush(fa);
}
2011-07-16 20:44:56 -04:00
int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
2018-12-03 10:14:43 +01:00
		      int datasync, int opcode)
2005-09-09 13:10:30 -07:00
{
2010-05-26 17:53:25 +02:00
	struct inode *inode = file->f_mapping->host;
2020-05-06 17:44:12 +02:00
	struct fuse_mount *fm = get_fuse_mount(inode);
2005-09-09 13:10:30 -07:00
	struct fuse_file *ff = file->private_data;
2014-12-12 09:49:05 +01:00
	FUSE_ARGS(args);
2005-09-09 13:10:30 -07:00
	struct fuse_fsync_in inarg;
2018-12-03 10:14:43 +01:00
	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
2019-04-19 15:42:44 -06:00
	inarg.fsync_flags = datasync ? FUSE_FSYNC_FDATASYNC : 0;
2019-09-10 15:04:08 +02:00
	args.opcode = opcode;
	args.nodeid = get_node_id(inode);
	args.in_numargs = 1;
	args.in_args[0].size = sizeof(inarg);
	args.in_args[0].value = &inarg;
2020-05-06 17:44:12 +02:00
	return fuse_simple_request(fm, &args);
2018-12-03 10:14:43 +01:00
}
static int fuse_fsync(struct file *file, loff_t start, loff_t end,
		      int datasync)
{
	struct inode *inode = file->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
2005-09-09 13:10:30 -07:00
	int err;
2020-12-10 15:33:14 +01:00
	if (fuse_is_bad(inode))
2006-01-06 00:19:39 -08:00
		return -EIO;
2016-01-22 15:40:57 -05:00
	inode_lock(inode);
2011-07-16 20:44:56 -04:00
2008-04-30 00:54:41 -07:00
	/*
	 * Start writeback against all dirty pages of the inode, then
	 * wait for all outstanding writes, before sending the FSYNC
	 * request.
	 */
2017-07-22 09:27:43 -04:00
	err = file_write_and_wait_range(file, start, end);
2008-04-30 00:54:41 -07:00
	if (err)
2011-07-16 20:44:56 -04:00
		goto out;
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for each cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
fuse_sync_writes ( inode ) ;
2016-07-19 12:48:01 -07:00
/*
* Due to implementation of fuse writeback
2017-07-22 09:27:43 -04:00
* file_write_and_wait_range ( ) does not catch errors .
2016-07-19 12:48:01 -07:00
* We have to do this directly after fuse_sync_writes ( )
*/
2017-07-22 09:27:43 -04:00
err = file_check_and_advance_wb_err ( file ) ;
2016-07-19 12:48:01 -07:00
if ( err )
goto out ;
2014-04-28 14:19:23 +02:00
err = sync_inode_metadata ( inode , 1 ) ;
if ( err )
goto out ;
2008-04-30 00:54:41 -07:00
2018-12-03 10:14:43 +01:00
if ( fc - > no_fsync )
2014-04-28 14:19:23 +02:00
goto out ;
2013-12-26 19:51:11 +04:00
2018-12-03 10:14:43 +01:00
err = fuse_fsync_common ( file , start , end , datasync , FUSE_FSYNC ) ;
2005-09-09 13:10:30 -07:00
if ( err = = - ENOSYS ) {
2018-12-03 10:14:43 +01:00
fc - > no_fsync = 1 ;
2005-09-09 13:10:30 -07:00
err = 0 ;
}
2011-07-16 20:44:56 -04:00
out :
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2005-09-09 13:10:30 -07:00
2018-12-03 10:14:43 +01:00
return err ;
2005-09-09 13:10:38 -07:00
}
2019-09-10 15:04:09 +02:00
void fuse_read_args_fill ( struct fuse_io_args * ia , struct file * file , loff_t pos ,
size_t count , int opcode )
{
struct fuse_file * ff = file - > private_data ;
struct fuse_args * args = & ia - > ap . args ;
ia - > read . in . fh = ff - > fh ;
ia - > read . in . offset = pos ;
ia - > read . in . size = count ;
ia - > read . in . flags = file - > f_flags ;
args - > opcode = opcode ;
args - > nodeid = ff - > nodeid ;
args - > in_numargs = 1 ;
args - > in_args [ 0 ] . size = sizeof ( ia - > read . in ) ;
args - > in_args [ 0 ] . value = & ia - > read . in ;
args - > out_argvar = true ;
args - > out_numargs = 1 ;
args - > out_args [ 0 ] . size = count ;
}
2019-09-10 15:04:10 +02:00
static void fuse_release_user_pages ( struct fuse_args_pages * ap ,
bool should_dirty )
2012-12-14 19:20:25 +04:00
{
2019-09-10 15:04:10 +02:00
unsigned int i ;
2012-12-14 19:20:25 +04:00
2019-09-10 15:04:10 +02:00
for ( i = 0 ; i < ap - > num_pages ; i + + ) {
2016-08-24 18:17:04 +02:00
if ( should_dirty )
2019-09-10 15:04:10 +02:00
set_page_dirty_lock ( ap - > pages [ i ] ) ;
put_page ( ap - > pages [ i ] ) ;
2012-12-14 19:20:25 +04:00
}
}
2016-03-11 10:35:34 -06:00
static void fuse_io_release ( struct kref * kref )
{
kfree ( container_of ( kref , struct fuse_io_priv , refcnt ) ) ;
}
2015-02-02 14:59:43 +01:00
static ssize_t fuse_get_res_by_io ( struct fuse_io_priv * io )
{
if ( io - > err )
return io - > err ;
if ( io - > bytes > = 0 & & io - > write )
return - EIO ;
return io - > bytes < 0 ? io - > size : io - > bytes ;
}
2023-01-08 17:00:23 -08:00
/*
2012-12-14 19:20:41 +04:00
* In case of short read , the caller sets ' pos ' to the position of
* actual end of fuse request in IO request . Otherwise , if bytes_requested
* = = bytes_transferred or rw = = WRITE , the caller sets ' pos ' to - 1.
*
* An example :
2021-06-04 09:46:17 +08:00
* User requested DIO read of 64 K . It was split into two 32 K fuse requests ,
2012-12-14 19:20:41 +04:00
* both submitted asynchronously . The first of them was ACKed by userspace as
* fully completed ( req - > out . args [ 0 ] . size = = 32 K ) resulting in pos = = - 1. The
* second request was ACKed as short , e . g . only 1 K was read , resulting in
* pos = = 33 K .
*
* Thus , when all fuse requests are completed , the minimal non - negative ' pos '
* will be equal to the length of the longest contiguous fragment of
* transferred data starting from the beginning of IO request .
*/
static void fuse_aio_complete ( struct fuse_io_priv * io , int err , ssize_t pos )
{
int left ;
spin_lock ( & io - > lock ) ;
if ( err )
io - > err = io - > err ? : err ;
else if ( pos > = 0 & & ( io - > bytes < 0 | | pos < io - > bytes ) )
io - > bytes = pos ;
left = - - io - > reqs ;
2016-04-07 17:18:11 +05:30
if ( ! left & & io - > blocking )
2015-02-02 14:59:43 +01:00
complete ( io - > done ) ;
2012-12-14 19:20:41 +04:00
spin_unlock ( & io - > lock ) ;
2016-04-07 17:18:11 +05:30
if ( ! left & & ! io - > blocking ) {
2015-02-02 14:59:43 +01:00
ssize_t res = fuse_get_res_by_io ( io ) ;
2012-12-14 19:20:41 +04:00
2015-02-02 14:59:43 +01:00
if ( res > = 0 ) {
struct inode * inode = file_inode ( io - > iocb - > ki_filp ) ;
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2012-12-14 19:20:41 +04:00
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2018-11-09 13:33:17 +03:00
fi - > attr_version = atomic64_inc_return ( & fc - > attr_version ) ;
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2012-12-14 19:20:41 +04:00
}
2021-10-21 09:22:35 -06:00
io - > iocb - > ki_complete ( io - > iocb , res ) ;
2012-12-14 19:20:41 +04:00
}
2016-03-11 10:35:34 -06:00
kref_put ( & io - > refcnt , fuse_io_release ) ;
2012-12-14 19:20:41 +04:00
}
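The bookkeeping described in the comment above fuse_aio_complete() can be tried out in user space. The following is a minimal sketch, not the kernel code: `struct io_state`, `io_submit()`, `io_complete()` and `io_result()` are hypothetical stand-ins for the relevant fields of struct fuse_io_priv and for the logic in fuse_async_req_send(), fuse_aio_complete() and fuse_get_res_by_io(). Locking, reference counting and the completion callback are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical stand-in for the bookkeeping fields of struct fuse_io_priv. */
struct io_state {
	int write;	/* non-zero if the IO request is a write */
	int err;	/* first error reported, 0 if none */
	ssize_t bytes;	/* minimal non-negative 'pos' seen so far, -1 if none */
	size_t size;	/* total bytes requested over all sub-requests */
	int reqs;	/* sub-requests still in flight */
};

/* Account for one sub-request of num_bytes being sent. */
static void io_submit(struct io_state *io, size_t num_bytes)
{
	io->size += num_bytes;
	io->reqs++;
}

/*
 * Record the completion of one sub-request. 'pos' is -1 for a full
 * transfer, otherwise the end of the transferred data relative to the
 * start of the IO request. Returns the number of sub-requests left.
 */
static int io_complete(struct io_state *io, int err, ssize_t pos)
{
	assert(io->reqs > 0);
	if (err) {
		if (!io->err)
			io->err = err;	/* keep only the first error */
	} else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes)) {
		io->bytes = pos;	/* keep the minimal non-negative pos */
	}
	return --io->reqs;
}

/* Final result once every sub-request has completed. */
static ssize_t io_result(const struct io_state *io)
{
	if (io->err)
		return io->err;
	if (io->bytes >= 0 && io->write)
		return -EIO;	/* a short write is reported as an error */
	return io->bytes < 0 ? (ssize_t)io->size : io->bytes;
}
```

With the 64K read from the comment (two 32K sub-requests, the second one returning only 1K), the recorded values are -1 and 33K, so the result is 33K: the length of the contiguous data from the start of the request.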
2019-09-10 15:04:10 +02:00
static struct fuse_io_args * fuse_io_alloc ( struct fuse_io_priv * io ,
unsigned int npages )
{
struct fuse_io_args * ia ;
ia = kzalloc ( sizeof ( * ia ) , GFP_KERNEL ) ;
if ( ia ) {
ia - > io = io ;
ia - > ap . pages = fuse_pages_alloc ( npages , GFP_KERNEL ,
& ia - > ap . descs ) ;
if ( ! ia - > ap . pages ) {
kfree ( ia ) ;
ia = NULL ;
}
}
return ia ;
}
static void fuse_io_free ( struct fuse_io_args * ia )
2012-12-14 19:20:41 +04:00
{
2019-09-10 15:04:10 +02:00
kfree ( ia - > ap . pages ) ;
kfree ( ia ) ;
}
2020-05-06 17:44:12 +02:00
static void fuse_aio_complete_req ( struct fuse_mount * fm , struct fuse_args * args ,
2019-09-10 15:04:10 +02:00
int err )
{
struct fuse_io_args * ia = container_of ( args , typeof ( * ia ) , ap . args ) ;
struct fuse_io_priv * io = ia - > io ;
2012-12-14 19:20:41 +04:00
ssize_t pos = - 1 ;
2019-09-10 15:04:10 +02:00
fuse_release_user_pages ( & ia - > ap , io - > should_dirty ) ;
2012-12-14 19:20:41 +04:00
2019-09-10 15:04:10 +02:00
if ( err ) {
/* Nothing */
} else if ( io - > write ) {
if ( ia - > write . out . size > ia - > write . in . size ) {
err = - EIO ;
} else if ( ia - > write . in . size ! = ia - > write . out . size ) {
pos = ia - > write . in . offset - io - > offset +
ia - > write . out . size ;
}
2012-12-14 19:20:41 +04:00
} else {
2019-09-10 15:04:10 +02:00
u32 outsize = args - > out_args [ 0 ] . size ;
if ( ia - > read . in . size ! = outsize )
pos = ia - > read . in . offset - io - > offset + outsize ;
2012-12-14 19:20:41 +04:00
}
2019-09-10 15:04:10 +02:00
fuse_aio_complete ( io , err , pos ) ;
fuse_io_free ( ia ) ;
2012-12-14 19:20:41 +04:00
}
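How fuse_aio_complete_req() derives 'pos' for a single sub-request can be illustrated in isolation. This is a user-space sketch under assumptions: `short_transfer_pos()` and its parameter names are hypothetical, and plain integer types replace the kernel structures. It mirrors the branch structure above: a write that reports more bytes than were sent is an error, and any other size mismatch yields a position relative to the start of the whole IO request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute 'pos' for one completed sub-request.
 *   req_offset  - file offset of this sub-request
 *   io_offset   - file offset where the whole IO request starts
 *   requested   - bytes asked for in this sub-request
 *   transferred - bytes the userspace filesystem reported
 * Stores -1 in *pos for a full transfer. Returns 0 or -EIO.
 */
static int short_transfer_pos(int is_write, int64_t req_offset,
			      int64_t io_offset, uint32_t requested,
			      uint32_t transferred, int64_t *pos)
{
	assert(pos != NULL);
	*pos = -1;
	if (is_write && transferred > requested)
		return -EIO;	/* server claims to have written too much */
	if (transferred != requested)
		*pos = req_offset - io_offset + (int64_t)transferred;
	return 0;
}
```

For the second 32K sub-request of a read starting at offset 0, a 1K reply gives 32K + 1K = 33K, matching the example in the comment before fuse_aio_complete().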
2020-05-06 17:44:12 +02:00
static ssize_t fuse_async_req_send ( struct fuse_mount * fm ,
2019-09-10 15:04:10 +02:00
struct fuse_io_args * ia , size_t num_bytes )
2012-12-14 19:20:41 +04:00
{
2019-09-10 15:04:10 +02:00
ssize_t err ;
struct fuse_io_priv * io = ia - > io ;
2012-12-14 19:20:41 +04:00
spin_lock ( & io - > lock ) ;
2016-03-11 10:35:34 -06:00
kref_get ( & io - > refcnt ) ;
2012-12-14 19:20:41 +04:00
io - > size + = num_bytes ;
io - > reqs + + ;
spin_unlock ( & io - > lock ) ;
2019-09-10 15:04:10 +02:00
ia - > ap . args . end = fuse_aio_complete_req ;
2020-04-20 17:01:34 +02:00
ia - > ap . args . may_block = io - > should_dirty ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_background ( fm , & ia - > ap . args , GFP_KERNEL ) ;
2019-11-25 20:48:46 +01:00
if ( err )
2020-05-06 17:44:12 +02:00
fuse_aio_complete_req ( fm , & ia - > ap . args , err ) ;
2012-12-14 19:20:41 +04:00
2019-11-25 20:48:46 +01:00
return num_bytes ;
2012-12-14 19:20:41 +04:00
}
2019-09-10 15:04:10 +02:00
static ssize_t fuse_send_read ( struct fuse_io_args * ia , loff_t pos , size_t count ,
fl_owner_t owner )
2005-09-09 13:10:36 -07:00
{
2019-09-10 15:04:10 +02:00
struct file * file = ia - > io - > iocb - > ki_filp ;
2009-04-28 16:56:37 +02:00
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2007-10-18 03:07:04 -07:00
2019-09-10 15:04:10 +02:00
fuse_read_args_fill ( ia , file , pos , count , FUSE_READ ) ;
2007-10-18 03:07:04 -07:00
if ( owner ! = NULL ) {
2019-09-10 15:04:10 +02:00
ia - > read . in . read_flags | = FUSE_READ_LOCKOWNER ;
2020-05-06 17:44:12 +02:00
ia - > read . in . lock_owner = fuse_lock_owner_id ( fm - > fc , owner ) ;
2007-10-18 03:07:04 -07:00
}
2012-12-14 19:20:51 +04:00
2019-09-10 15:04:10 +02:00
if ( ia - > io - > async )
2020-05-06 17:44:12 +02:00
return fuse_async_req_send ( fm , ia , count ) ;
2012-12-14 19:20:51 +04:00
2020-05-06 17:44:12 +02:00
return fuse_simple_request ( fm , & ia - > ap . args ) ;
2005-09-09 13:10:36 -07:00
}
2008-04-30 00:54:43 -07:00
static void fuse_read_update_size ( struct inode * inode , loff_t size ,
u64 attr_ver )
{
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2021-10-22 17:03:03 +02:00
if ( attr_ver > = fi - > attr_version & & size < inode - > i_size & &
fuse: hotfix truncate_pagecache() issue
The way fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Because, w/o i_mutex held, we are never sure whether
'oldsize' and 'attr->size' are valid by the time of execution of
truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
release fc->lock in the middle of fuse_change_attributes(), we completely
lose control of actions which may happen with the given inode until we reach
truncate_pagecache. The list of potentially dangerous actions includes
mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
The typical outcome of doing truncate_pagecache() with outdated arguments
is data corruption from user point of view. This is (in some sense)
acceptable in cases when the issue is triggered by a change of the file on
the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real life case I discovered by fsx-linux
looked like this:
1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size, releases fc->lock, but before comparing oldsize vs
attr->size..
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.
The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of inode changed significantly.
Theoretically, the following is possible:
1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.
The result will be the loss of the data the user wrote in the fourth step.
The patch is a hotfix resolving the issue in a simplistic way: let's skip
dangerous i_size update and truncate_pagecache if an operation changing
file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. And to handle them properly, more sophisticated
and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
postpone it until the issue is well discussed on the mailing list(s).
Changed in v2:
- improved patch description to cover both sides of the issue.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-08-30 17:06:04 +04:00
! test_bit ( FUSE_I_SIZE_UNSTABLE , & fi - > state ) ) {
2018-11-09 13:33:17 +03:00
fi - > attr_version = atomic64_inc_return ( & fc - > attr_version ) ;
2008-04-30 00:54:43 -07:00
i_size_write ( inode , size ) ;
}
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2008-04-30 00:54:43 -07:00
}
2019-09-10 15:04:09 +02:00
static void fuse_short_read ( struct inode * inode , u64 attr_ver , size_t num_read ,
2019-09-10 15:04:10 +02:00
struct fuse_args_pages * ap )
2013-10-10 17:10:16 +04:00
{
2013-10-10 17:10:46 +04:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2021-04-14 10:40:56 +02:00
/*
* If writeback_cache is enabled , a short read means there ' s a hole in
* the file . Some data after the hole is in page cache , but has not
* reached the client fs yet . So the hole is not present there .
*/
if ( ! fc - > writeback_cache ) {
2019-09-10 15:04:10 +02:00
loff_t pos = page_offset ( ap - > pages [ 0 ] ) + num_read ;
2013-10-10 17:10:46 +04:00
fuse_read_update_size ( inode , pos , attr_ver ) ;
}
2013-10-10 17:10:16 +04:00
}
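The short-read handling in fuse_short_read() and fuse_read_update_size() comes down to one guarded update of the cached size. Below is a simplified user-space sketch of that guard. `struct cached_inode` and `apply_short_read()` are hypothetical names, a plain counter stands in for the connection-wide atomic attr_version, and the fi->lock locking and the writeback_cache check are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cached state kept per inode. */
struct cached_inode {
	uint64_t attr_version;	/* version of the cached attributes */
	int64_t size;		/* cached file size */
	int size_unstable;	/* a size-changing operation is in progress */
};

/*
 * A read of the page at page_off returned only num_read bytes, so the
 * file apparently ends at page_off + num_read. Shrink the cached size
 * only if the attribute version sampled before the read (attr_ver) is
 * not older than the cached one, the new size really is smaller, and no
 * size-changing operation is running. Returns 1 if the size was updated.
 */
static int apply_short_read(struct cached_inode *ci, uint64_t *global_ver,
			    uint64_t attr_ver, int64_t page_off,
			    size_t num_read)
{
	int64_t new_size = page_off + (int64_t)num_read;

	assert(ci != NULL && global_ver != NULL);
	if (attr_ver >= ci->attr_version && new_size < ci->size &&
	    !ci->size_unstable) {
		ci->attr_version = ++*global_ver;
		ci->size = new_size;
		return 1;
	}
	return 0;
}
```

The version check is what keeps a slow read reply from overwriting a size that was refreshed by a newer attribute update in the meantime.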
2013-10-10 17:11:25 +04:00
static int fuse_do_readpage ( struct file * file , struct page * page )
2005-09-09 13:10:30 -07:00
{
struct inode * inode = page - > mapping - > host ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = get_fuse_mount ( inode ) ;
2008-04-30 00:54:43 -07:00
loff_t pos = page_offset ( page ) ;
2019-09-10 15:04:09 +02:00
struct fuse_page_desc desc = { . length = PAGE_SIZE } ;
struct fuse_io_args ia = {
. ap . args . page_zeroing = true ,
. ap . args . out_pages = true ,
. ap . num_pages = 1 ,
. ap . pages = & page ,
. ap . descs = & desc ,
} ;
ssize_t res ;
2008-04-30 00:54:43 -07:00
u64 attr_ver ;
2006-01-06 00:19:39 -08:00
2008-04-30 00:54:41 -07:00
/*
2011-03-30 22:57:33 -03:00
* Page writeback can extend beyond the lifetime of the
2008-04-30 00:54:41 -07:00
* page - cache page , so make sure we read a properly synced
* page .
*/
fuse_wait_on_page_writeback ( inode , page - > index ) ;
2020-05-06 17:44:12 +02:00
attr_ver = fuse_get_attr_version ( fm - > fc ) ;
2008-04-30 00:54:43 -07:00
2020-02-06 16:39:28 +01:00
/* Don't overflow end offset */
if ( pos + ( desc . length - 1 ) = = LLONG_MAX )
desc . length - - ;
2019-09-10 15:04:09 +02:00
fuse_read_args_fill ( & ia , file , pos , desc . length , FUSE_READ ) ;
2020-05-06 17:44:12 +02:00
res = fuse_simple_request ( fm , & ia . ap . args ) ;
2019-09-10 15:04:09 +02:00
if ( res < 0 )
return res ;
/*
* Short read means EOF . If file size is larger , truncate it
*/
if ( res < desc . length )
2019-09-10 15:04:10 +02:00
fuse_short_read ( inode , attr_ver , res , & ia . ap ) ;
2008-04-30 00:54:43 -07:00
2019-09-10 15:04:09 +02:00
SetPageUptodate ( page ) ;
2013-10-10 17:11:25 +04:00
2019-09-10 15:04:09 +02:00
return 0 ;
2013-10-10 17:11:25 +04:00
}
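The "Don't overflow end offset" adjustment in fuse_do_readpage() can be checked on its own. This is a sketch with a hypothetical helper name, `clamp_read_len()`. The comparison is written as a subtraction from LLONG_MAX so that the user-space version never evaluates a signed sum that could overflow; for non-negative offsets it tests the same condition as `pos + (len - 1) == LLONG_MAX`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * If the last byte of the range [pos, pos + len - 1] sits exactly at
 * LLONG_MAX, shorten the request by one byte so that the end offset
 * pos + len stays representable. Otherwise return len unchanged.
 */
static size_t clamp_read_len(long long pos, size_t len)
{
	assert(pos >= 0 && len > 0);
	if ((unsigned long long)(len - 1) ==
	    (unsigned long long)(LLONG_MAX - pos))
		len--;
	return len;
}
```

For a 4096-byte page that starts at LLONG_MAX - 4095, the helper returns 4095; for any page further from the limit, the length is left alone.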
2022-04-29 11:12:16 -04:00
static int fuse_read_folio ( struct file * file , struct folio * folio )
2013-10-10 17:11:25 +04:00
{
2022-04-29 11:12:16 -04:00
struct page * page = & folio - > page ;
2013-10-10 17:11:25 +04:00
struct inode * inode = page - > mapping - > host ;
int err ;
err = - EIO ;
2020-12-10 15:33:14 +01:00
if ( fuse_is_bad ( inode ) )
2013-10-10 17:11:25 +04:00
goto out ;
err = fuse_do_readpage ( file , page ) ;
2013-11-05 03:55:43 -08:00
fuse_invalidate_atime ( inode ) ;
2005-09-09 13:10:30 -07:00
out :
unlock_page ( page ) ;
return err ;
}
2020-05-06 17:44:12 +02:00
static void fuse_readpages_end ( struct fuse_mount * fm , struct fuse_args * args ,
2019-09-10 15:04:10 +02:00
int err )
2005-09-09 13:10:33 -07:00
{
2006-01-16 22:14:46 -08:00
int i ;
2019-09-10 15:04:10 +02:00
struct fuse_io_args * ia = container_of ( args , typeof ( * ia ) , ap . args ) ;
struct fuse_args_pages * ap = & ia - > ap ;
size_t count = ia - > read . in . size ;
size_t num_read = args - > out_args [ 0 ] . size ;
2010-05-25 15:06:07 +02:00
struct address_space * mapping = NULL ;
2006-01-16 22:14:46 -08:00
2019-09-10 15:04:10 +02:00
for ( i = 0 ; mapping = = NULL & & i < ap - > num_pages ; i + + )
mapping = ap - > pages [ i ] - > mapping ;
2008-04-30 00:54:43 -07:00
2010-05-25 15:06:07 +02:00
if ( mapping ) {
struct inode * inode = mapping - > host ;
/*
* Short read means EOF . If file size is larger , truncate it
*/
2019-09-10 15:04:10 +02:00
if ( ! err & & num_read < count )
fuse_short_read ( inode , ia - > read . attr_ver , num_read , ap ) ;
2010-05-25 15:06:07 +02:00
2013-11-05 03:55:43 -08:00
fuse_invalidate_atime ( inode ) ;
2010-05-25 15:06:07 +02:00
}
2006-01-16 22:14:46 -08:00
2019-09-10 15:04:10 +02:00
for ( i = 0 ; i < ap - > num_pages ; i + + ) {
struct page * page = ap - > pages [ i ] ;
if ( ! err )
2005-09-09 13:10:33 -07:00
SetPageUptodate ( page ) ;
2006-01-16 22:14:46 -08:00
else
SetPageError ( page ) ;
2005-09-09 13:10:33 -07:00
unlock_page ( page ) ;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
put_page ( page ) ;
2005-09-09 13:10:33 -07:00
}
2019-09-10 15:04:10 +02:00
if ( ia - > ff )
fuse_file_put ( ia - > ff , false , false ) ;
fuse_io_free ( ia ) ;
2006-01-16 22:14:46 -08:00
}
2019-09-10 15:04:10 +02:00
static void fuse_send_readpages ( struct fuse_io_args * ia , struct file * file )
2006-01-16 22:14:46 -08:00
{
2009-04-28 16:56:37 +02:00
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2019-09-10 15:04:10 +02:00
struct fuse_args_pages * ap = & ia - > ap ;
loff_t pos = page_offset ( ap - > pages [ 0 ] ) ;
size_t count = ap - > num_pages < < PAGE_SHIFT ;
2020-01-16 11:09:36 +01:00
ssize_t res ;
2019-09-10 15:04:10 +02:00
int err ;
ap - > args . out_pages = true ;
ap - > args . page_zeroing = true ;
ap - > args . page_replace = true ;
2020-02-06 16:39:28 +01:00
/* Don't overflow end offset */
if ( pos + ( count - 1 ) = = LLONG_MAX ) {
count - - ;
ap - > descs [ ap - > num_pages - 1 ] . length - - ;
}
WARN_ON ( ( loff_t ) ( pos + count ) < 0 ) ;
2019-09-10 15:04:10 +02:00
fuse_read_args_fill ( ia , file , pos , count , FUSE_READ ) ;
2020-05-06 17:44:12 +02:00
ia - > read . attr_ver = fuse_get_attr_version ( fm - > fc ) ;
if ( fm - > fc - > async_read ) {
2019-09-10 15:04:10 +02:00
ia - > ff = fuse_file_get ( ff ) ;
ap - > args . end = fuse_readpages_end ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_background ( fm , & ap - > args , GFP_KERNEL ) ;
2019-09-10 15:04:10 +02:00
if ( ! err )
return ;
2006-02-01 03:04:40 -08:00
} else {
2020-05-06 17:44:12 +02:00
res = fuse_simple_request ( fm , & ap - > args ) ;
2020-01-16 11:09:36 +01:00
err = res < 0 ? res : 0 ;
2006-02-01 03:04:40 -08:00
}
2020-05-06 17:44:12 +02:00
fuse_readpages_end ( fm , & ap - > args , err ) ;
2005-09-09 13:10:33 -07:00
}
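The end-offset clamp in fuse_send_readpages() can be tried out in userspace. The following is a minimal sketch, not the kernel code: clamp_read_count() is a hypothetical helper name, and the kernel additionally shortens the length of the last page descriptor, which is left out here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Sketch of the "don't overflow end offset" check (hypothetical helper).
 * If the last byte of the request would sit exactly at LLONG_MAX, drop one
 * byte so that pos + count still fits in a signed 64-bit offset.
 */
static size_t clamp_read_count(long long pos, size_t count)
{
    assert(pos >= 0 && count > 0);
    /* Do the comparison in unsigned arithmetic to avoid signed overflow. */
    if ((unsigned long long)pos + (count - 1) == (unsigned long long)LLONG_MAX)
        count--;
    return count;
}
```

A request whose last byte lands exactly on LLONG_MAX loses one byte; any request ending earlier is returned unchanged.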
2020-06-01 21:47:31 -07:00
static void fuse_readahead ( struct readahead_control * rac )
2005-09-09 13:10:33 -07:00
{
2020-06-01 21:47:31 -07:00
struct inode * inode = rac - > mapping - > host ;
2005-09-09 13:10:33 -07:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2020-06-01 21:47:31 -07:00
unsigned int i , max_pages , nr_pages = 0 ;
2005-09-09 13:10:33 -07:00
2020-12-10 15:33:14 +01:00
if ( fuse_is_bad ( inode ) )
2020-06-01 21:47:31 -07:00
return ;
2006-01-06 00:19:39 -08:00
2020-06-01 21:47:31 -07:00
max_pages = min_t ( unsigned int , fc - > max_pages ,
fc - > max_read / PAGE_SIZE ) ;
2005-09-09 13:10:33 -07:00
2020-06-01 21:47:31 -07:00
for ( ; ; ) {
struct fuse_io_args * ia ;
struct fuse_args_pages * ap ;
2022-03-22 14:38:58 -07:00
if ( fc - > num_background > = fc - > congestion_threshold & &
rac - > ra - > async_size > = readahead_count ( rac ) )
/*
* Congested and only async pages left , so skip the
* rest .
*/
break ;
2020-06-01 21:47:31 -07:00
nr_pages = readahead_count ( rac ) - nr_pages ;
if ( nr_pages > max_pages )
nr_pages = max_pages ;
if ( nr_pages = = 0 )
break ;
ia = fuse_io_alloc ( NULL , nr_pages ) ;
if ( ! ia )
return ;
ap = & ia - > ap ;
nr_pages = __readahead_batch ( rac , ap - > pages , nr_pages ) ;
for ( i = 0 ; i < nr_pages ; i + + ) {
fuse_wait_on_page_writeback ( inode ,
readahead_index ( rac ) + i ) ;
ap - > descs [ i ] . length = PAGE_SIZE ;
}
ap - > num_pages = nr_pages ;
fuse_send_readpages ( ia , rac - > file ) ;
2006-04-10 22:54:49 -07:00
}
2005-09-09 13:10:33 -07:00
}
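The batching done by fuse_readahead() can be illustrated without any kernel types. This is only a sketch under stated assumptions: SKETCH_PAGE_SIZE, batch_limit() and split_batches() are hypothetical names, a 4096-byte page is assumed, and the congestion check and page writeback waits are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/* Per-request page limit: the smaller of max_pages and max_read in pages. */
static unsigned int batch_limit(unsigned int max_pages, unsigned int max_read)
{
    unsigned int by_bytes = max_read / SKETCH_PAGE_SIZE;

    return max_pages < by_bytes ? max_pages : by_bytes;
}

/*
 * Split total_pages into consecutive batches of at most limit pages.
 * Stores each batch size in out[] and returns the number of batches.
 */
static size_t split_batches(unsigned int total_pages, unsigned int limit,
                            unsigned int *out, size_t out_len)
{
    size_t n = 0;

    assert(out != NULL);
    while (total_pages > 0 && limit > 0 && n < out_len) {
        unsigned int nr = total_pages < limit ? total_pages : limit;

        out[n++] = nr;
        total_pages -= nr;
    }
    return n;
}
```

With max_pages = 32 and max_read = 65536 the limit is 16 pages, so a 40-page readahead window would be sent as batches of 16, 16 and 8 pages.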
2019-01-24 10:40:17 +01:00
static ssize_t fuse_cache_read_iter ( struct kiocb * iocb , struct iov_iter * to )
2007-11-28 16:21:59 -08:00
{
struct inode * inode = iocb - > ki_filp - > f_mapping - > host ;
2012-07-16 15:23:50 -04:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2007-11-28 16:21:59 -08:00
2012-07-16 15:23:50 -04:00
/*
* In auto invalidate mode , always update attributes on read .
* Otherwise , only update if we attempt to read past EOF ( to ensure
* i_size is up to date ) .
*/
if ( fc - > auto_inval_data | |
2014-04-02 14:47:09 -04:00
( iocb - > ki_pos + iov_iter_count ( to ) > i_size_read ( inode ) ) ) {
2007-11-28 16:21:59 -08:00
int err ;
2021-10-22 17:03:03 +02:00
err = fuse_update_attributes ( inode , iocb - > ki_filp , STATX_SIZE ) ;
2007-11-28 16:21:59 -08:00
if ( err )
return err ;
}
2014-04-02 14:47:09 -04:00
return generic_file_read_iter ( iocb , to ) ;
2007-11-28 16:21:59 -08:00
}
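The condition that decides whether fuse_cache_read_iter() refreshes attributes before reading can be written as a small predicate. A minimal sketch, assuming plain integer types in place of the kernel structures; needs_attr_refresh() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the refresh decision (hypothetical helper): always refresh in
 * auto-invalidate mode, otherwise only when the read would extend past the
 * cached file size.
 */
static bool needs_attr_refresh(bool auto_inval_data, long long pos,
                               size_t count, long long i_size)
{
    assert(pos >= 0 && i_size >= 0);
    return auto_inval_data || pos + (long long)count > i_size;
}
```

A read that ends exactly at the cached size does not trigger a refresh; one byte further does.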
2019-09-10 15:04:09 +02:00
static void fuse_write_args_fill ( struct fuse_io_args * ia , struct fuse_file * ff ,
loff_t pos , size_t count )
{
struct fuse_args * args = & ia - > ap . args ;
ia - > write . in . fh = ff - > fh ;
ia - > write . in . offset = pos ;
ia - > write . in . size = count ;
args - > opcode = FUSE_WRITE ;
args - > nodeid = ff - > nodeid ;
args - > in_numargs = 2 ;
2020-05-06 17:44:12 +02:00
if ( ff - > fm - > fc - > minor < 9 )
2019-09-10 15:04:09 +02:00
args - > in_args [ 0 ] . size = FUSE_COMPAT_WRITE_IN_SIZE ;
else
args - > in_args [ 0 ] . size = sizeof ( ia - > write . in ) ;
args - > in_args [ 0 ] . value = & ia - > write . in ;
args - > in_args [ 1 ] . size = count ;
args - > out_numargs = 1 ;
args - > out_args [ 0 ] . size = sizeof ( ia - > write . out ) ;
args - > out_args [ 0 ] . value = & ia - > write . out ;
}
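The protocol-version check in fuse_write_args_fill() selects how many bytes of the write header are sent. The sketch below is illustrative only: struct sketch_write_in is a local stand-in whose layout is an assumption modelled on the FUSE write header, and the compat size is assumed to cover the fields that precede lock_owner. Consult the uapi header for the authoritative definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the write header; the layout is an assumption. */
struct sketch_write_in {
    uint64_t fh;
    uint64_t offset;
    uint32_t size;
    uint32_t write_flags;
    uint64_t lock_owner;
    uint32_t flags;
    uint32_t padding;
};

/* Assumed compat size: everything before lock_owner. */
#define SKETCH_COMPAT_WRITE_IN_SIZE offsetof(struct sketch_write_in, lock_owner)

/* Header size to send for a given protocol minor version. */
static size_t write_in_size(unsigned int minor)
{
    return minor < 9 ? SKETCH_COMPAT_WRITE_IN_SIZE
                     : sizeof(struct sketch_write_in);
}
```

Under these assumptions a server speaking minor version 8 receives a 24-byte header, while version 9 and later receive the full 40 bytes.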
static unsigned int fuse_write_flags ( struct kiocb * iocb )
{
unsigned int flags = iocb - > ki_filp - > f_flags ;
2022-05-22 09:39:27 -04:00
if ( iocb_is_dsync ( iocb ) )
2019-09-10 15:04:09 +02:00
flags | = O_DSYNC ;
if ( iocb - > ki_flags & IOCB_SYNC )
flags | = O_SYNC ;
return flags ;
}
2019-09-10 15:04:10 +02:00
static ssize_t fuse_send_write ( struct fuse_io_args * ia , loff_t pos ,
size_t count , fl_owner_t owner )
2007-10-18 03:07:03 -07:00
{
2019-09-10 15:04:10 +02:00
struct kiocb * iocb = ia - > io - > iocb ;
2017-09-12 16:57:53 +02:00
struct file * file = iocb - > ki_filp ;
2009-04-28 16:56:37 +02:00
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2019-09-10 15:04:10 +02:00
struct fuse_write_in * inarg = & ia - > write . in ;
ssize_t err ;
2009-04-28 16:56:36 +02:00
2019-09-10 15:04:10 +02:00
fuse_write_args_fill ( ia , ff , pos , count ) ;
2019-09-10 15:04:09 +02:00
inarg - > flags = fuse_write_flags ( iocb ) ;
2007-10-18 03:07:04 -07:00
if ( owner ! = NULL ) {
inarg - > write_flags | = FUSE_WRITE_LOCKOWNER ;
2020-05-06 17:44:12 +02:00
inarg - > lock_owner = fuse_lock_owner_id ( fm - > fc , owner ) ;
2007-10-18 03:07:04 -07:00
}
2012-12-14 19:20:51 +04:00
2019-09-10 15:04:10 +02:00
if ( ia - > io - > async )
2020-05-06 17:44:12 +02:00
return fuse_async_req_send ( fm , ia , count ) ;
2019-09-10 15:04:10 +02:00
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & ia - > ap . args ) ;
2019-09-10 15:04:10 +02:00
if ( ! err & & ia - > write . out . size > count )
err = - EIO ;
2012-12-14 19:20:51 +04:00
2019-09-10 15:04:10 +02:00
return err ? : ia - > write . out . size ;
2005-09-09 13:10:30 -07:00
}
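The return-value handling at the end of fuse_send_write() for the synchronous case can be summarised as follows. This is a userspace sketch, not the kernel code; write_result() is a hypothetical helper, and the asynchronous path is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Sketch of the synchronous result handling (hypothetical helper).
 * err is 0 or a negative errno from the request; reported is the size the
 * server claims to have written; requested is the size that was sent.
 */
static long write_result(long err, size_t reported, size_t requested)
{
    assert(err <= 0);
    /* A server may not claim to have written more than it was given. */
    if (!err && reported > requested)
        err = -EIO;
    return err ? err : (long)reported;
}
```

A short write is passed through as a smaller positive count, whereas an over-long reply is turned into -EIO.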
2021-10-22 17:03:02 +02:00
bool fuse_write_update_attr ( struct inode * inode , loff_t pos , ssize_t written )
2008-04-30 00:54:41 -07:00
{
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2013-12-26 19:51:11 +04:00
bool ret = false ;
2008-04-30 00:54:41 -07:00
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2018-11-09 13:33:17 +03:00
fi - > attr_version = atomic64_inc_return ( & fc - > attr_version ) ;
2021-10-22 17:03:02 +02:00
if ( written > 0 & & pos > inode - > i_size ) {
2008-04-30 00:54:41 -07:00
i_size_write ( inode , pos ) ;
2013-12-26 19:51:11 +04:00
ret = true ;
}
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2013-12-26 19:51:11 +04:00
2021-10-22 17:03:02 +02:00
fuse_invalidate_attr_mask ( inode , FUSE_STATX_MODSIZE ) ;
2013-12-26 19:51:11 +04:00
return ret ;
2008-04-30 00:54:41 -07:00
}
2019-09-10 15:04:09 +02:00
static ssize_t fuse_send_write_pages ( struct fuse_io_args * ia ,
struct kiocb * iocb , struct inode * inode ,
loff_t pos , size_t count )
2008-04-30 00:54:42 -07:00
{
2019-09-10 15:04:09 +02:00
struct fuse_args_pages * ap = & ia - > ap ;
struct file * file = iocb - > ki_filp ;
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2019-09-10 15:04:09 +02:00
unsigned int offset , i ;
fuse: fix write deadlock
There are two modes for write(2) and friends in fuse:
a) write through (update page cache, send sync WRITE request to userspace)
b) buffered write (update page cache, async writeout later)
The write through method kept all the page cache pages locked that were
used for the request. Keeping more than one page locked is deadlock prone
and Qian Cai demonstrated this with trinity fuzzing.
The reason for keeping the pages locked is that concurrent mapped reads
shouldn't try to pull possibly stale data into the page cache.
For full page writes, the easy way to fix this is to make the cached page
be the authoritative source by marking the page PG_uptodate immediately.
After this the page can be safely unlocked, since mapped/cached reads will
take the written data from the cache.
Concurrent mapped writes will now cause data in the original WRITE request
to be updated; this however doesn't cause any data inconsistency and this
scenario should be exceedingly rare anyway.
If the WRITE request returns with an error in the above case, currently the
page is not marked uptodate; this means that a concurrent read will always
read consistent data. After this patch the page is uptodate between
writing to the cache and receiving the error: there's a window where a cached
read will read the wrong data. While theoretically this could be a
regression, it is unlikely to be one in practice, since this is normal for
buffered writes.
In case of a partial page write to an already uptodate page the locking is
also unnecessary, with the above caveats.
Partial write of a not uptodate page still needs to be handled. One way
would be to read the complete page before doing the write. This is not
possible, since it might break filesystems that don't expect any READ
requests when the file was opened O_WRONLY.
The other solution is to serialize the synchronous write with reads from
the partial pages. The easiest way to do this is to keep the partial pages
locked. The problem is that a write() may involve two such pages (one head
and one tail). This patch fixes it by only locking the partial tail page.
If there's a partial head page as well, then split that off as a separate
WRITE request.
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-fsdevel/4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw/
Fixes: ea9b9907b82a ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-21 16:12:49 -04:00
bool short_write ;
2019-09-10 15:04:09 +02:00
int err ;
2008-04-30 00:54:42 -07:00
2019-09-10 15:04:09 +02:00
for ( i = 0 ; i < ap - > num_pages ; i + + )
fuse_wait_on_page_writeback ( inode , ap - > pages [ i ] - > index ) ;
2008-04-30 00:54:42 -07:00
2019-09-10 15:04:09 +02:00
fuse_write_args_fill ( ia , ff , pos , count ) ;
ia - > write . in . flags = fuse_write_flags ( iocb ) ;
2020-10-09 14:15:08 -04:00
if ( fm - > fc - > handle_killpriv_v2 & & ! capable ( CAP_FSETID ) )
ia - > write . in . write_flags | = FUSE_WRITE_KILL_SUIDGID ;
2008-04-30 00:54:42 -07:00
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & ap - > args ) ;
2019-11-12 11:49:04 +01:00
if ( ! err & & ia - > write . out . size > count )
err = - EIO ;
2019-09-10 15:04:09 +02:00
2020-10-21 16:12:49 -04:00
short_write = ia - > write . out . size < count ;
2019-09-10 15:04:09 +02:00
offset = ap - > descs [ 0 ] . offset ;
count = ia - > write . out . size ;
for ( i = 0 ; i < ap - > num_pages ; i + + ) {
struct page * page = ap - > pages [ i ] ;
2008-04-30 00:54:42 -07:00
2020-10-21 16:12:49 -04:00
if ( err ) {
ClearPageUptodate ( page ) ;
} else {
if ( count > = PAGE_SIZE - offset )
count - = PAGE_SIZE - offset ;
else {
if ( short_write )
ClearPageUptodate ( page ) ;
count = 0 ;
}
offset = 0 ;
}
if ( ia - > write . page_locked & & ( i = = ap - > num_pages - 1 ) )
unlock_page ( page ) ;
2016-04-01 15:29:47 +03:00
put_page ( page ) ;
2008-04-30 00:54:42 -07:00
}
2019-09-10 15:04:09 +02:00
return err ;
2008-04-30 00:54:42 -07:00
}
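The per-page accounting that fuse_send_write_pages() performs after the reply arrives can be followed with a small model. A sketch under stated assumptions: mark_stale_pages() and SKETCH_PAGE_SIZE are hypothetical names, a 4096-byte page is assumed, and a boolean array stands in for clearing the uptodate flag on real pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/*
 * Decide which cached pages can no longer be trusted after a WRITE reply.
 * failed:    the request returned an error
 * offset:    byte offset of the data within the first page
 * requested: bytes sent to the server
 * written:   bytes the server reports as written
 * stale[i] is set to true where the kernel would clear the uptodate flag.
 */
static void mark_stale_pages(bool failed, unsigned int offset,
                             size_t requested, size_t written,
                             bool *stale, size_t num_pages)
{
    bool short_write = written < requested;
    size_t count = written;
    size_t i;

    assert(offset < SKETCH_PAGE_SIZE);
    for (i = 0; i < num_pages; i++) {
        stale[i] = false;
        if (failed) {
            stale[i] = true;
            continue;
        }
        if (count >= SKETCH_PAGE_SIZE - offset) {
            /* This page was covered completely by the written bytes. */
            count -= SKETCH_PAGE_SIZE - offset;
        } else {
            /* Only partly covered: stale if the server wrote less than asked. */
            if (short_write)
                stale[i] = true;
            count = 0;
        }
        offset = 0;
    }
}
```

For a three-page request of 12288 bytes of which the server wrote 6000, the first page stays valid and the remaining two are marked stale. A complete write with a partial tail page leaves every page valid.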
2020-10-21 16:12:49 -04:00
static ssize_t fuse_fill_write_pages ( struct fuse_io_args * ia ,
2019-09-10 15:04:09 +02:00
struct address_space * mapping ,
struct iov_iter * ii , loff_t pos ,
unsigned int max_pages )
2008-04-30 00:54:42 -07:00
{
2020-10-21 16:12:49 -04:00
struct fuse_args_pages * ap = & ia - > ap ;
2008-04-30 00:54:42 -07:00
struct fuse_conn * fc = get_fuse_conn ( mapping - > host ) ;
2016-04-01 15:29:47 +03:00
unsigned offset = pos & ( PAGE_SIZE - 1 ) ;
2008-04-30 00:54:42 -07:00
size_t count = 0 ;
int err ;
2019-09-10 15:04:09 +02:00
ap - > args . in_pages = true ;
ap - > descs [ 0 ] . offset = offset ;
2008-04-30 00:54:42 -07:00
do {
size_t tmp ;
struct page * page ;
2016-04-01 15:29:47 +03:00
pgoff_t index = pos > > PAGE_SHIFT ;
size_t bytes = min_t ( size_t , PAGE_SIZE - offset ,
2008-04-30 00:54:42 -07:00
iov_iter_count ( ii ) ) ;
bytes = min_t ( size_t , bytes , fc - > max_write - count ) ;
again :
err = - EFAULT ;
2021-08-02 14:54:16 +02:00
if ( fault_in_iov_iter_readable ( ii , bytes ) )
2008-04-30 00:54:42 -07:00
break ;
err = - ENOMEM ;
2022-02-22 11:25:12 -05:00
page = grab_cache_page_write_begin ( mapping , index ) ;
2008-04-30 00:54:42 -07:00
if ( ! page )
break ;
mm: flush dcache before writing into page to avoid alias
The cache alias problem occurs if changes made through a user shared mapping
are not flushed before copying: the user and kernel mappings may then be mapped
into two different cache lines, and it is impossible to guarantee coherence
after iov_iter_copy_from_user_atomic. So the right steps should be:
flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);
More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.
Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:
int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);
The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 13:44:02 -08:00
if ( mapping_writably_mapped ( mapping ) )
flush_dcache_page ( page ) ;
2021-04-30 10:26:41 -04:00
tmp = copy_page_from_iter_atomic ( page , offset , bytes , ii ) ;
2008-04-30 00:54:42 -07:00
flush_dcache_page ( page ) ;
if ( ! tmp ) {
unlock_page ( page ) ;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
put_page ( page ) ;
2008-04-30 00:54:42 -07:00
goto again ;
}
err = 0 ;
2019-09-10 15:04:09 +02:00
ap - > pages [ ap - > num_pages ] = page ;
ap - > descs [ ap - > num_pages ] . length = tmp ;
ap - > num_pages + + ;
2008-04-30 00:54:42 -07:00
count + = tmp ;
pos + = tmp ;
offset + = tmp ;
2016-04-01 15:29:47 +03:00
if ( offset = = PAGE_SIZE )
2008-04-30 00:54:42 -07:00
offset = 0 ;
fuse: fix write deadlock
There are two modes for write(2) and friends in fuse:
a) write through (update page cache, send sync WRITE request to userspace)
b) buffered write (update page cache, async writeout later)
The write through method kept all the page cache pages locked that were
used for the request. Keeping more than one page locked is deadlock prone
and Qian Cai demonstrated this with trinity fuzzing.
The reason for keeping the pages locked is that concurrent mapped reads
shouldn't try to pull possibly stale data into the page cache.
For full page writes, the easy way to fix this is to make the cached page
be the authoritative source by marking the page PG_uptodate immediately.
After this the page can be safely unlocked, since mapped/cached reads will
take the written data from the cache.
Concurrent mapped writes will now cause data in the original WRITE request
to be updated; this however doesn't cause any data inconsistency and this
scenario should be exceedingly rare anyway.
If the WRITE request returns with an error in the above case, currently the
page is not marked uptodate; this means that a concurrent read will always
read consistent data. After this patch the page is uptodate between
writing to the cache and receiving the error: there's a window where a cached
read will read the wrong data. While theoretically this could be a
regression, it is unlikely to be one in practice, since this is normal for
buffered writes.
In case of a partial page write to an already uptodate page the locking is
also unnecessary, with the above caveats.
Partial write of a not uptodate page still needs to be handled. One way
would be to read the complete page before doing the write. This is not
possible, since it might break filesystems that don't expect any READ
requests when the file was opened O_WRONLY.
The other solution is to serialize the synchronous write with reads from
the partial pages. The easiest way to do this is to keep the partial pages
locked. The problem is that a write() may involve two such pages (one head
and one tail). This patch fixes it by only locking the partial tail page.
If there's a partial head page as well, then split that off as a separate
WRITE request.
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-fsdevel/4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw/
Fixes: ea9b9907b82a ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-21 16:12:49 -04:00
/* If we copied full page, mark it uptodate */
if ( tmp = = PAGE_SIZE )
SetPageUptodate ( page ) ;
if ( PageUptodate ( page ) ) {
unlock_page ( page ) ;
} else {
ia - > write . page_locked = true ;
break ;
}
2008-05-12 14:02:32 -07:00
if ( ! fc - > big_writes )
break ;
2008-04-30 00:54:42 -07:00
} while ( iov_iter_count ( ii ) & & count < fc - > max_write & &
2019-09-10 15:04:09 +02:00
ap - > num_pages < max_pages & & offset = = 0 ) ;
2008-04-30 00:54:42 -07:00
return count > 0 ? count : err ;
}
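The page-locking rule described in the "fuse: fix write deadlock" message can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `model_fill_request`, `MODEL_PAGE_SIZE`, and the `uptodate` array are hypothetical names, and the model assumes 4 KiB pages, big_writes enabled, no max_write limit, and copies that never fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SHIFT/PAGE_SIZE instead. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE ((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * Hypothetical model of how one WRITE request is filled.
 * uptodate[i] says whether page i is already uptodate in the page cache.
 * Returns the number of bytes packed into this request and reports
 * whether the last page had to stay locked.
 */
static size_t model_fill_request(size_t pos, size_t remaining, bool *uptodate,
				 size_t max_pages, bool *page_locked)
{
	size_t offset = pos & (MODEL_PAGE_SIZE - 1);
	size_t count = 0;
	size_t num_pages = 0;

	assert(uptodate && page_locked && max_pages > 0);
	*page_locked = false;

	while (remaining > 0 && num_pages < max_pages) {
		size_t index = pos >> MODEL_PAGE_SHIFT;
		size_t bytes = MODEL_PAGE_SIZE - offset;

		if (bytes > remaining)
			bytes = remaining;

		count += bytes;
		pos += bytes;
		remaining -= bytes;
		num_pages++;
		offset = (offset + bytes) & (MODEL_PAGE_SIZE - 1);

		/* A full-page copy makes the cached page authoritative. */
		if (bytes == MODEL_PAGE_SIZE)
			uptodate[index] = true;

		/* A partial write to a non-uptodate page stays locked and ends the request. */
		if (!uptodate[index]) {
			*page_locked = true;
			break;
		}

		/* The write ended inside this page, so nothing is left to add. */
		if (offset != 0)
			break;
	}
	return count;
}
```

For a 12288-byte write at offset 100 into pages that are not yet uptodate, the first call covers only the 3996-byte partial head page, and a second call starting at the page boundary covers the remaining 8292 bytes, leaving only the partial tail page locked.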
fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. Patch user-mode FUSE with a configurable max_pages parameter. The patch
will be provided later.
3. Run the test script and perform the test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 15:37:06 +03:00
static inline unsigned int fuse_wr_pages ( loff_t pos , size_t len ,
unsigned int max_pages )
2012-10-26 19:49:00 +04:00
{
2018-09-06 15:37:06 +03:00
return min_t ( unsigned int ,
2016-04-01 15:29:47 +03:00
( ( pos + len - 1 ) > > PAGE_SHIFT ) -
( pos > > PAGE_SHIFT ) + 1 ,
2018-09-06 15:37:06 +03:00
max_pages ) ;
2012-10-26 19:49:00 +04:00
}
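The page-count arithmetic in fuse_wr_pages() can be checked with a small userspace sketch. The names `example_wr_pages` and `EXAMPLE_PAGE_SHIFT` are hypothetical, and a 4 KiB page size is assumed; the kernel uses PAGE_SHIFT and min_t() instead, while this sketch clamps in 64-bit arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT == 12 on many configurations). */
#define EXAMPLE_PAGE_SHIFT 12

/*
 * Number of pages touched by a write of len bytes at byte offset pos,
 * clamped to max_pages.
 */
static unsigned int example_wr_pages(uint64_t pos, size_t len,
				     unsigned int max_pages)
{
	uint64_t first, last, span;

	assert(len > 0);	/* pos + len - 1 is only meaningful for len > 0 */

	first = pos >> EXAMPLE_PAGE_SHIFT;		/* index of the first page */
	last = (pos + len - 1) >> EXAMPLE_PAGE_SHIFT;	/* index of the last page */
	span = last - first + 1;

	return span < max_pages ? (unsigned int)span : max_pages;
}
```

A 2-byte write at offset 4095 straddles a page boundary and therefore needs two pages, while a 1 MiB write at offset 0 spans 256 pages and is clamped to whatever max_pages allows.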
2023-06-01 16:59:03 +02:00
static ssize_t fuse_perform_write ( struct kiocb * iocb , struct iov_iter * ii )
2008-04-30 00:54:42 -07:00
{
2023-06-01 16:59:03 +02:00
struct address_space * mapping = iocb - > ki_filp - > f_mapping ;
2008-04-30 00:54:42 -07:00
struct inode * inode = mapping - > host ;
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
fuse: hotfix truncate_pagecache() issue
The way fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Without i_mutex held, we are never sure whether
'oldsize' and 'attr->size' are valid by the time of execution of
truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
released fc->lock in the middle of fuse_change_attributes(), we completely
lose control of actions which may happen with given inode until we reach
truncate_pagecache. The list of potentially dangerous actions includes
mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
The typical outcome of doing truncate_pagecache() with outdated arguments
is data corruption from the user's point of view. This is (in some sense)
acceptable in cases when the issue is triggered by a change of the file on
the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real life case I discovered by fsx-linux
looked like this:
1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size, releases fc->lock, but before comparing oldsize vs
attr->size..
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.
The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of inode changed significantly.
Theoretically, the following is possible:
1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.
The result will be the loss of the data the user wrote in the fourth step.
The patch is a hotfix resolving the issue in a simplistic way: let's skip
dangerous i_size update and truncate_pagecache if an operation changing
file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. And to handle them properly, more sophisticated
and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
postpone it until the issue is well discussed on the mailing list(s).
Changed in v2:
- improved patch description to cover both sides of the issue.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-08-30 17:06:04 +04:00
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2023-06-01 16:59:03 +02:00
loff_t pos = iocb - > ki_pos ;
2008-04-30 00:54:42 -07:00
int err = 0 ;
ssize_t res = 0 ;
2013-08-30 17:06:04 +04:00
if ( inode - > i_size < pos + iov_iter_count ( ii ) )
set_bit ( FUSE_I_SIZE_UNSTABLE , & fi - > state ) ;
2008-04-30 00:54:42 -07:00
do {
ssize_t count ;
2019-09-10 15:04:09 +02:00
struct fuse_io_args ia = { } ;
struct fuse_args_pages * ap = & ia . ap ;
2018-09-06 15:37:06 +03:00
unsigned int nr_pages = fuse_wr_pages ( pos , iov_iter_count ( ii ) ,
fc - > max_pages ) ;
2008-04-30 00:54:42 -07:00
2019-09-10 15:04:09 +02:00
ap - > pages = fuse_pages_alloc ( nr_pages , GFP_KERNEL , & ap - > descs ) ;
if ( ! ap - > pages ) {
err = - ENOMEM ;
2008-04-30 00:54:42 -07:00
break ;
}
2020-10-21 16:12:49 -04:00
count = fuse_fill_write_pages ( & ia , mapping , ii , pos , nr_pages ) ;
2008-04-30 00:54:42 -07:00
if ( count < = 0 ) {
err = count ;
} else {
2019-09-10 15:04:09 +02:00
err = fuse_send_write_pages ( & ia , iocb , inode ,
pos , count ) ;
2008-04-30 00:54:42 -07:00
if ( ! err ) {
2019-09-10 15:04:09 +02:00
size_t num_written = ia . write . out . size ;
2008-04-30 00:54:42 -07:00
res + = num_written ;
pos + = num_written ;
/* break out of the loop on short write */
if ( num_written ! = count )
err = - EIO ;
}
}
2019-09-10 15:04:09 +02:00
kfree ( ap - > pages ) ;
2008-04-30 00:54:42 -07:00
} while ( ! err & & iov_iter_count ( ii ) ) ;
2021-10-22 17:03:02 +02:00
fuse_write_update_attr ( inode , pos , res ) ;
2013-08-30 17:06:04 +04:00
clear_bit ( FUSE_I_SIZE_UNSTABLE , & fi - > state ) ;
2008-04-30 00:54:42 -07:00
2023-06-01 16:59:02 +02:00
if ( ! res )
return err ;
iocb - > ki_pos + = res ;
return res ;
2008-04-30 00:54:42 -07:00
}
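The result accounting in fuse_perform_write() (accumulate acknowledged bytes, stop on a short write, and report an error only when nothing was written) can be sketched in userspace as follows. This is a simplified model under stated assumptions: `struct model_chunk` and `model_perform_write` are hypothetical names, and each array element stands in for one WRITE request whose outcome is given up front.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the outcome of one WRITE request. */
struct model_chunk {
	size_t count;		/* bytes packed into the request */
	size_t num_written;	/* bytes the server acknowledged */
	int send_err;		/* 0, or a negative errno from sending */
};

/*
 * Model of the accounting loop: returns the total bytes written if any
 * were written, otherwise the first error (or 0). *pos is advanced by
 * the number of bytes written.
 */
static ssize_t model_perform_write(const struct model_chunk *chunks, size_t n,
				   off_t *pos)
{
	int err = 0;
	ssize_t res = 0;
	size_t i;

	assert(pos && (chunks || n == 0));

	for (i = 0; i < n && !err; i++) {
		err = chunks[i].send_err;
		if (!err) {
			res += (ssize_t)chunks[i].num_written;
			/* A short write ends the loop with -EIO. */
			if (chunks[i].num_written != chunks[i].count)
				err = -EIO;
		}
	}

	/* Partial progress wins over the error, as in the function above. */
	if (!res)
		return err;
	*pos += res;
	return res;
}
```

In this model a short second request ends the loop early: the caller sees the bytes acknowledged so far rather than -EIO, and only a write that made no progress at all returns the error.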
2019-01-24 10:40:17 +01:00
static ssize_t fuse_cache_write_iter ( struct kiocb * iocb , struct iov_iter * from )
2008-04-30 00:54:42 -07:00
{
struct file * file = iocb - > ki_filp ;
struct address_space * mapping = file - > f_mapping ;
ssize_t written = 0 ;
struct inode * inode = mapping - > host ;
ssize_t err ;
2020-10-09 14:15:10 -04:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2008-04-30 00:54:42 -07:00
2020-10-09 14:15:10 -04:00
if ( fc - > writeback_cache ) {
2013-10-10 17:12:18 +04:00
/* Update size (EOF optimization) and mode (SUID clearing) */
2021-10-22 17:03:03 +02:00
err = fuse_update_attributes ( mapping - > host , file ,
STATX_SIZE | STATX_MODE ) ;
2013-10-10 17:12:18 +04:00
if ( err )
return err ;
2020-10-09 14:15:10 -04:00
if ( fc - > handle_killpriv_v2 & &
2023-01-13 12:49:27 +01:00
setattr_should_drop_suidgid ( & nop_mnt_idmap ,
file_inode ( file ) ) ) {
2020-10-09 14:15:10 -04:00
goto writethrough ;
}
2014-04-03 14:33:23 -04:00
return generic_file_write_iter ( iocb , from ) ;
2013-10-10 17:12:18 +04:00
}
2020-10-09 14:15:10 -04:00
writethrough :
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2008-04-30 00:54:42 -07:00
2015-04-09 12:55:47 -04:00
err = generic_write_checks ( iocb , from ) ;
if ( err < = 0 )
2008-04-30 00:54:42 -07:00
goto out ;
2015-05-21 16:05:53 +02:00
err = file_remove_privs ( file ) ;
2008-04-30 00:54:42 -07:00
if ( err )
goto out ;
2012-03-26 09:59:21 -04:00
err = file_update_time ( file ) ;
if ( err )
goto out ;
2008-04-30 00:54:42 -07:00
2015-04-09 13:52:01 -04:00
if ( iocb - > ki_flags & IOCB_DIRECT ) {
2016-04-07 08:51:56 -07:00
written = generic_file_direct_write ( iocb , from ) ;
2014-04-03 14:33:23 -04:00
if ( written < 0 | | ! iov_iter_count ( from ) )
2012-02-17 12:46:25 -05:00
goto out ;
2023-06-01 16:59:04 +02:00
written = direct_write_fallback ( iocb , from , written ,
fuse_perform_write ( iocb , from ) ) ;
2012-02-17 12:46:25 -05:00
} else {
2023-06-01 16:59:03 +02:00
written = fuse_perform_write ( iocb , from ) ;
2012-02-17 12:46:25 -05:00
}
2008-04-30 00:54:42 -07:00
out :
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2017-09-12 16:57:53 +02:00
if ( written > 0 )
written = generic_write_sync ( iocb , written ) ;
2008-04-30 00:54:42 -07:00
return written ? written : err ;
}
2012-10-26 19:50:29 +04:00
static inline unsigned long fuse_get_user_addr ( const struct iov_iter * ii )
{
2023-03-29 08:52:15 -06:00
return ( unsigned long ) iter_iov ( ii ) - > iov_base + ii - > iov_offset ;
2012-10-26 19:50:29 +04:00
}
static inline size_t fuse_get_frag_size ( const struct iov_iter * ii ,
size_t max_size )
{
return min ( iov_iter_single_seg_count ( ii ) , max_size ) ;
}
2019-09-10 15:04:10 +02:00
static int fuse_get_user_pages ( struct fuse_args_pages * ap , struct iov_iter * ii ,
size_t * nbytesp , int write ,
unsigned int max_pages )
2005-09-09 13:10:35 -07:00
{
2012-10-26 19:50:29 +04:00
size_t nbytes = 0 ; /* # bytes already packed in req */
2016-03-14 21:57:35 -07:00
ssize_t ret = 0 ;
2012-10-26 19:50:15 +04:00
2009-04-02 14:25:34 +02:00
/* Special case for kernel I/O: can copy directly into the buffer */
2018-10-22 13:07:28 +01:00
if ( iov_iter_is_kvec ( ii ) ) {
2012-10-26 19:50:29 +04:00
unsigned long user_addr = fuse_get_user_addr ( ii ) ;
size_t frag_size = fuse_get_frag_size ( ii , * nbytesp ) ;
2009-04-02 14:25:34 +02:00
if ( write )
2019-09-10 15:04:10 +02:00
ap - > args . in_args [ 1 ] . value = ( void * ) user_addr ;
2009-04-02 14:25:34 +02:00
else
2019-09-10 15:04:10 +02:00
ap - > args . out_args [ 0 ] . value = ( void * ) user_addr ;
2009-04-02 14:25:34 +02:00
2012-10-26 19:50:15 +04:00
iov_iter_advance ( ii , frag_size ) ;
* nbytesp = frag_size ;
2009-04-02 14:25:34 +02:00
return 0 ;
}
2005-09-09 13:10:35 -07:00
2019-09-10 15:04:10 +02:00
while ( nbytes < * nbytesp & & ap - > num_pages < max_pages ) {
2012-10-26 19:50:29 +04:00
unsigned npages ;
2014-03-19 01:16:16 -04:00
size_t start ;
2022-06-09 10:28:36 -04:00
ret = iov_iter_get_pages2 ( ii , & ap - > pages [ ap - > num_pages ] ,
2014-09-24 17:09:11 +02:00
* nbytesp - nbytes ,
2019-09-10 15:04:10 +02:00
max_pages - ap - > num_pages ,
2014-06-18 20:34:33 -04:00
& start ) ;
2012-10-26 19:50:29 +04:00
if ( ret < 0 )
2016-03-14 21:57:35 -07:00
break ;
2012-10-26 19:50:29 +04:00
2014-03-16 16:08:30 -04:00
nbytes + = ret ;
2012-10-26 19:50:29 +04:00
2014-03-16 16:08:30 -04:00
ret + = start ;
2021-05-25 15:40:47 +08:00
npages = DIV_ROUND_UP ( ret , PAGE_SIZE ) ;
2012-10-26 19:50:29 +04:00
2019-09-10 15:04:10 +02:00
ap - > descs [ ap - > num_pages ] . offset = start ;
fuse_page_descs_length_init ( ap - > descs , ap - > num_pages , npages ) ;
2012-10-26 19:50:29 +04:00
2019-09-10 15:04:10 +02:00
ap - > num_pages + = npages ;
ap - > descs [ ap - > num_pages - 1 ] . length - =
2014-03-16 16:08:30 -04:00
( PAGE_SIZE - ret ) & ( PAGE_SIZE - 1 ) ;
2012-10-26 19:50:29 +04:00
}
2009-04-02 14:25:34 +02:00
2022-03-07 16:30:44 +01:00
ap - > args . user_pages = true ;
2009-04-02 14:25:34 +02:00
if ( write )
2020-01-14 20:39:45 +08:00
ap - > args . in_pages = true ;
2009-04-02 14:25:34 +02:00
else
2020-01-14 20:39:45 +08:00
ap - > args . out_pages = true ;
2009-04-02 14:25:34 +02:00
2012-10-26 19:50:29 +04:00
* nbytesp = nbytes ;
2009-04-02 14:25:34 +02:00
2016-03-25 10:53:41 -07:00
return ret < 0 ? ret : 0 ;
2005-09-09 13:10:35 -07:00
}
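The descriptor arithmetic in the loop above (round the byte range up to whole pages, then trim the tail of the last page) can be checked in isolation. The sketch below is a userspace model under stated assumptions: a 4096-byte page, hypothetical names `struct model_desc` and `fill_descs()`, and the assumption that the length initializer sets each descriptor's length to the page size minus its offset.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed page size for this model */

/* Hypothetical stand-in for a per-page (offset, length) descriptor. */
struct model_desc {
	unsigned int offset;
	unsigned int length;
};

/*
 * Describe `bytes` bytes that begin `start` bytes into the first page.
 * Returns the number of descriptors filled in.
 */
static unsigned int fill_descs(struct model_desc *descs, unsigned int max,
			       size_t start, size_t bytes)
{
	size_t end = start + bytes;
	unsigned int npages, i;

	assert(start < MODEL_PAGE_SIZE && bytes > 0);

	/* Same effect as DIV_ROUND_UP(end, PAGE_SIZE). */
	npages = (unsigned int)((end + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE);
	assert(npages <= max);

	for (i = 0; i < npages; i++) {
		/* Only the first page starts at a non-zero offset. */
		descs[i].offset = (i == 0) ? (unsigned int)start : 0;
		descs[i].length = MODEL_PAGE_SIZE - descs[i].offset;
	}

	/* Trim the unused tail of the last page; zero if `end` is aligned. */
	descs[npages - 1].length -=
		(unsigned int)((MODEL_PAGE_SIZE - end) & (MODEL_PAGE_SIZE - 1));

	return npages;
}
```

For example, 5000 bytes starting at offset 100 span two pages with lengths 3996 and 1004, which sum back to 5000.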
2014-03-16 15:50:47 -04:00
ssize_t fuse_direct_io ( struct fuse_io_priv * io , struct iov_iter * iter ,
loff_t * ppos , int flags )
2005-09-09 13:10:35 -07:00
{
2013-10-10 17:12:05 +04:00
int write = flags & FUSE_DIO_WRITE ;
int cuse = flags & FUSE_DIO_CUSE ;
2017-09-12 16:57:53 +02:00
struct file * file = io - > iocb - > ki_filp ;
2013-10-10 17:12:05 +04:00
struct inode * inode = file - > f_mapping - > host ;
2009-04-28 16:56:37 +02:00
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_conn * fc = ff - > fm - > fc ;
2005-09-09 13:10:35 -07:00
size_t nmax = write ? fc - > max_write : fc - > max_read ;
loff_t pos = * ppos ;
2014-03-16 15:50:47 -04:00
size_t count = iov_iter_count ( iter ) ;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
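Why the first two rewrites are safe can be seen with a small userspace sketch. It assumes both shifts are 12 (4 KiB pages); the `MODEL_*` macros and the helper names are hypothetical and exist only for this illustration. When the two shifts are equal, shifting by their difference is a shift by zero, so the expression can be replaced by its operand, and page indexes are computed with the plain page shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this model: both shifts describe 4 KiB pages. */
#define MODEL_PAGE_SHIFT        12
#define MODEL_PAGE_CACHE_SHIFT  12

/* Old-style conversion: a shift by (12 - 12) == 0, i.e. a no-op. */
static uint64_t old_style_scale(uint64_t x)
{
	return x << (MODEL_PAGE_CACHE_SHIFT - MODEL_PAGE_SHIFT);
}

/* Index of the page containing byte offset `pos`. */
static uint64_t first_page_index(uint64_t pos)
{
	return pos >> MODEL_PAGE_SHIFT;
}

/* Index of the page containing the last byte of [pos, pos + count). */
static uint64_t last_page_index(uint64_t pos, uint64_t count)
{
	assert(count > 0);
	return (pos + count - 1) >> MODEL_PAGE_SHIFT;
}
```

A two-byte range starting at offset 4095 touches pages 0 and 1, while a full page starting at offset 4096 touches page 1 only.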
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will also
be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
pgoff_t idx_from = pos > > PAGE_SHIFT ;
pgoff_t idx_to = ( pos + count - 1 ) > > PAGE_SHIFT ;
2005-09-09 13:10:35 -07:00
ssize_t res = 0 ;
2016-03-14 21:57:35 -07:00
int err = 0 ;
2019-09-10 15:04:10 +02:00
struct fuse_io_args * ia ;
unsigned int max_pages ;
2006-01-06 00:19:39 -08:00
2019-09-10 15:04:10 +02:00
max_pages = iov_iter_npages ( iter , fc - > max_pages ) ;
ia = fuse_io_alloc ( io , max_pages ) ;
if ( ! ia )
return - ENOMEM ;
2005-09-09 13:10:35 -07:00
2013-10-10 17:12:05 +04:00
if ( ! cuse & & fuse_range_is_writeback ( inode , idx_from , idx_to ) ) {
if ( ! write )
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2013-10-10 17:12:05 +04:00
fuse_sync_writes ( inode ) ;
if ( ! write )
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2013-10-10 17:12:05 +04:00
}
2022-05-22 14:59:25 -04:00
io - > should_dirty = ! write & & user_backed_iter ( iter ) ;
2005-09-09 13:10:35 -07:00
while ( count ) {
2019-09-10 15:04:10 +02:00
ssize_t nres ;
2009-04-28 16:56:37 +02:00
fl_owner_t owner = current - > files ;
2009-04-02 14:25:34 +02:00
size_t nbytes = min ( count , nmax ) ;
2019-09-10 15:04:10 +02:00
err = fuse_get_user_pages ( & ia - > ap , iter , & nbytes , write ,
max_pages ) ;
2016-03-14 21:57:35 -07:00
if ( err & & ! nbytes )
2005-09-09 13:10:35 -07:00
break ;
2009-04-02 14:25:34 +02:00
2019-05-27 09:08:12 +02:00
if ( write ) {
2019-09-10 15:04:10 +02:00
if ( ! capable ( CAP_FSETID ) )
2020-11-11 17:22:32 +01:00
ia - > write . in . write_flags | = FUSE_WRITE_KILL_SUIDGID ;
2019-05-27 09:08:12 +02:00
2019-09-10 15:04:10 +02:00
nres = fuse_send_write ( ia , pos , nbytes , owner ) ;
2019-05-27 09:08:12 +02:00
} else {
2019-09-10 15:04:10 +02:00
nres = fuse_send_read ( ia , pos , nbytes , owner ) ;
2019-05-27 09:08:12 +02:00
}
2009-04-28 16:56:37 +02:00
2019-09-10 15:04:10 +02:00
if ( ! io - > async | | nres < 0 ) {
fuse_release_user_pages ( & ia - > ap , io - > should_dirty ) ;
fuse_io_free ( ia ) ;
}
ia = NULL ;
if ( nres < 0 ) {
2020-02-06 16:39:28 +01:00
iov_iter_revert ( iter , nbytes ) ;
2019-09-10 15:04:10 +02:00
err = nres ;
2005-09-09 13:10:35 -07:00
break ;
}
2019-09-10 15:04:10 +02:00
WARN_ON ( nres > nbytes ) ;
2005-09-09 13:10:35 -07:00
count - = nres ;
res + = nres ;
pos + = nres ;
2020-02-06 16:39:28 +01:00
if ( nres ! = nbytes ) {
iov_iter_revert ( iter , nbytes - nres ) ;
2005-09-09 13:10:35 -07:00
break ;
2020-02-06 16:39:28 +01:00
}
2006-04-11 21:16:51 +02:00
if ( count ) {
2019-09-10 15:04:10 +02:00
max_pages = iov_iter_npages ( iter , fc - > max_pages ) ;
ia = fuse_io_alloc ( io , max_pages ) ;
if ( ! ia )
2006-04-11 21:16:51 +02:00
break ;
}
2005-09-09 13:10:35 -07:00
}
2019-09-10 15:04:10 +02:00
if ( ia )
fuse_io_free ( ia ) ;
2009-04-28 16:56:36 +02:00
if ( res > 0 )
2005-09-09 13:10:35 -07:00
* ppos = pos ;
2016-03-14 21:57:35 -07:00
return res > 0 ? res : err ;
2005-09-09 13:10:35 -07:00
}
2009-04-14 10:54:53 +09:00
EXPORT_SYMBOL_GPL ( fuse_direct_io ) ;
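The control flow of the transfer loop above (split the request into chunks of at most nmax bytes, stop on an error or on a short transfer, report an error only if nothing was transferred) can be sketched without any kernel types. Everything here is a hypothetical userspace model: `send_fn`, `chunked_io()`, `struct capped_sink` and `capped_sink_send()` are invented names, and the literal -28 merely stands in for a negative errno value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend: returns bytes transferred or a negative error. */
typedef long (*send_fn)(void *ctx, long pos, size_t len);

/*
 * Transfer `count` bytes in chunks of at most `nmax` bytes. Returns the
 * number of bytes transferred, or the backend's error if nothing was
 * transferred. *ppos is advanced only when some progress was made.
 */
static long chunked_io(send_fn send, void *ctx, long *ppos,
		       size_t count, size_t nmax)
{
	long res = 0, err = 0, pos = *ppos;

	assert(nmax > 0);
	while (count) {
		size_t nbytes = count < nmax ? count : nmax;
		long nres = send(ctx, pos, nbytes);

		if (nres < 0) {
			err = nres;
			break;
		}
		assert((size_t)nres <= nbytes);
		count -= (size_t)nres;
		res += nres;
		pos += nres;
		if ((size_t)nres != nbytes)
			break;	/* short transfer: stop here */
	}
	if (res > 0)
		*ppos = pos;
	return res > 0 ? res : err;
}

/* Example backend: accepts bytes until its capacity is used up. */
struct capped_sink {
	size_t capacity;
};

static long capped_sink_send(void *ctx, long pos, size_t len)
{
	struct capped_sink *s = ctx;
	size_t n;

	(void)pos;
	if (s->capacity == 0)
		return -28;	/* placeholder for a negative errno */
	n = len < s->capacity ? len : s->capacity;
	s->capacity -= n;
	return (long)n;
}
```

A partial success hides the later error from the caller, which matches the `res > 0 ? res : err` return at the end of the function above.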
2005-09-09 13:10:35 -07:00
2012-12-14 19:20:51 +04:00
static ssize_t __fuse_direct_read ( struct fuse_io_priv * io ,
2014-03-16 15:50:47 -04:00
struct iov_iter * iter ,
loff_t * ppos )
2005-09-09 13:10:35 -07:00
{
2009-04-28 16:56:36 +02:00
ssize_t res ;
2017-09-12 16:57:53 +02:00
struct inode * inode = file_inode ( io - > iocb - > ki_filp ) ;
2009-04-28 16:56:36 +02:00
2014-03-16 15:50:47 -04:00
res = fuse_direct_io ( io , iter , ppos , 0 ) ;
2009-04-28 16:56:36 +02:00
2018-10-15 15:43:06 +02:00
fuse_invalidate_atime ( inode ) ;
2009-04-28 16:56:36 +02:00
return res ;
2005-09-09 13:10:35 -07:00
}
2018-10-27 16:48:48 +00:00
static ssize_t fuse_direct_IO ( struct kiocb * iocb , struct iov_iter * iter ) ;
2015-03-30 22:08:36 -04:00
static ssize_t fuse_direct_read_iter ( struct kiocb * iocb , struct iov_iter * to )
2012-10-26 19:50:15 +04:00
{
2018-10-27 16:48:48 +00:00
ssize_t res ;
if ( ! is_sync_kiocb ( iocb ) & & iocb - > ki_flags & IOCB_DIRECT ) {
res = fuse_direct_IO ( iocb , to ) ;
} else {
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC ( iocb ) ;
res = __fuse_direct_read ( & io , to , & iocb - > ki_pos ) ;
}
return res ;
2012-10-26 19:50:15 +04:00
}
fuse: allow non-extending parallel direct writes on the same file
In general, as of now, in FUSE, direct writes on the same file are
serialized over the inode lock, i.e. we hold the inode lock for the full
duration of the write request. I could not find in the fuse code or git
history a comment which clearly explains why this exclusive lock is taken for
direct writes.
The reasons for acquiring an exclusive lock might include, but are not
limited to:
1) Our guess is some user space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk of introducing any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take, on file size extension, a) an exclusive lock and b) a
shared lock.
One possible issue with i_size extending writes is write error
cases. Some writes might succeed and others might fail for file system
internal reasons, for example ENOSPC. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
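The lock-type decision described above reduces to a small predicate, sketched here in userspace C. The names and the flag value are hypothetical (`MODEL_PARALLEL_DIRECT_WRITES` is not the real FOPEN_PARALLEL_DIRECT_WRITES constant); only the shape of the condition follows the description: exclusive unless the file opted in, the write is not an append, and the write stays within the current file size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value for this model only. */
#define MODEL_PARALLEL_DIRECT_WRITES 0x1u

/* True if writing `count` bytes at `pos` would grow the file. */
static bool write_extends_size(int64_t pos, int64_t count, int64_t i_size)
{
	assert(pos >= 0 && count >= 0);
	return pos + count > i_size;
}

/*
 * Decide whether a direct write must take the exclusive lock. A shared
 * lock is enough only for opted-in, non-append, non-extending writes.
 */
static bool needs_exclusive_lock(unsigned int open_flags, bool append,
				 int64_t pos, int64_t count, int64_t i_size)
{
	return !(open_flags & MODEL_PARALLEL_DIRECT_WRITES) ||
	       append ||
	       write_extends_size(pos, count, i_size);
}
```

Because the first evaluation happens before any lock is held, a caller that picks the shared lock would re-run `write_extends_size()` after acquiring it and upgrade to the exclusive lock if a concurrent truncate changed the answer.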
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-06-17 12:40:27 +05:30
static bool fuse_direct_write_extending_i_size ( struct kiocb * iocb ,
struct iov_iter * iter )
{
struct inode * inode = file_inode ( iocb - > ki_filp ) ;
return iocb - > ki_pos + iov_iter_count ( iter ) > i_size_read ( inode ) ;
}
2015-03-30 22:08:36 -04:00
static ssize_t fuse_direct_write_iter ( struct kiocb * iocb , struct iov_iter * from )
2012-02-17 12:46:25 -05:00
{
2017-09-12 16:57:53 +02:00
struct inode * inode = file_inode ( iocb - > ki_filp ) ;
2022-06-17 12:40:27 +05:30
struct file * file = iocb - > ki_filp ;
struct fuse_file * ff = file - > private_data ;
2017-09-12 16:57:53 +02:00
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC ( iocb ) ;
2015-03-30 22:08:36 -04:00
ssize_t res ;
2022-06-17 12:40:27 +05:30
bool exclusive_lock =
! ( ff - > open_flags & FOPEN_PARALLEL_DIRECT_WRITES ) | |
iocb - > ki_flags & IOCB_APPEND | |
fuse_direct_write_extending_i_size ( iocb , from ) ;
/*
* Take exclusive lock if
* - Parallel direct writes are disabled - a user space decision
* - Parallel direct writes are enabled and i_size is being extended .
* This might not be needed at all , but needs further investigation .
*/
if ( exclusive_lock )
inode_lock ( inode ) ;
else {
inode_lock_shared ( inode ) ;
/* A race with truncate might have come up as the decision for
* the lock type was done without holding the lock , check again .
*/
if ( fuse_direct_write_extending_i_size ( iocb , from ) ) {
inode_unlock_shared ( inode ) ;
inode_lock ( inode ) ;
exclusive_lock = true ;
}
}
2012-02-17 12:46:25 -05:00
2015-04-09 12:55:47 -04:00
res = generic_write_checks ( iocb , from ) ;
2018-10-27 16:48:48 +00:00
if ( res > 0 ) {
if ( ! is_sync_kiocb ( iocb ) & & iocb - > ki_flags & IOCB_DIRECT ) {
res = fuse_direct_IO ( iocb , from ) ;
} else {
res = fuse_direct_io ( & io , from , & iocb - > ki_pos ,
FUSE_DIO_WRITE ) ;
2021-10-22 17:03:02 +02:00
fuse_write_update_attr ( inode , iocb - > ki_pos , res ) ;
2018-10-27 16:48:48 +00:00
}
}
2022-06-17 12:40:27 +05:30
if ( exclusive_lock )
inode_unlock ( inode ) ;
else
inode_unlock_shared ( inode ) ;
2012-02-17 12:46:25 -05:00
return res ;
}
2019-01-24 10:40:17 +01:00
static ssize_t fuse_file_read_iter ( struct kiocb * iocb , struct iov_iter * to )
{
2019-01-24 10:40:17 +01:00
struct file * file = iocb - > ki_filp ;
struct fuse_file * ff = file - > private_data ;
2020-08-19 18:19:51 -04:00
struct inode * inode = file_inode ( file ) ;
2019-01-24 10:40:17 +01:00
2020-12-10 15:33:14 +01:00
if ( fuse_is_bad ( inode ) )
2019-01-24 10:40:17 +01:00
return - EIO ;
2019-01-24 10:40:17 +01:00
2020-08-19 18:19:51 -04:00
if ( FUSE_IS_DAX ( inode ) )
return fuse_dax_read_iter ( iocb , to ) ;
2019-01-24 10:40:17 +01:00
if ( ! ( ff - > open_flags & FOPEN_DIRECT_IO ) )
return fuse_cache_read_iter ( iocb , to ) ;
else
return fuse_direct_read_iter ( iocb , to ) ;
}
static ssize_t fuse_file_write_iter ( struct kiocb * iocb , struct iov_iter * from )
{
2019-01-24 10:40:17 +01:00
struct file * file = iocb - > ki_filp ;
struct fuse_file * ff = file - > private_data ;
2020-08-19 18:19:51 -04:00
struct inode * inode = file_inode ( file ) ;
2019-01-24 10:40:17 +01:00
2020-12-10 15:33:14 +01:00
if ( fuse_is_bad ( inode ) )
2019-01-24 10:40:17 +01:00
return - EIO ;
2019-01-24 10:40:17 +01:00
2020-08-19 18:19:51 -04:00
if ( FUSE_IS_DAX ( inode ) )
return fuse_dax_write_iter ( iocb , from ) ;
2019-01-24 10:40:17 +01:00
if ( ! ( ff - > open_flags & FOPEN_DIRECT_IO ) )
return fuse_cache_write_iter ( iocb , from ) ;
else
return fuse_direct_write_iter ( iocb , from ) ;
}
2019-09-10 15:04:10 +02:00
static void fuse_writepage_free ( struct fuse_writepage_args * wpa )
2005-09-09 13:10:30 -07:00
{
2019-09-10 15:04:10 +02:00
struct fuse_args_pages * ap = & wpa - > ia . ap ;
2013-06-29 21:42:48 +04:00
int i ;
2021-09-01 12:39:02 +02:00
if ( wpa - > bucket )
fuse_sync_bucket_dec ( wpa - > bucket ) ;
2019-09-10 15:04:10 +02:00
for ( i = 0 ; i < ap - > num_pages ; i + + )
__free_page ( ap - > pages [ i ] ) ;
if ( wpa - > ia . ff )
fuse_file_put ( wpa - > ia . ff , false , false ) ;
2013-10-01 16:44:53 +02:00
2019-09-10 15:04:10 +02:00
kfree ( ap - > pages ) ;
kfree ( wpa ) ;
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page can be in use at a
time for one cached page.
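The temporary-page scheme can be modelled in userspace as follows. This is a sketch under stated assumptions, not the fuse implementation: `struct model_page`, `page_dirty()`, `writeback_start()` and `writeback_end()` are hypothetical names, `malloc()` stands in for the page allocation, and "waiting" is modelled by `page_dirty()` returning false so that the caller knows it has to retry later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096  /* assumed page size for this model */

/* Hypothetical model of one cached page and its in-flight temp copy. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	unsigned char *temp;  /* NULL when no write is in flight */
};

/*
 * Dirty the page by filling it with `byte`. Returns false if a previous
 * write is still in flight; the real code would wait instead.
 */
static bool page_dirty(struct model_page *pg, unsigned char byte)
{
	if (pg->temp)
		return false;
	memset(pg->data, byte, MODEL_PAGE_SIZE);
	pg->dirty = true;
	return true;
}

/*
 * Start writeback: snapshot the page into a temp copy and mark the cached
 * page clean right away. Returns 0 on success, -1 if allocation fails.
 */
static int writeback_start(struct model_page *pg)
{
	assert(pg->temp == NULL);
	pg->temp = malloc(MODEL_PAGE_SIZE);
	if (!pg->temp)
		return -1;
	memcpy(pg->temp, pg->data, MODEL_PAGE_SIZE);
	pg->dirty = false;
	return 0;
}

/* The userspace filesystem finished the write: drop the temp copy. */
static void writeback_end(struct model_page *pg)
{
	assert(pg->temp != NULL);
	free(pg->temp);
	pg->temp = NULL;
}
```

From the cached page's point of view the writeback is over as soon as `writeback_start()` returns; only the temp copy stays pinned until the write completes.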
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
}
2020-05-06 17:44:12 +02:00
static void fuse_writepage_finish ( struct fuse_mount * fm ,
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args * wpa )
2008-04-30 00:54:41 -07:00
{
2019-09-10 15:04:10 +02:00
struct fuse_args_pages *ap = &wpa->ia.ap;
struct inode *inode = wpa->inode;
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for any one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
struct fuse_inode *fi = get_fuse_inode(inode);
2015-01-14 10:42:36 +01:00
struct backing_dev_info *bdi = inode_to_bdi(inode);
2013-06-29 21:42:48 +04:00
int i;
2008-04-30 00:54:41 -07:00
2019-09-10 15:04:10 +02:00
for (i = 0; i < ap->num_pages; i++) {
2015-05-22 17:13:27 -04:00
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
2019-09-10 15:04:10 +02:00
dec_node_page_state(ap->pages[i], NR_WRITEBACK_TEMP);
2015-05-22 17:13:27 -04:00
wb_writeout_inc(&bdi->wb);
2013-06-29 21:42:48 +04:00
}
2008-04-30 00:54:41 -07:00
wake_up(&fi->page_waitq);
}
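The per-page loop above undoes accounting that the commit message says is held "for the duration of the actual write". A minimal userspace sketch of that balance follows. It is not kernel code: `struct wb_counters`, `temp_page_queued()` and `request_finished()` are hypothetical names that only model the WB_WRITEBACK and NR_WRITEBACK_TEMP counters as plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel counters; plain integers here. */
struct wb_counters {
	long wb_writeback;      /* models the per-bdi WB_WRITEBACK stat */
	long nr_writeback_temp; /* models NR_WRITEBACK_TEMP */
	long writeout_done;     /* models completions counted by wb_writeout_inc() */
};

/* Assumed to run once per temporary page when a WRITE request is queued. */
static void temp_page_queued(struct wb_counters *c)
{
	c->wb_writeback++;
	c->nr_writeback_temp++;
}

/* Mirrors the loop above: one decrement pair per page in the request. */
static void request_finished(struct wb_counters *c, size_t num_pages)
{
	for (size_t i = 0; i < num_pages; i++) {
		/* A finish without a matching queue would be an accounting bug. */
		assert(c->wb_writeback > 0 && c->nr_writeback_temp > 0);
		c->wb_writeback--;
		c->nr_writeback_temp--;
		c->writeout_done++;
	}
}
```

Queuing three pages and then finishing a three-page request should leave both in-flight counters at zero, which is the invariant the real function relies on before waking waiters on `fi->page_waitq`.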
2018-11-09 13:33:22 +03:00
/* Called under fi->lock, may release and reacquire it */
2020-05-06 17:44:12 +02:00
static void fuse_send_writepage(struct fuse_mount *fm,
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args *wpa, loff_t size)
2018-11-09 13:33:22 +03:00
__releases(fi->lock)
__acquires(fi->lock)
2008-04-30 00:54:41 -07:00
{
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args *aux, *next;
struct fuse_inode *fi = get_fuse_inode(wpa->inode);
struct fuse_write_in *inarg = &wpa->ia.write.in;
struct fuse_args *args = &wpa->ia.ap.args;
__u64 data_size = wpa->ia.ap.num_pages * PAGE_SIZE;
int err;
2008-04-30 00:54:41 -07:00
2019-09-10 15:04:10 +02:00
fi->writectr++;
2013-06-29 21:42:48 +04:00
if (inarg->offset + data_size <= size) {
inarg->size = data_size;
2008-04-30 00:54:41 -07:00
} else if (inarg->offset < size) {
2013-06-29 21:42:48 +04:00
inarg->size = size - inarg->offset;
2008-04-30 00:54:41 -07:00
} else {
/* Got truncated off completely */
goto out_free;
2005-09-09 13:10:30 -07:00
}
2008-04-30 00:54:41 -07:00
2019-09-10 15:04:10 +02:00
args->in_args[1].size = inarg->size;
args->force = true;
args->nocreds = true;
2020-05-06 17:44:12 +02:00
err = fuse_simple_background(fm, args, GFP_ATOMIC);
2019-09-10 15:04:10 +02:00
if (err == -ENOMEM) {
spin_unlock(&fi->lock);
2020-05-06 17:44:12 +02:00
err = fuse_simple_background(fm, args, GFP_NOFS | __GFP_NOFAIL);
2019-09-10 15:04:10 +02:00
spin_lock(&fi->lock);
}
2018-11-09 13:33:22 +03:00
/* Fails on broken connection only */
2019-09-10 15:04:10 +02:00
if (unlikely(err))
2018-11-09 13:33:22 +03:00
goto out_free;
2008-04-30 00:54:41 -07:00
	return;
out_free:
2019-09-10 15:04:10 +02:00
	fi->writectr--;
2020-07-14 14:45:41 +02:00
	rb_erase(&wpa->writepages_entry, &fi->writepages);
2020-05-06 17:44:12 +02:00
	fuse_writepage_finish(fm, wpa);
2018-11-09 13:33:22 +03:00
	spin_unlock(&fi->lock);
2019-01-24 10:40:15 +01:00
	/* After fuse_writepage_finish() aux request list is private */
2019-09-10 15:04:10 +02:00
	for (aux = wpa->next; aux; aux = next) {
		next = aux->next;
		aux->next = NULL;
		fuse_writepage_free(aux);
2019-01-24 10:40:15 +01:00
	}
2019-09-10 15:04:10 +02:00
	fuse_writepage_free(wpa);
2018-11-09 13:33:22 +03:00
	spin_lock(&fi->lock);
2005-09-09 13:10:30 -07:00
}
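The error path above walks the auxiliary request chain by saving each successor before the current node is freed. A minimal userspace sketch of that idiom follows; `struct aux_req`, `aux_chain_alloc()` and `aux_chain_free()` are hypothetical names invented for illustration, and `free()` stands in for `fuse_writepage_free()`. It is not the kernel implementation and ignores locking entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a chained auxiliary write request. */
struct aux_req {
	struct aux_req *next;
};

/* Build a chain of n nodes; returns NULL if n is 0 or allocation fails. */
static struct aux_req *aux_chain_alloc(size_t n)
{
	struct aux_req *head = NULL;

	while (n--) {
		struct aux_req *node = malloc(sizeof(*node));

		if (!node) {
			/* Unwind what was built so far. */
			while (head) {
				struct aux_req *next = head->next;

				free(head);
				head = next;
			}
			return NULL;
		}
		node->next = head;
		head = node;
	}
	return head;
}

/* Free every node, reading ->next before the node itself is released. */
static size_t aux_chain_free(struct aux_req *head)
{
	struct aux_req *aux, *next;
	size_t freed = 0;

	for (aux = head; aux; aux = next) {
		next = aux->next;	/* save successor first */
		aux->next = NULL;	/* detach, as the kernel loop does */
		free(aux);
		freed++;
	}
	return freed;
}
```

Reading `aux->next` after the node has been freed would be a use-after-free, which is why the loop keeps a separate `next` variable.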
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for any one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
/*
 * If fi->writectr is non-negative (no truncate or fsync going on) send
 * all queued writepage requests.
 *
2018-11-09 13:33:22 +03:00
 * Called with fi->lock
2008-04-30 00:54:41 -07:00
*/
void fuse_flush_writepages(struct inode *inode)
2018-11-09 13:33:22 +03:00
__releases(fi->lock)
__acquires(fi->lock)
2005-09-09 13:10:30 -07:00
{
2020-05-06 17:44:12 +02:00
	struct fuse_mount *fm = get_fuse_mount(inode);
2008-04-30 00:54:41 -07:00
	struct fuse_inode *fi = get_fuse_inode(inode);
2019-04-24 17:05:06 +02:00
	loff_t crop = i_size_read(inode);
2019-09-10 15:04:10 +02:00
	struct fuse_writepage_args *wpa;
2008-04-30 00:54:41 -07:00
	while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) {
2019-09-10 15:04:10 +02:00
		wpa = list_entry(fi->queued_writes.next,
				 struct fuse_writepage_args, queue_entry);
		list_del_init(&wpa->queue_entry);
2020-05-06 17:44:12 +02:00
		fuse_send_writepage(fm, wpa, crop);
2008-04-30 00:54:41 -07:00
	}
}
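The loop in fuse_flush_writepages() drains the queue only while the counter is non-negative. A minimal userspace sketch of that gating follows, under stated assumptions: `struct pending_write`, `struct write_queue` and `flush_queue()` are hypothetical names, a plain singly linked list replaces the kernel list, and the counter is bumped once per dispatched write on the assumption that this mirrors the `fi->writectr--` seen in the error path of the send function. No locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queued write; 'sent' records that it was dispatched. */
struct pending_write {
	struct pending_write *next;
	int sent;
};

/* Hypothetical per-inode state: a negative writectr means writes are held back. */
struct write_queue {
	int writectr;
	struct pending_write *head;
};

/* Dispatch queued writes in order until the queue is empty or writes are blocked. */
static int flush_queue(struct write_queue *q)
{
	int sent = 0;

	assert(q != NULL);
	while (q->writectr >= 0 && q->head) {
		struct pending_write *w = q->head;

		q->head = w->next;	/* unlink from the queue */
		w->next = NULL;
		w->sent = 1;		/* stand-in for sending the request */
		q->writectr++;		/* one more write in flight (assumption) */
		sent++;
	}
	return sent;
}
```

With a negative counter the function returns 0 and leaves the queue untouched, so the held-back writes can be flushed later once the counter becomes non-negative again.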
2020-07-14 14:45:41 +02:00
static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
						struct fuse_writepage_args *wpa)
2019-09-19 17:11:20 +03:00
{
	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
	pgoff_t idx_to = idx_from + wpa->ia.ap.num_pages - 1;
	struct rb_node **p = &root->rb_node;
	struct rb_node *parent = NULL;
	WARN_ON(!wpa->ia.ap.num_pages);
	while (*p) {
		struct fuse_writepage_args *curr;
		pgoff_t curr_index;
		parent = *p;
		curr = rb_entry(parent, struct fuse_writepage_args,
				writepages_entry);
		WARN_ON(curr->inode != wpa->inode);
		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
		if (idx_from >= curr_index + curr->ia.ap.num_pages)
			p = &(*p)->rb_right;
		else if (idx_to < curr_index)
			p = &(*p)->rb_left;
		else
2020-07-14 14:45:41 +02:00
			return curr;
2019-09-19 17:11:20 +03:00
	}
	rb_link_node(&wpa->writepages_entry, parent, p);
	rb_insert_color(&wpa->writepages_entry, root);
2020-07-14 14:45:41 +02:00
	return NULL;
}
static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
{
	WARN_ON(fuse_insert_writeback(root, wpa));
2019-09-19 17:11:20 +03:00
}
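The descent in fuse_insert_writeback() is driven by a three-way comparison of page ranges: entirely to the right, entirely to the left, or overlapping. A minimal userspace sketch of just that comparison follows. `struct page_range`, `enum range_order` and `range_compare()` are hypothetical names chosen for illustration; the rbtree itself is left out, and an `assert()` stands in for the `WARN_ON()` on empty requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request covering pages [first, first + count). */
struct page_range {
	unsigned long first;
	unsigned long count;
};

enum range_order { RANGE_LEFT, RANGE_RIGHT, RANGE_OVERLAP };

/*
 * Decide where new_range belongs relative to curr, using the same tests
 * as the tree walk: go right if it starts at or past the end of curr,
 * go left if it ends before curr starts, otherwise the two overlap.
 */
static enum range_order range_compare(const struct page_range *new_range,
				      const struct page_range *curr)
{
	unsigned long idx_from = new_range->first;
	unsigned long idx_to;

	assert(new_range->count > 0 && curr->count > 0);
	idx_to = idx_from + new_range->count - 1;

	if (idx_from >= curr->first + curr->count)
		return RANGE_RIGHT;
	if (idx_to < curr->first)
		return RANGE_LEFT;
	return RANGE_OVERLAP;
}
```

In the kernel function an overlap ends the walk and the conflicting entry is returned to the caller; tree_insert() treats that outcome as a bug and warns.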
2020-05-06 17:44:12 +02:00
static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
2019-09-10 15:04:10 +02:00
			       int error)
2008-04-30 00:54:41 -07:00
{
2019-09-10 15:04:10 +02:00
	struct fuse_writepage_args *wpa =
		container_of(args, typeof(*wpa), ia.ap.args);
	struct inode *inode = wpa->inode;
2008-04-30 00:54:41 -07:00
	struct fuse_inode *fi = get_fuse_inode(inode);
2021-04-06 10:07:06 -04:00
	struct fuse_conn *fc = get_fuse_conn(inode);
2008-04-30 00:54:41 -07:00
2019-09-10 15:04:10 +02:00
	mapping_set_error(inode->i_mapping, error);
2021-04-06 10:07:06 -04:00
	/*
	 * A writeback finished and this might have updated mtime/ctime on
	 * server making local mtime/ctime stale. Hence invalidate attrs.
	 * Do this only if writeback_cache is not enabled. If writeback_cache
	 * is enabled, we trust local ctime/mtime.
	 */
	if (!fc->writeback_cache)
2021-10-22 17:03:02 +02:00
		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
2018-11-09 13:33:22 +03:00
	spin_lock(&fi->lock);
2020-07-14 14:45:41 +02:00
	rb_erase(&wpa->writepages_entry, &fi->writepages);
2019-09-10 15:04:10 +02:00
	while (wpa->next) {
2020-05-06 17:44:12 +02:00
		struct fuse_mount *fm = get_fuse_mount(inode);
2019-09-10 15:04:10 +02:00
		struct fuse_write_in *inarg = &wpa->ia.write.in;
		struct fuse_writepage_args *next = wpa->next;
		wpa->next = next->next;
		next->next = NULL;
		next->ia.ff = fuse_file_get(wpa->ia.ff);
2019-09-19 17:11:20 +03:00
		tree_insert(&fi->writepages, next);
2013-10-02 21:38:32 +04:00
		/*
		 * Skip fuse_flush_writepages() to make it easy to crop requests
		 * based on primary request size.
		 *
		 * 1st case (trivial): there are no concurrent activities using
		 * fuse_set/release_nowrite. Then we're on safe side because
		 * fuse_flush_writepages() would call fuse_send_writepage()
		 * anyway.
		 *
		 * 2nd case: someone called fuse_set_nowrite and it is waiting
		 * now for completion of all in-flight requests. This happens
		 * rarely and no more than once per page, so this should be
		 * okay.
		 *
		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
		 * of fuse_set_nowrite..fuse_release_nowrite section. The fact
		 * that fuse_set_nowrite returned implies that all in-flight
		 * requests were completed along with all of their secondary
		 * requests. Further primary requests are blocked by negative
		 * writectr. Hence there cannot be any in-flight requests and
		 * no invocations of fuse_writepage_end() while we're in
		 * fuse_set_nowrite..fuse_release_nowrite section.
		 */
2020-05-06 17:44:12 +02:00
		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
2013-10-01 16:44:53 +02:00
	}
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for any one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
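The temporary-page design described in the commit message above can be illustrated with a minimal userspace sketch. It is not the kernel implementation: every `sketch_*` name, the fixed page size, and the plain integer counter (standing in for NR_WRITEBACK_TEMP) are assumptions made for illustration. It only shows the ordering the message describes: copy the cached page, release the cached page at once, and keep the copy accounted for until the filesystem replies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a cached page; not a kernel structure. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	bool dirty;
	bool writeback;
};

/* Hypothetical stand-in for a queued WRITE request. */
struct sketch_write_req {
	unsigned char *tmp;	/* private copy handed to the filesystem */
	size_t len;
};

/* Temporary pages in flight, in the spirit of NR_WRITEBACK_TEMP. */
static int sketch_writeback_temp;

/*
 * Copy the cached page into a temporary buffer and release the cached
 * page immediately. Returns 0 on success, -1 if the copy cannot be
 * allocated.
 */
static int sketch_writepage(struct sketch_page *page,
			    struct sketch_write_req *req)
{
	page->writeback = true;
	req->tmp = malloc(SKETCH_PAGE_SIZE);
	if (!req->tmp) {
		page->writeback = false;
		return -1;
	}
	memcpy(req->tmp, page->data, SKETCH_PAGE_SIZE);
	req->len = SKETCH_PAGE_SIZE;
	sketch_writeback_temp++;

	/* From the cache's point of view, writeback is already finished. */
	page->dirty = false;
	page->writeback = false;
	return 0;
}

/* Called when the (possibly slow) filesystem finally answers. */
static void sketch_write_end(struct sketch_write_req *req)
{
	free(req->tmp);
	req->tmp = NULL;
	sketch_writeback_temp--;
}
```

Because the request owns a private copy, later changes to the cached page do not affect the data in flight, and a filesystem that never replies pins only the temporary copy rather than the cached page.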
2008-04-30 00:54:41 -07:00
	fi->writectr--;
2020-05-06 17:44:12 +02:00
	fuse_writepage_finish(fm, wpa);
2018-11-09 13:33:22 +03:00
	spin_unlock(&fi->lock);
2019-09-10 15:04:10 +02:00
	fuse_writepage_free(wpa);
2008-04-30 00:54:41 -07:00
}
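The loop in the function above detaches each secondary request from the primary one and sends it on its own. A minimal userspace sketch of that unlink-and-send pattern follows. The `sketch_*` names and the `sent` flag are hypothetical, and the sketch leaves out the locking, the rb-tree insertion and the file reference that the real code handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request type; only the chaining matters here. */
struct sketch_req {
	struct sketch_req *next;	/* chain of secondary requests */
	int id;
	int sent;
};

static int sketch_send_count;

/* Stand-in for actually submitting a request. */
static void sketch_send(struct sketch_req *req)
{
	req->sent = 1;
	sketch_send_count++;
}

/*
 * Detach every secondary request from the primary and send it as a
 * standalone request. Returns the number of requests detached.
 */
static int sketch_drain_secondaries(struct sketch_req *primary)
{
	int n = 0;

	while (primary->next) {
		struct sketch_req *next = primary->next;

		primary->next = next->next;	/* unlink from the chain */
		next->next = NULL;		/* now a standalone request */
		sketch_send(next);
		n++;
	}
	return n;
}
```

Unlinking before sending means a sent request never points at a sibling, so its completion handler sees an empty chain and does not resend anything.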
2021-09-01 12:39:02 +02:00
static struct fuse_file *__fuse_write_file_get(struct fuse_inode *fi)
2013-06-29 21:42:20 +04:00
{
2021-10-22 17:03:02 +02:00
	struct fuse_file *ff;
2013-06-29 21:42:20 +04:00
2018-11-09 13:33:22 +03:00
	spin_lock(&fi->lock);
2021-10-22 17:03:02 +02:00
	ff = list_first_entry_or_null(&fi->write_files, struct fuse_file,
				      write_entry);
	if (ff)
2013-10-01 16:44:52 +02:00
		fuse_file_get(ff);
2018-11-09 13:33:22 +03:00
	spin_unlock(&fi->lock);
2013-06-29 21:42:20 +04:00
	return ff;
}
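The helper above picks the first open-for-write file and takes a reference on it before dropping the lock, so the file cannot go away between lookup and use. A minimal userspace sketch of the same idea follows. All `sketch_*` names are hypothetical, a C11 `atomic_flag` spinlock stands in for the kernel spinlock, and a plain singly linked list stands in for `write_files`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical file object; refcount is protected by the owner's lock. */
struct sketch_file {
	struct sketch_file *next;
	int refcount;
};

/* Hypothetical inode holding a list of files open for writing. */
struct sketch_inode {
	atomic_flag lock;
	struct sketch_file *write_files;
};

static void sketch_inode_init(struct sketch_inode *in)
{
	atomic_flag_clear(&in->lock);
	in->write_files = NULL;
}

static void sketch_lock(struct sketch_inode *in)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&in->lock,
						 memory_order_acquire))
		;
}

static void sketch_unlock(struct sketch_inode *in)
{
	atomic_flag_clear_explicit(&in->lock, memory_order_release);
}

/*
 * Return the first writable file with an extra reference held, or NULL
 * if the list is empty. The reference is taken while the lock is held.
 */
static struct sketch_file *sketch_write_file_get(struct sketch_inode *in)
{
	struct sketch_file *ff;

	sketch_lock(in);
	ff = in->write_files;
	if (ff)
		ff->refcount++;
	sketch_unlock(in);
	return ff;
}
```

The caller owns the returned reference and is expected to drop it when done, much as `fuse_write_inode()` pairs its lookup with `fuse_file_put()`.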
2021-09-01 12:39:02 +02:00
static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
2014-04-28 14:19:23 +02:00
{
2021-09-01 12:39:02 +02:00
	struct fuse_file *ff = __fuse_write_file_get(fi);
2014-04-28 14:19:23 +02:00
	WARN_ON(!ff);
	return ff;
}
int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_file *ff;
	int err;
2021-10-22 17:03:01 +02:00
	/*
	 * Inode is always written before the last reference is dropped and
	 * hence this should not be reached from reclaim.
	 *
	 * Writing back the inode from reclaim can deadlock if the request
	 * processing itself needs an allocation. Allocations triggering
	 * reclaim while serving a request can't be prevented, because it can
	 * involve any number of unrelated userspace processes.
	 */
	WARN_ON(wbc->for_reclaim);
2021-09-01 12:39:02 +02:00
	ff = __fuse_write_file_get(fi);
2014-04-28 14:19:24 +02:00
	err = fuse_flush_times(inode, ff);
2014-04-28 14:19:23 +02:00
	if (ff)
2018-12-10 10:54:52 -08:00
		fuse_file_put(ff, false, false);
2014-04-28 14:19:23 +02:00
	return err;
}
2019-09-10 15:04:10 +02:00
static struct fuse_writepage_args *fuse_writepage_args_alloc(void)
{
	struct fuse_writepage_args *wpa;
	struct fuse_args_pages *ap;

	wpa = kzalloc(sizeof(*wpa), GFP_NOFS);
	if (wpa) {
		ap = &wpa->ia.ap;
		ap->num_pages = 0;
		ap->pages = fuse_pages_alloc(1, GFP_NOFS, &ap->descs);
		if (!ap->pages) {
			kfree(wpa);
			wpa = NULL;
		}
	}
	return wpa;
}
2021-09-01 12:39:02 +02:00
static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
					 struct fuse_writepage_args *wpa)
{
	if (!fc->sync_fs)
		return;

	rcu_read_lock();
	/* Prevent resurrection of dead bucket in unlikely race with syncfs */
	do {
		wpa->bucket = rcu_dereference(fc->curr_bucket);
	} while (unlikely(!atomic_inc_not_zero(&wpa->bucket->count)));
	rcu_read_unlock();
}
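The retry loop above depends on an "increment unless zero" primitive: a bucket whose count has already dropped to zero must not be revived, so the caller re-reads the current bucket and tries again. The following is a minimal userspace sketch of that primitive using C11 atomics. `sketch_inc_not_zero` is a hypothetical name and is not the kernel's `atomic_inc_not_zero()`; RCU and the bucket lookup itself are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *count only if it is currently non-zero.
 * Returns true if the increment happened, false if the count was zero.
 */
static bool sketch_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller that gets `false` treats the object as dead and looks up a fresh one, which is what the `do { ... } while` loop above does with `fc->curr_bucket`.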
2008-04-30 00:54:41 -07:00
static int fuse_writepage_locked(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
2019-09-10 15:04:10 +02:00
	struct fuse_writepage_args *wpa;
	struct fuse_args_pages *ap;
2008-04-30 00:54:41 -07:00
	struct page *tmp_page;
2013-10-01 16:44:52 +02:00
	int error = -ENOMEM;
2008-04-30 00:54:41 -07:00
	set_page_writeback(page);
2019-09-10 15:04:10 +02:00
	wpa = fuse_writepage_args_alloc();
	if (!wpa)
2008-04-30 00:54:41 -07:00
		goto err;
2019-09-10 15:04:10 +02:00
	ap = &wpa->ia.ap;
2008-04-30 00:54:41 -07:00
	tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
	if (!tmp_page)
		goto err_free;
2013-10-01 16:44:52 +02:00
	error = -EIO;
2021-09-01 12:39:02 +02:00
	wpa->ia.ff = fuse_write_file_get(fi);
2019-09-10 15:04:10 +02:00
	if (!wpa->ia.ff)
2014-07-10 15:32:43 +04:00
		goto err_nofile;
2013-10-01 16:44:52 +02:00
2021-09-01 12:39:02 +02:00
	fuse_writepage_add_to_bucket(fc, wpa);
2019-09-10 15:04:10 +02:00
	fuse_write_args_fill(&wpa->ia, wpa->ia.ff, page_offset(page), 0);
2008-04-30 00:54:41 -07:00
	copy_highpage(tmp_page, page);
2019-09-10 15:04:10 +02:00
	wpa->ia.write.in.write_flags |= FUSE_WRITE_CACHE;
	wpa->next = NULL;
	ap->args.in_pages = true;
	ap->num_pages = 1;
	ap->pages[0] = tmp_page;
	ap->descs[0].offset = 0;
	ap->descs[0].length = PAGE_SIZE;
	ap->args.end = fuse_writepage_end;
	wpa->inode = inode;
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for any one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
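The temporary-page scheme described in the commit message above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel implementation: `struct sketch_page`, `struct sketch_write`, `sketch_writepage()`, `sketch_write_end()` and the `sketch_writeback_temp` counter are all hypothetical names invented here, and `malloc()`/`memcpy()` stand in for the kernel's page allocation and `copy_highpage()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a cached page; not the kernel's struct page. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	bool dirty;
	bool writeback;
};

/* Hypothetical in-flight WRITE request that owns the temporary copy. */
struct sketch_write {
	unsigned char *tmp;
};

/* Plays the role of NR_WRITEBACK_TEMP: temp pages with a write in flight. */
static int sketch_writeback_temp;

/*
 * Copy the cached page into a temporary buffer and hand the copy to the
 * request.  On success the cached page is immediately clean and no longer
 * under writeback, while the counter records that a write is still pending.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated
 * (the cached page is then left untouched).
 */
static int sketch_writepage(struct sketch_page *page, struct sketch_write *req)
{
	unsigned char *tmp = malloc(SKETCH_PAGE_SIZE);

	if (!tmp)
		return -1;

	memcpy(tmp, page->data, SKETCH_PAGE_SIZE);
	req->tmp = tmp;
	sketch_writeback_temp++;

	/* From the cache's point of view, writeback is already finished. */
	page->dirty = false;
	page->writeback = false;
	return 0;
}

/* Called when the (simulated) userspace filesystem completes the WRITE. */
static void sketch_write_end(struct sketch_write *req)
{
	free(req->tmp);
	req->tmp = NULL;
	sketch_writeback_temp--;
}
```

A caller would mark a page dirty, call `sketch_writepage()`, and later call `sketch_write_end()` once the write completes. Between the two calls the cached page can be modified freely without affecting the data held in the temporary copy, which is the property the design relies on.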
2015-05-22 17:13:27 -04:00
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
2016-07-28 15:46:20 -07:00
inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
2008-04-30 00:54:41 -07:00
2018-11-09 13:33:22 +03:00
spin_lock(&fi->lock);
2019-09-19 17:11:20 +03:00
tree_insert(&fi->writepages, wpa);
2019-09-10 15:04:10 +02:00
list_add_tail(&wpa->queue_entry, &fi->queued_writes);
2008-04-30 00:54:41 -07:00
fuse_flush_writepages(inode);
2018-11-09 13:33:22 +03:00
spin_unlock(&fi->lock);
2008-04-30 00:54:41 -07:00
2013-08-12 20:39:30 +04:00
end_page_writeback(page);
2008-04-30 00:54:41 -07:00
return 0;
2014-07-10 15:32:43 +04:00
err_nofile:
__free_page(tmp_page);
2008-04-30 00:54:41 -07:00
err_free:
2019-09-10 15:04:10 +02:00
kfree(wpa);
2008-04-30 00:54:41 -07:00
err:
2017-05-25 06:57:50 -04:00
mapping_set_error(page->mapping, error);
2008-04-30 00:54:41 -07:00
end_page_writeback(page);
2013-10-01 16:44:52 +02:00
return error;
2008-04-30 00:54:41 -07:00
}
static int fuse_writepage(struct page *page, struct writeback_control *wbc)
{
2022-03-22 14:38:58 -07:00
struct fuse_conn *fc = get_fuse_conn(page->mapping->host);
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure, there can only be one temporary page used at a
time for one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
int err ;
2013-10-01 16:44:53 +02:00
if ( fuse_page_is_writeback ( page - > mapping - > host , page - > index ) ) {
/*
* - > writepages ( ) should be called for sync ( ) and friends . We
* should only get here on direct reclaim and then we are
* allowed to skip a page which is already in flight
*/
WARN_ON ( wbc - > sync_mode = = WB_SYNC_ALL ) ;
redirty_page_for_writepage ( wbc , page ) ;
2019-09-13 18:17:11 +03:00
unlock_page ( page ) ;
2013-10-01 16:44:53 +02:00
return 0 ;
}
2022-03-22 14:38:58 -07:00
if ( wbc - > sync_mode = = WB_SYNC_NONE & &
fc - > num_background > = fc - > congestion_threshold )
return AOP_WRITEPAGE_ACTIVATE ;
2008-04-30 00:54:41 -07:00
err = fuse_writepage_locked ( page ) ;
unlock_page ( page ) ;
return err ;
}
2013-06-29 21:45:29 +04:00
struct fuse_fill_wb_data {
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args * wpa ;
2013-06-29 21:45:29 +04:00
struct fuse_file * ff ;
struct inode * inode ;
2013-08-16 15:51:41 +04:00
struct page * * orig_pages ;
2019-09-10 15:04:10 +02:00
unsigned int max_pages ;
2013-06-29 21:45:29 +04:00
} ;
2019-09-10 15:04:10 +02:00
static bool fuse_pages_realloc ( struct fuse_fill_wb_data * data )
{
struct fuse_args_pages * ap = & data - > wpa - > ia . ap ;
struct fuse_conn * fc = get_fuse_conn ( data - > inode ) ;
struct page * * pages ;
struct fuse_page_desc * descs ;
unsigned int npages = min_t ( unsigned int ,
max_t ( unsigned int , data - > max_pages * 2 ,
FUSE_DEFAULT_MAX_PAGES_PER_REQ ) ,
fc - > max_pages ) ;
WARN_ON ( npages < = data - > max_pages ) ;
pages = fuse_pages_alloc ( npages , GFP_NOFS , & descs ) ;
if ( ! pages )
return false ;
memcpy ( pages , ap - > pages , sizeof ( struct page * ) * ap - > num_pages ) ;
memcpy ( descs , ap - > descs , sizeof ( struct fuse_page_desc ) * ap - > num_pages ) ;
kfree ( ap - > pages ) ;
ap - > pages = pages ;
ap - > descs = descs ;
data - > max_pages = npages ;
return true ;
}
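The growth policy used by fuse_pages_realloc() above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `sketch_next_capacity` and `SKETCH_DEFAULT_MAX_PAGES` are hypothetical names, and the value 32 is an assumed stand-in for FUSE_DEFAULT_MAX_PAGES_PER_REQ that should be checked against fuse_i.h in the tree at hand.

```c
#include <assert.h>

/* Assumed stand-in for FUSE_DEFAULT_MAX_PAGES_PER_REQ; verify against fuse_i.h. */
#define SKETCH_DEFAULT_MAX_PAGES 32u

/*
 * Next capacity for the page array: double the current size, raise the
 * result to at least the default, then clamp it to the per-connection limit.
 */
unsigned int sketch_next_capacity(unsigned int cur, unsigned int conn_max)
{
	unsigned int want = cur * 2;

	assert(conn_max > 0);
	if (want < SKETCH_DEFAULT_MAX_PAGES)
		want = SKETCH_DEFAULT_MAX_PAGES;
	return want < conn_max ? want : conn_max;
}
```

Starting from a capacity of 1 with a connection limit of 256, repeated calls give 32, 64, 128 and then 256. The kernel function additionally warns if the new size does not exceed the old one; it appears to rely on the caller having already checked that the request is not at the connection limit.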
2013-06-29 21:45:29 +04:00
static void fuse_writepages_send ( struct fuse_fill_wb_data * data )
{
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args * wpa = data - > wpa ;
2013-06-29 21:45:29 +04:00
struct inode * inode = data - > inode ;
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2019-09-10 15:04:10 +02:00
int num_pages = wpa - > ia . ap . num_pages ;
2013-08-16 15:51:41 +04:00
int i ;
2013-06-29 21:45:29 +04:00
2019-09-10 15:04:10 +02:00
wpa - > ia . ff = fuse_file_get ( data - > ff ) ;
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2019-09-10 15:04:10 +02:00
list_add_tail ( & wpa - > queue_entry , & fi - > queued_writes ) ;
2013-06-29 21:45:29 +04:00
fuse_flush_writepages ( inode ) ;
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2013-08-16 15:51:41 +04:00
for ( i = 0 ; i < num_pages ; i + + )
end_page_writeback ( data - > orig_pages [ i ] ) ;
2013-06-29 21:45:29 +04:00
}
2019-01-16 10:27:59 +01:00
/*
2020-07-14 14:45:41 +02:00
* Check under fi - > lock if the page is under writeback , and insert it onto the
* rb_tree if not . Otherwise iterate auxiliary write requests , to see if there ' s
2019-01-16 10:27:59 +01:00
* one already added for a page at this offset . If there ' s none , then insert
* this new request onto the auxiliary list , otherwise reuse the existing one by
2020-07-14 14:45:41 +02:00
* swapping the new temp page with the old one .
2019-01-16 10:27:59 +01:00
*/
2020-07-14 14:45:41 +02:00
static bool fuse_writepage_add ( struct fuse_writepage_args * new_wpa ,
struct page * page )
2013-10-01 16:44:53 +02:00
{
2019-09-10 15:04:10 +02:00
struct fuse_inode * fi = get_fuse_inode ( new_wpa - > inode ) ;
struct fuse_writepage_args * tmp ;
struct fuse_writepage_args * old_wpa ;
struct fuse_args_pages * new_ap = & new_wpa - > ia . ap ;
2013-10-01 16:44:53 +02:00
2019-09-10 15:04:10 +02:00
WARN_ON ( new_ap - > num_pages ! = 0 ) ;
2020-07-14 14:45:41 +02:00
new_ap - > num_pages = 1 ;
2013-10-01 16:44:53 +02:00
2018-11-09 13:33:22 +03:00
spin_lock ( & fi - > lock ) ;
2020-07-14 14:45:41 +02:00
old_wpa = fuse_insert_writeback ( & fi - > writepages , new_wpa ) ;
2019-09-10 15:04:10 +02:00
if ( ! old_wpa ) {
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2020-07-14 14:45:41 +02:00
return true ;
2013-10-02 15:01:07 +04:00
}
2013-10-01 16:44:53 +02:00
2019-09-10 15:04:10 +02:00
for ( tmp = old_wpa - > next ; tmp ; tmp = tmp - > next ) {
2019-01-16 10:27:59 +01:00
pgoff_t curr_index ;
2019-09-10 15:04:10 +02:00
WARN_ON ( tmp - > inode ! = new_wpa - > inode ) ;
curr_index = tmp - > ia . write . in . offset > > PAGE_SHIFT ;
2019-01-16 10:27:59 +01:00
if ( curr_index = = page - > index ) {
2019-09-10 15:04:10 +02:00
WARN_ON ( tmp - > ia . ap . num_pages ! = 1 ) ;
swap ( tmp - > ia . ap . pages [ 0 ] , new_ap - > pages [ 0 ] ) ;
2019-01-16 10:27:59 +01:00
break ;
2013-10-01 16:44:53 +02:00
}
}
2019-01-16 10:27:59 +01:00
if ( ! tmp ) {
2019-09-10 15:04:10 +02:00
new_wpa - > next = old_wpa - > next ;
old_wpa - > next = new_wpa ;
2019-01-16 10:27:59 +01:00
}
2013-10-02 21:38:43 +04:00
2018-11-09 13:33:22 +03:00
spin_unlock ( & fi - > lock ) ;
2019-01-16 10:27:59 +01:00
if ( tmp ) {
2019-09-10 15:04:10 +02:00
struct backing_dev_info * bdi = inode_to_bdi ( new_wpa - > inode ) ;
2013-10-01 16:44:53 +02:00
2015-05-22 17:13:27 -04:00
dec_wb_stat ( & bdi - > wb , WB_WRITEBACK ) ;
2019-09-10 15:04:10 +02:00
dec_node_page_state ( new_ap - > pages [ 0 ] , NR_WRITEBACK_TEMP ) ;
2015-05-22 17:13:27 -04:00
wb_writeout_inc ( & bdi - > wb ) ;
2019-09-10 15:04:10 +02:00
fuse_writepage_free ( new_wpa ) ;
2013-10-01 16:44:53 +02:00
}
2019-01-16 10:27:59 +01:00
2020-07-14 14:45:41 +02:00
return false ;
2013-10-01 16:44:53 +02:00
}
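The auxiliary-list handling in fuse_writepage_add() above (reuse an existing queued request for the same page offset by swapping pages, otherwise link the new request behind the in-flight one) can be modelled with a simple singly linked list. This is a hedged userspace sketch: `struct sketch_req`, its `payload` field (standing in for the temporary page) and `sketch_aux_add` are hypothetical, locking and accounting are left out, and the return value here means "linked into the list", which is not the same as the kernel function's return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued write request covering one page. */
struct sketch_req {
	unsigned long index;     /* page index the request covers */
	int payload;             /* stands in for the temporary page */
	struct sketch_req *next; /* auxiliary requests behind the in-flight one */
};

/*
 * head is the request already in flight for this page range. If an auxiliary
 * request for the same index is already queued, swap payloads so the queued
 * request carries the newest data and return false (the caller frees new_req).
 * Otherwise link new_req directly behind head and return true.
 */
bool sketch_aux_add(struct sketch_req *head, struct sketch_req *new_req)
{
	struct sketch_req *tmp;

	for (tmp = head->next; tmp; tmp = tmp->next) {
		if (tmp->index == new_req->index) {
			int old = tmp->payload;

			tmp->payload = new_req->payload;
			new_req->payload = old;
			return false;
		}
	}
	new_req->next = head->next;
	head->next = new_req;
	return true;
}
```

Swapping rather than appending keeps at most one pending auxiliary request per page offset, so the list cannot grow without bound while userspace is slow to reply.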
2020-07-14 14:45:41 +02:00
static bool fuse_writepage_need_send ( struct fuse_conn * fc , struct page * page ,
struct fuse_args_pages * ap ,
struct fuse_fill_wb_data * data )
{
WARN_ON ( ! ap - > num_pages ) ;
/*
* Being under writeback is unlikely but possible . For example direct
* read to an mmaped fuse file will set the page dirty twice ; once when
* the pages are faulted with get_user_pages ( ) , and then after the read
* completed .
*/
if ( fuse_page_is_writeback ( data - > inode , page - > index ) )
return true ;
/* Reached max pages */
if ( ap - > num_pages = = fc - > max_pages )
return true ;
/* Reached max write bytes */
if ( ( ap - > num_pages + 1 ) * PAGE_SIZE > fc - > max_write )
return true ;
/* Discontinuity */
if ( data - > orig_pages [ ap - > num_pages - 1 ] - > index + 1 ! = page - > index )
return true ;
/* Need to grow the pages array? If so, did the expansion fail? */
if ( ap - > num_pages = = data - > max_pages & & ! fuse_pages_realloc ( data ) )
return true ;
return false ;
}
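The batching decision in fuse_writepage_need_send() above can be illustrated with a small userspace predicate. It is a simplified sketch, not the kernel logic: `struct sketch_batch`, `sketch_need_send` and the 4096-byte `SKETCH_PAGE_SIZE` are assumptions made for the example, and the writeback check and the array-growth step are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the example */

/* Hypothetical summary of the request being assembled. */
struct sketch_batch {
	unsigned int num_pages;   /* pages already in the batch, must be > 0 */
	unsigned long last_index; /* page index of the last page added */
	unsigned int max_pages;   /* per-connection page limit */
	unsigned int max_write;   /* per-connection byte limit */
};

/* Returns true if the batch must be sent before next_index can be added. */
bool sketch_need_send(const struct sketch_batch *b, unsigned long next_index)
{
	assert(b->num_pages > 0); /* the kernel code only warns here */

	/* Reached max pages */
	if (b->num_pages == b->max_pages)
		return true;
	/* Adding one more page would exceed max write bytes */
	if ((size_t)(b->num_pages + 1) * SKETCH_PAGE_SIZE > b->max_write)
		return true;
	/* Discontinuity: the next page does not directly follow the last one */
	if (b->last_index + 1 != next_index)
		return true;
	return false;
}
```

For example, a batch of two pages ending at index 9 accepts page 10 but has to be flushed before page 12 can be queued.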
2023-01-26 20:12:54 +00:00
static int fuse_writepages_fill ( struct folio * folio ,
2013-06-29 21:45:29 +04:00
struct writeback_control * wbc , void * _data )
{
struct fuse_fill_wb_data * data = _data ;
2019-09-10 15:04:10 +02:00
struct fuse_writepage_args * wpa = data - > wpa ;
struct fuse_args_pages * ap = & wpa - > ia . ap ;
2013-06-29 21:45:29 +04:00
struct inode * inode = data - > inode ;
2018-11-09 13:33:22 +03:00
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2013-06-29 21:45:29 +04:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
struct page * tmp_page ;
int err ;
if ( ! data - > ff ) {
err = - EIO ;
2021-09-01 12:39:02 +02:00
data - > ff = fuse_write_file_get ( fi ) ;
2013-06-29 21:45:29 +04:00
if ( ! data - > ff )
goto out_unlock ;
}
2023-01-26 20:12:54 +00:00
if ( wpa & & fuse_writepage_need_send ( fc , & folio - > page , ap , data ) ) {
2013-10-01 16:44:53 +02:00
fuse_writepages_send ( data ) ;
2019-09-10 15:04:10 +02:00
data - > wpa = NULL ;
2013-06-29 21:45:29 +04:00
}
2018-10-01 10:07:06 +02:00
2013-06-29 21:45:29 +04:00
err = - ENOMEM ;
tmp_page = alloc_page ( GFP_NOFS | __GFP_HIGHMEM ) ;
if ( ! tmp_page )
goto out_unlock ;
/*
* The page must not be redirtied until the writeout is completed
* ( i . e . userspace has sent a reply to the write request ) . Otherwise
* there could be more than one temporary page instance for each real
* page .
*
* This is ensured by holding the page lock in page_mkwrite ( ) while
* checking fuse_page_is_writeback ( ) . We already hold the page lock
* since clear_page_dirty_for_io ( ) and keep it held until we add the
2019-09-10 15:04:10 +02:00
* request to the fi - > writepages list and increment ap - > num_pages .
2013-06-29 21:45:29 +04:00
* After this fuse_page_is_writeback ( ) will indicate that the page is
* under writeback , so we can release the page lock .
*/
2019-09-10 15:04:10 +02:00
if ( data - > wpa = = NULL ) {
2013-06-29 21:45:29 +04:00
err = - ENOMEM ;
2019-09-10 15:04:10 +02:00
wpa = fuse_writepage_args_alloc ( ) ;
if ( ! wpa ) {
2013-06-29 21:45:29 +04:00
__free_page ( tmp_page ) ;
goto out_unlock ;
}
2021-09-01 12:39:02 +02:00
fuse_writepage_add_to_bucket ( fc , wpa ) ;
2019-09-10 15:04:10 +02:00
data - > max_pages = 1 ;
2013-06-29 21:45:29 +04:00
2019-09-10 15:04:10 +02:00
ap = & wpa - > ia . ap ;
2023-01-26 20:12:54 +00:00
fuse_write_args_fill ( & wpa - > ia , data - > ff , folio_pos ( folio ) , 0 ) ;
2019-09-10 15:04:10 +02:00
wpa - > ia . write . in . write_flags | = FUSE_WRITE_CACHE ;
wpa - > next = NULL ;
ap - > args . in_pages = true ;
ap - > args . end = fuse_writepage_end ;
ap - > num_pages = 0 ;
wpa - > inode = inode ;
2013-06-29 21:45:29 +04:00
}
2023-01-26 20:12:54 +00:00
folio_start_writeback ( folio ) ;
2013-06-29 21:45:29 +04:00
2023-01-26 20:12:54 +00:00
copy_highpage ( tmp_page , & folio - > page ) ;
2019-09-10 15:04:10 +02:00
ap - > pages [ ap - > num_pages ] = tmp_page ;
ap - > descs [ ap - > num_pages ] . offset = 0 ;
ap - > descs [ ap - > num_pages ] . length = PAGE_SIZE ;
2023-01-26 20:12:54 +00:00
data - > orig_pages [ ap - > num_pages ] = & folio - > page ;
2013-06-29 21:45:29 +04:00
2015-05-22 17:13:27 -04:00
inc_wb_stat ( & inode_to_bdi ( inode ) - > wb , WB_WRITEBACK ) ;
2016-07-28 15:46:20 -07:00
inc_node_page_state ( tmp_page , NR_WRITEBACK_TEMP ) ;
2013-10-01 16:44:53 +02:00
err = 0 ;
2020-07-14 14:45:41 +02:00
if ( data - > wpa ) {
/*
* Protected by fi - > lock against concurrent access by
* fuse_page_is_writeback ( ) .
*/
spin_lock ( & fi - > lock ) ;
ap - > num_pages + + ;
spin_unlock ( & fi - > lock ) ;
2023-01-26 20:12:54 +00:00
} else if ( fuse_writepage_add ( wpa , & folio - > page ) ) {
2020-07-14 14:45:41 +02:00
data - > wpa = wpa ;
} else {
2023-01-26 20:12:54 +00:00
folio_end_writeback ( folio ) ;
2013-10-01 16:44:53 +02:00
}
2013-06-29 21:45:29 +04:00
out_unlock :
2023-01-26 20:12:54 +00:00
folio_unlock ( folio ) ;
2013-06-29 21:45:29 +04:00
return err ;
}
static int fuse_writepages ( struct address_space * mapping ,
struct writeback_control * wbc )
{
struct inode * inode = mapping - > host ;
fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided later.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 15:37:06 +03:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2013-06-29 21:45:29 +04:00
struct fuse_fill_wb_data data ;
int err ;
err = - EIO ;
2020-12-10 15:33:14 +01:00
if ( fuse_is_bad ( inode ) )
2013-06-29 21:45:29 +04:00
goto out ;
2022-03-22 14:38:58 -07:00
if ( wbc - > sync_mode = = WB_SYNC_NONE & &
fc - > num_background > = fc - > congestion_threshold )
return 0 ;
2013-06-29 21:45:29 +04:00
data . inode = inode ;
2019-09-10 15:04:10 +02:00
data . wpa = NULL ;
2013-06-29 21:45:29 +04:00
data . ff = NULL ;
2013-08-16 15:51:41 +04:00
err = - ENOMEM ;
2018-09-06 15:37:06 +03:00
data . orig_pages = kcalloc ( fc - > max_pages ,
2014-06-23 18:35:15 +02:00
sizeof ( struct page * ) ,
2013-08-16 15:51:41 +04:00
GFP_NOFS ) ;
if ( ! data . orig_pages )
goto out ;
2013-06-29 21:45:29 +04:00
err = write_cache_pages ( mapping , wbc , fuse_writepages_fill , & data ) ;
2019-09-10 15:04:10 +02:00
if ( data . wpa ) {
WARN_ON ( ! data . wpa - > ia . ap . num_pages ) ;
2013-06-29 21:45:29 +04:00
fuse_writepages_send ( & data ) ;
}
if ( data . ff )
2018-12-10 10:54:52 -08:00
fuse_file_put ( data . ff , false , false ) ;
2013-08-16 15:51:41 +04:00
kfree ( data . orig_pages ) ;
2013-06-29 21:45:29 +04:00
out :
return err ;
}
2013-10-10 17:11:43 +04:00
/*
* It is worth making sure that space is reserved on disk for the write ,
* but how to implement that without killing performance needs more thought .
*/
static int fuse_write_begin ( struct file * file , struct address_space * mapping ,
2022-02-22 14:31:43 -05:00
loff_t pos , unsigned len , struct page * * pagep , void * * fsdata )
2013-10-10 17:11:43 +04:00
{
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
pgoff_t index = pos > > PAGE_SHIFT ;
2014-10-21 20:11:25 -04:00
struct fuse_conn * fc = get_fuse_conn ( file_inode ( file ) ) ;
2013-10-10 17:11:43 +04:00
struct page * page ;
loff_t fsize ;
int err = - ENOMEM ;
WARN_ON ( ! fc - > writeback_cache ) ;
2022-02-22 11:25:12 -05:00
page = grab_cache_page_write_begin ( mapping , index ) ;
2013-10-10 17:11:43 +04:00
if ( ! page )
goto error ;
fuse_wait_on_page_writeback ( mapping - > host , page - > index ) ;
2016-04-01 15:29:47 +03:00
if ( PageUptodate ( page ) | | len = = PAGE_SIZE )
2013-10-10 17:11:43 +04:00
goto success ;
/*
* Check if the start of this page comes after the end of file , in which
* case the readpage can be optimized away .
*/
fsize = i_size_read ( mapping - > host ) ;
2016-04-01 15:29:47 +03:00
if ( fsize < = ( pos & PAGE_MASK ) ) {
size_t off = pos & ~ PAGE_MASK ;
2013-10-10 17:11:43 +04:00
if ( off )
zero_user_segment ( page , 0 , off ) ;
goto success ;
}
err = fuse_do_readpage ( file , page ) ;
if ( err )
goto cleanup ;
success :
* pagep = page ;
return 0 ;
cleanup :
unlock_page ( page ) ;
2016-04-01 15:29:47 +03:00
put_page ( page ) ;
2013-10-10 17:11:43 +04:00
error :
return err ;
}
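The offset arithmetic in fuse_write_begin() above (page index from `pos`, the start of the page, the offset inside it, and the check that lets the read be skipped) can be tried out in userspace. The sketch below assumes 4 KiB pages; the `sketch_*` names and `SKETCH_PAGE_*` macros are hypothetical and only mirror the kernel's PAGE_SHIFT, PAGE_SIZE and PAGE_MASK.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Index of the page that contains byte offset pos. */
uint64_t sketch_page_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* Byte offset at which that page starts. */
uint64_t sketch_page_start(uint64_t pos)
{
	return pos & SKETCH_PAGE_MASK;
}

/* Offset of pos inside its page. */
uint64_t sketch_offset_in_page(uint64_t pos)
{
	return pos & ~SKETCH_PAGE_MASK;
}

/*
 * Nonzero if the page starts at or beyond the end of the file, in which case
 * there is nothing to read and zeroing the bytes before the write is enough.
 */
int sketch_can_skip_read(uint64_t fsize, uint64_t pos)
{
	return fsize <= sketch_page_start(pos);
}
```

With `pos = 8200`, the page index is 2, the page starts at byte 8192 and the write begins 8 bytes into it. If the file is exactly 8192 bytes long the read can be skipped and only bytes 0 to 7 of the page need zeroing; with a file of 8193 bytes the page has to be read first.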
static int fuse_write_end ( struct file * file , struct address_space * mapping ,
loff_t pos , unsigned len , unsigned copied ,
struct page * page , void * fsdata )
{
struct inode * inode = page - > mapping - > host ;
2016-08-18 09:10:44 +02:00
/* Haven't copied anything? Skip zeroing, size extending, dirtying. */
if ( ! copied )
goto unlock ;
2021-10-22 17:03:02 +02:00
pos + = copied ;
2013-10-10 17:11:43 +04:00
if ( ! PageUptodate ( page ) ) {
/* Zero any unwritten bytes at the end of the page */
2021-10-22 17:03:02 +02:00
size_t endoff = pos & ~ PAGE_MASK ;
2013-10-10 17:11:43 +04:00
if ( endoff )
2016-04-01 15:29:47 +03:00
zero_user_segment ( page , endoff , PAGE_SIZE ) ;
2013-10-10 17:11:43 +04:00
SetPageUptodate ( page ) ;
}
2021-10-22 17:03:02 +02:00
if ( pos > inode - > i_size )
i_size_write ( inode , pos ) ;
2013-10-10 17:11:43 +04:00
set_page_dirty ( page ) ;
2016-08-18 09:10:44 +02:00
unlock :
2013-10-10 17:11:43 +04:00
unlock_page ( page ) ;
2016-04-01 15:29:47 +03:00
put_page ( page ) ;
2013-10-10 17:11:43 +04:00
return copied ;
}
2022-02-09 20:21:56 +00:00
static int fuse_launder_folio ( struct folio * folio )
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page can be in use at a
time for each cached page.
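The design above can be sketched as a toy userspace model. This is not the
fuse implementation: the struct, the counter and every toy_* name are
hypothetical, the page size is assumed to be 4096 bytes, and where the
kernel would sleep waiting for the previous write, the model simply refuses
to dirty the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size for the model */

struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool dirty;		/* models PageDirty */
	unsigned char *tmp;	/* temp copy while a WRITE is in flight */
};

static int toy_writeback_temp;	/* models the NR_WRITEBACK_TEMP counter */

/* Models fuse_writepage(): copy to a temp buffer, then finish at once. */
static int toy_writepage(struct toy_page *pg)
{
	assert(pg->tmp == NULL);	/* one temp page per cached page */
	pg->tmp = malloc(TOY_PAGE_SIZE);
	if (!pg->tmp)
		return -1;
	memcpy(pg->tmp, pg->data, TOY_PAGE_SIZE);
	pg->dirty = false;		/* writeback is "done" for the MM */
	toy_writeback_temp++;		/* the real write is still pending */
	return 0;
}

/* Models the userspace filesystem completing the WRITE request. */
static void toy_write_complete(struct toy_page *pg)
{
	free(pg->tmp);
	pg->tmp = NULL;
	toy_writeback_temp--;
}

/* Models dirtying a page; returns false where the kernel would wait. */
static bool toy_try_dirty(struct toy_page *pg, unsigned char byte)
{
	if (pg->tmp)
		return false;
	pg->data[0] = byte;
	pg->dirty = true;
	return true;
}
```

In this model the cached page is clean again as soon as toy_writepage()
returns, while toy_writeback_temp stays raised until toy_write_complete()
runs, mirroring the split between the MM's view and the actual write.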
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course, in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
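A small userspace sketch of raising the limit, under stated assumptions:
the helper names are hypothetical, the bdi name has to be supplied by the
caller (something like "0:42" is only an example), and the value is treated
as a percentage between 0 and 100. Writing to the real sysfs file needs
appropriate privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/bdi/<bdi>/max_ratio"; returns 0 on success, -1 on error. */
static int bdi_max_ratio_path(char *buf, size_t len, const char *bdi)
{
	int n;

	assert(buf != NULL && bdi != NULL);
	if (strlen(bdi) == 0)
		return -1;
	n = snprintf(buf, len, "/sys/class/bdi/%s/max_ratio", bdi);
	/* snprintf reports the untruncated length, so detect truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a percentage to the given file; returns 0 on success, -1 on error. */
static int write_ratio(const char *path, unsigned int percent)
{
	FILE *f;
	int ok;

	if (percent > 100)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", percent) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would first build the path with bdi_max_ratio_path() and then pass
it to write_ratio(); both steps report failure instead of guessing.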
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
{
int err = 0 ;
2022-02-09 20:21:56 +00:00
if ( folio_clear_dirty_for_io ( folio ) ) {
struct inode * inode = folio - > mapping - > host ;
2020-11-11 17:22:31 +01:00
/* Serialize with pending writeback for the same page */
2022-02-09 20:21:56 +00:00
fuse_wait_on_page_writeback ( inode , folio - > index ) ;
err = fuse_writepage_locked ( & folio - > page ) ;
2008-04-30 00:54:41 -07:00
if ( ! err )
2022-02-09 20:21:56 +00:00
fuse_wait_on_page_writeback ( inode , folio - > index ) ;
2008-04-30 00:54:41 -07:00
}
return err ;
}
/*
2021-10-22 17:03:01 +02:00
* Write back dirty data / metadata now ( there may not be any suitable
* open files later for data )
2008-04-30 00:54:41 -07:00
*/
static void fuse_vma_close ( struct vm_area_struct * vma )
{
2021-10-22 17:03:01 +02:00
int err ;
err = write_inode_now ( vma - > vm_file - > f_mapping - > host , 1 ) ;
mapping_set_error ( vma - > vm_file - > f_mapping , err ) ;
2008-04-30 00:54:41 -07:00
}
/*
* Wait for writeback against this page to complete before allowing it
* to be marked dirty again , and hence written back again , possibly
* before the previous writepage completed .
*
* Block here , instead of in - > writepage ( ) , so that the userspace fs
* can only block processes actually operating on the filesystem .
*
* Otherwise unprivileged userspace fs would be able to block
* unrelated :
*
* - page migration
* - sync ( 2 )
* - try_to_free_pages ( ) with order > PAGE_ALLOC_COSTLY_ORDER
*/
2018-05-12 10:25:37 +05:30
static vm_fault_t fuse_page_mkwrite ( struct vm_fault * vmf )
2008-04-30 00:54:41 -07:00
{
2009-03-31 15:23:21 -07:00
struct page * page = vmf - > page ;
2017-02-24 14:56:41 -08:00
struct inode * inode = file_inode ( vmf - > vma - > vm_file ) ;
2013-10-01 16:44:51 +02:00
2017-02-24 14:56:41 -08:00
file_update_time ( vmf - > vma - > vm_file ) ;
2013-10-01 16:44:51 +02:00
lock_page ( page ) ;
if ( page - > mapping ! = inode - > i_mapping ) {
unlock_page ( page ) ;
return VM_FAULT_NOPAGE ;
}
2008-04-30 00:54:41 -07:00
fuse_wait_on_page_writeback ( inode , page - > index ) ;
2013-10-01 16:44:51 +02:00
return VM_FAULT_LOCKED ;
2008-04-30 00:54:41 -07:00
}
2009-09-27 22:29:37 +04:00
static const struct vm_operations_struct fuse_file_vm_ops = {
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page is in use at a
time for each cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
. close = fuse_vma_close ,
. fault = filemap_fault ,
2014-04-07 15:37:19 -07:00
. map_pages = filemap_map_pages ,
2008-04-30 00:54:41 -07:00
. page_mkwrite = fuse_page_mkwrite ,
} ;
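The page writeback design described in the "fuse: support writable mmap" commit message can be modelled in userspace. The following is a minimal sketch, not kernel code: `struct sketch_page`, `sketch_writepage()`, `sketch_write_done()` and `sketch_dirty()` are hypothetical names, and where the kernel would wait for the previous write to finish, the sketch simply returns an error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* hypothetical page size for this sketch */

/* Hypothetical stand-in for a cached page; not a kernel structure. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	int dirty;           /* models the PageDirty flag */
	unsigned char *temp; /* snapshot owned by the in-flight write, or NULL */
};

/*
 * Start writeback: snapshot the page into a temporary copy, then clear the
 * dirty state immediately. Returns 0 on success, -1 if a write is already
 * in flight or the allocation fails.
 */
static int sketch_writepage(struct sketch_page *pg)
{
	if (pg->temp)
		return -1; /* at most one temporary copy per cached page */
	pg->temp = malloc(SKETCH_PAGE_SIZE);
	if (!pg->temp)
		return -1;
	memcpy(pg->temp, pg->data, SKETCH_PAGE_SIZE);
	pg->dirty = 0; /* writeback is "finished" from the cache's point of view */
	return 0;
}

/* Called when the (possibly slow) userspace write completes. */
static void sketch_write_done(struct sketch_page *pg)
{
	free(pg->temp);
	pg->temp = NULL;
}

/*
 * Dirty the page again. The sketch refuses while a previous write is in
 * flight; the kernel would wait here instead of failing.
 */
static int sketch_dirty(struct sketch_page *pg, unsigned char byte)
{
	if (pg->temp)
		return -1;
	memset(pg->data, byte, SKETCH_PAGE_SIZE);
	pg->dirty = 1;
	return 0;
}
```

The point of the snapshot is that the cached page never stays in a "under writeback" state while an untrusted server holds the data, which is the deadlock-avoidance argument the commit message makes.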
static int fuse_file_mmap ( struct file * file , struct vm_area_struct * vma )
{
2019-01-24 10:40:17 +01:00
struct fuse_file * ff = file - > private_data ;
2020-08-19 18:19:52 -04:00
/* DAX mmap is superior to direct_io mmap */
if ( FUSE_IS_DAX ( file_inode ( file ) ) )
return fuse_dax_mmap ( file , vma ) ;
2019-01-24 10:40:17 +01:00
if ( ff - > open_flags & FOPEN_DIRECT_IO ) {
/* Can't provide the coherency needed for MAP_SHARED */
if ( vma - > vm_flags & VM_MAYSHARE )
return - ENODEV ;
invalidate_inode_pages2 ( file - > f_mapping ) ;
return generic_file_mmap ( file , vma ) ;
}
2013-10-10 17:10:04 +04:00
if ( ( vma - > vm_flags & VM_SHARED ) & & ( vma - > vm_flags & VM_MAYWRITE ) )
fuse_link_write_file ( file ) ;
2008-04-30 00:54:41 -07:00
file_accessed ( file ) ;
vma - > vm_ops = & fuse_file_vm_ops ;
2005-09-09 13:10:30 -07:00
return 0 ;
}
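The order of checks in fuse_file_mmap() can be summarised as a small decision function. This is a sketch under assumptions: the enum and `sketch_mmap_policy()` are hypothetical names, and the three booleans stand in for FUSE_IS_DAX(), FOPEN_DIRECT_IO and VM_MAYSHARE respectively.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical labels for the three mmap paths; not kernel identifiers. */
enum sketch_mmap_path {
	SKETCH_MMAP_DAX,
	SKETCH_MMAP_DIRECT_IO,
	SKETCH_MMAP_CACHED
};

/* Returns the chosen path (>= 0) or a negative errno. */
static int sketch_mmap_policy(bool is_dax, bool direct_io, bool may_share)
{
	/* DAX is checked first, so it wins over direct_io. */
	if (is_dax)
		return SKETCH_MMAP_DAX;
	if (direct_io) {
		/* Shared mappings need coherency that direct_io cannot give. */
		if (may_share)
			return -ENODEV;
		return SKETCH_MMAP_DIRECT_IO;
	}
	return SKETCH_MMAP_CACHED;
}
```

For example, `sketch_mmap_policy(false, true, true)` yields `-ENODEV`, mirroring the rejection of MAP_SHARED on a direct_io file.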
2014-07-02 16:29:19 -05:00
static int convert_fuse_file_lock ( struct fuse_conn * fc ,
const struct fuse_file_lock * ffl ,
2006-06-25 05:48:52 -07:00
struct file_lock * fl )
{
switch ( ffl - > type ) {
case F_UNLCK :
break ;
case F_RDLCK :
case F_WRLCK :
if ( ffl - > start > OFFSET_MAX | | ffl - > end > OFFSET_MAX | |
ffl - > end < ffl - > start )
return - EIO ;
fl - > fl_start = ffl - > start ;
fl - > fl_end = ffl - > end ;
2014-07-02 16:29:19 -05:00
/*
fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
Since commit c69899a17ca4 "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context. The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.
The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK. There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid. So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.
The translation and presentation of fl_pid should handle the following four
cases:
1 - F_GETLK on a remote file with a remote lock:
In this case, the filesystem should determine the l_pid to return here.
Filesystems should indicate that the fl_pid represents a non-local pid
value that should not be translated by returning an fl_pid <= 0.
2 - F_GETLK on a local file with a remote lock:
This should be the l_pid of the lock manager process, and translated.
3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
These should be the translated l_pid of the local locking process.
Fuse was already doing the correct thing by translating the pid into the
caller's namespace. With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.
With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0. This signifies that the pid is remote, and the locks API will
forego translating that pid into the pid namespace of the local calling
process.
Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.
Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million) and pid_t is a signed int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.
If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid. This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-16 10:28:22 -04:00
* Convert pid into init ' s pid namespace . The locks API will
* translate it into the caller ' s pid namespace .
2014-07-02 16:29:19 -05:00
*/
rcu_read_lock ( ) ;
2017-07-16 10:28:22 -04:00
fl - > fl_pid = pid_nr_ns ( find_pid_ns ( ffl - > pid , fc - > pid_ns ) , & init_pid_ns ) ;
2014-07-02 16:29:19 -05:00
rcu_read_unlock ( ) ;
2006-06-25 05:48:52 -07:00
break ;
default :
return - EIO ;
}
fl - > fl_type = ffl - > type ;
return 0 ;
}
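The range check in convert_fuse_file_lock() and the "remote pid is <= 0" convention from the fs/locks commit message can be illustrated in isolation. This is a sketch only: `SKETCH_OFFSET_MAX` is assumed to be the largest signed 64-bit offset, and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the offset limit is the largest signed 64-bit value. */
#define SKETCH_OFFSET_MAX ((uint64_t)INT64_MAX)

/* Reject lock ranges the server reports that the VFS could not represent. */
static int sketch_check_lock_range(uint64_t start, uint64_t end)
{
	if (start > SKETCH_OFFSET_MAX || end > SKETCH_OFFSET_MAX || end < start)
		return -EIO;
	return 0;
}

/* Per the commit message, a pid <= 0 marks a remote, untranslated pid. */
static bool sketch_pid_is_remote(int pid)
{
	return pid <= 0;
}
```

Because the server supplies `start` and `end` as unsigned 64-bit values, the check has to run before they are stored in signed offset fields.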
2014-12-12 09:49:05 +01:00
static void fuse_lk_fill ( struct fuse_args * args , struct file * file ,
2007-10-18 03:07:02 -07:00
const struct file_lock * fl , int opcode , pid_t pid ,
2014-12-12 09:49:05 +01:00
int flock , struct fuse_lk_in * inarg )
2006-06-25 05:48:52 -07:00
{
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2006-06-25 05:48:55 -07:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
2006-06-25 05:48:52 -07:00
struct fuse_file * ff = file - > private_data ;
2014-12-12 09:49:05 +01:00
memset ( inarg , 0 , sizeof ( * inarg ) ) ;
inarg - > fh = ff - > fh ;
inarg - > owner = fuse_lock_owner_id ( fc , fl - > fl_owner ) ;
inarg - > lk . start = fl - > fl_start ;
inarg - > lk . end = fl - > fl_end ;
inarg - > lk . type = fl - > fl_type ;
inarg - > lk . pid = pid ;
2007-10-18 03:07:02 -07:00
if ( flock )
2014-12-12 09:49:05 +01:00
inarg - > lk_flags | = FUSE_LK_FLOCK ;
2019-09-10 15:04:08 +02:00
args - > opcode = opcode ;
args - > nodeid = get_node_id ( inode ) ;
args - > in_numargs = 1 ;
args - > in_args [ 0 ] . size = sizeof ( * inarg ) ;
args - > in_args [ 0 ] . value = inarg ;
2006-06-25 05:48:52 -07:00
}
static int fuse_getlk ( struct file * file , struct file_lock * fl )
{
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = get_fuse_mount ( inode ) ;
2014-12-12 09:49:05 +01:00
FUSE_ARGS ( args ) ;
struct fuse_lk_in inarg ;
2006-06-25 05:48:52 -07:00
struct fuse_lk_out outarg ;
int err ;
2014-12-12 09:49:05 +01:00
fuse_lk_fill ( & args , file , fl , FUSE_GETLK , 0 , 0 , & inarg ) ;
2019-09-10 15:04:08 +02:00
args . out_numargs = 1 ;
args . out_args [ 0 ] . size = sizeof ( outarg ) ;
args . out_args [ 0 ] . value = & outarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2006-06-25 05:48:52 -07:00
if ( ! err )
2020-05-06 17:44:12 +02:00
err = convert_fuse_file_lock ( fm - > fc , & outarg . lk , fl ) ;
2006-06-25 05:48:52 -07:00
return err ;
}
2007-10-18 03:07:02 -07:00
static int fuse_setlk ( struct file * file , struct file_lock * fl , int flock )
2006-06-25 05:48:52 -07:00
{
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = get_fuse_mount ( inode ) ;
2014-12-12 09:49:05 +01:00
FUSE_ARGS ( args ) ;
struct fuse_lk_in inarg ;
2006-06-25 05:48:52 -07:00
int opcode = ( fl - > fl_flags & FL_SLEEP ) ? FUSE_SETLKW : FUSE_SETLK ;
2014-07-02 16:29:19 -05:00
struct pid * pid = fl - > fl_type ! = F_UNLCK ? task_tgid ( current ) : NULL ;
2020-05-06 17:44:12 +02:00
pid_t pid_nr = pid_nr_ns ( pid , fm - > fc - > pid_ns ) ;
2006-06-25 05:48:52 -07:00
int err ;
2011-07-20 20:21:59 -04:00
if ( fl - > fl_lmops & & fl - > fl_lmops - > lm_grant ) {
2008-07-25 01:49:02 -07:00
/* NLM needs asynchronous locks, which we don't support yet */
return - ENOLCK ;
}
2006-06-25 05:48:52 -07:00
/* Unlock on close is handled by the flush method */
2017-04-11 12:50:09 -04:00
if ( ( fl - > fl_flags & FL_CLOSE_POSIX ) = = FL_CLOSE_POSIX )
2006-06-25 05:48:52 -07:00
return 0 ;
2014-07-02 16:29:19 -05:00
fuse_lk_fill ( & args , file , fl , opcode , pid_nr , flock , & inarg ) ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2006-06-25 05:48:52 -07:00
2006-06-25 05:48:54 -07:00
/* locking is restartable */
if ( err = = - EINTR )
err = - ERESTARTSYS ;
2014-12-12 09:49:05 +01:00
2006-06-25 05:48:52 -07:00
return err ;
}
static int fuse_file_lock ( struct file * file , int cmd , struct file_lock * fl )
{
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2006-06-25 05:48:52 -07:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
int err ;
2008-07-25 01:49:02 -07:00
if ( cmd = = F_CANCELLK ) {
err = 0 ;
} else if ( cmd = = F_GETLK ) {
2006-06-25 05:48:52 -07:00
if ( fc - > no_lock ) {
2007-02-21 00:55:18 -05:00
posix_test_lock ( file , fl ) ;
2006-06-25 05:48:52 -07:00
err = 0 ;
} else
err = fuse_getlk ( file , fl ) ;
} else {
if ( fc - > no_lock )
2008-07-25 01:49:02 -07:00
err = posix_lock_file ( file , fl , NULL ) ;
2006-06-25 05:48:52 -07:00
else
2007-10-18 03:07:02 -07:00
err = fuse_setlk ( file , fl , 0 ) ;
2006-06-25 05:48:52 -07:00
}
return err ;
}
2007-10-18 03:07:02 -07:00
static int fuse_file_flock ( struct file * file , int cmd , struct file_lock * fl )
{
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2007-10-18 03:07:02 -07:00
struct fuse_conn * fc = get_fuse_conn ( inode ) ;
int err ;
2011-08-08 16:08:08 +02:00
if ( fc - > no_flock ) {
2015-10-22 13:38:14 -04:00
err = locks_lock_file_wait ( file , fl ) ;
2007-10-18 03:07:02 -07:00
} else {
2011-08-08 16:08:08 +02:00
struct fuse_file * ff = file - > private_data ;
2007-10-18 03:07:02 -07:00
/* emulate flock with POSIX locks */
2011-08-08 16:08:08 +02:00
ff - > flock = true ;
2007-10-18 03:07:02 -07:00
err = fuse_setlk ( file , fl , 1 ) ;
}
return err ;
}
2006-12-06 20:35:51 -08:00
static sector_t fuse_bmap ( struct address_space * mapping , sector_t block )
{
struct inode * inode = mapping - > host ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = get_fuse_mount ( inode ) ;
2014-12-12 09:49:05 +01:00
FUSE_ARGS ( args ) ;
2006-12-06 20:35:51 -08:00
struct fuse_bmap_in inarg ;
struct fuse_bmap_out outarg ;
int err ;
2020-05-06 17:44:12 +02:00
if ( ! inode - > i_sb - > s_bdev | | fm - > fc - > no_bmap )
2006-12-06 20:35:51 -08:00
return 0 ;
memset ( & inarg , 0 , sizeof ( inarg ) ) ;
inarg . block = block ;
inarg . blocksize = inode - > i_sb - > s_blocksize ;
2019-09-10 15:04:08 +02:00
args . opcode = FUSE_BMAP ;
args . nodeid = get_node_id ( inode ) ;
args . in_numargs = 1 ;
args . in_args [ 0 ] . size = sizeof ( inarg ) ;
args . in_args [ 0 ] . value = & inarg ;
args . out_numargs = 1 ;
args . out_args [ 0 ] . size = sizeof ( outarg ) ;
args . out_args [ 0 ] . value = & outarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2006-12-06 20:35:51 -08:00
if ( err = = - ENOSYS )
2020-05-06 17:44:12 +02:00
fm - > fc - > no_bmap = 1 ;
2006-12-06 20:35:51 -08:00
return err ? 0 : outarg . block ;
}
2015-06-30 23:40:22 +05:30
static loff_t fuse_lseek ( struct file * file , loff_t offset , int whence )
{
struct inode * inode = file - > f_mapping - > host ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = get_fuse_mount ( inode ) ;
2015-06-30 23:40:22 +05:30
struct fuse_file * ff = file - > private_data ;
FUSE_ARGS ( args ) ;
struct fuse_lseek_in inarg = {
. fh = ff - > fh ,
. offset = offset ,
. whence = whence
} ;
struct fuse_lseek_out outarg ;
int err ;
2020-05-06 17:44:12 +02:00
if ( fm - > fc - > no_lseek )
2015-06-30 23:40:22 +05:30
goto fallback ;
2019-09-10 15:04:08 +02:00
args . opcode = FUSE_LSEEK ;
args . nodeid = ff - > nodeid ;
args . in_numargs = 1 ;
args . in_args [ 0 ] . size = sizeof ( inarg ) ;
args . in_args [ 0 ] . value = & inarg ;
args . out_numargs = 1 ;
args . out_args [ 0 ] . size = sizeof ( outarg ) ;
args . out_args [ 0 ] . value = & outarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2015-06-30 23:40:22 +05:30
if ( err ) {
if ( err = = - ENOSYS ) {
2020-05-06 17:44:12 +02:00
fm - > fc - > no_lseek = 1 ;
2015-06-30 23:40:22 +05:30
goto fallback ;
}
return err ;
}
return vfs_setpos ( file , outarg . offset , inode - > i_sb - > s_maxbytes ) ;
fallback :
2021-10-22 17:03:03 +02:00
err = fuse_update_attributes ( inode , file , STATX_SIZE ) ;
2015-06-30 23:40:22 +05:30
if ( ! err )
return generic_file_llseek ( file , offset , whence ) ;
else
return err ;
}
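fuse_lseek() follows a pattern that recurs in this file (no_bmap, no_lseek, no_poll): send the request once, remember an -ENOSYS reply in a per-connection flag, and use a local fallback from then on. A minimal userspace sketch of that pattern follows; `struct sketch_conn`, `sketch_send_lseek()` and `sketch_generic_seek()` are hypothetical, and the fake server here never implements the operation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical connection state; only the flag mirrors the kernel's idea. */
struct sketch_conn {
	int no_lseek;     /* set once the server has answered -ENOSYS */
	int server_calls; /* counts requests, for illustration only */
};

/* Hypothetical server round trip; this one always reports "not implemented". */
static int sketch_send_lseek(struct sketch_conn *c, long long *out)
{
	(void)out;
	c->server_calls++;
	return -ENOSYS;
}

/* Hypothetical local fallback: accept the requested offset as-is. */
static long long sketch_generic_seek(long long offset)
{
	return offset;
}

static long long sketch_lseek(struct sketch_conn *c, long long offset)
{
	long long result = 0;
	int err;

	if (!c->no_lseek) {
		err = sketch_send_lseek(c, &result);
		if (!err)
			return result;
		if (err != -ENOSYS)
			return err;   /* real errors are passed through */
		c->no_lseek = 1;      /* never ask this server again */
	}
	return sketch_generic_seek(offset);
}
```

After the first call the flag is set, so later seeks skip the round trip to the server entirely.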
2012-12-17 15:59:39 -08:00
static loff_t fuse_file_llseek ( struct file * file , loff_t offset , int whence )
2008-04-30 00:54:45 -07:00
{
loff_t retval ;
2013-02-27 16:59:05 -05:00
struct inode * inode = file_inode ( file ) ;
2008-04-30 00:54:45 -07:00
2015-06-30 23:40:22 +05:30
switch ( whence ) {
case SEEK_SET :
case SEEK_CUR :
/* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
2012-12-17 15:59:39 -08:00
retval = generic_file_llseek ( file , offset , whence ) ;
2015-06-30 23:40:22 +05:30
break ;
case SEEK_END :
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2021-10-22 17:03:03 +02:00
retval = fuse_update_attributes ( inode , file , STATX_SIZE ) ;
2015-06-30 23:40:22 +05:30
if ( ! retval )
retval = generic_file_llseek ( file , offset , whence ) ;
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2015-06-30 23:40:22 +05:30
break ;
case SEEK_HOLE :
case SEEK_DATA :
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2015-06-30 23:40:22 +05:30
retval = fuse_lseek ( file , offset , whence ) ;
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2015-06-30 23:40:22 +05:30
break ;
default :
retval = - EINVAL ;
}
2011-12-13 11:58:48 +01:00
2008-04-30 00:54:45 -07:00
return retval ;
}
2008-11-26 12:03:55 +01:00
/*
* All files which have been polled are linked to RB tree
* fuse_conn - > polled_files which is indexed by kh . Walk the tree and
* find the matching one .
*/
static struct rb_node * * fuse_find_polled_node ( struct fuse_conn * fc , u64 kh ,
struct rb_node * * parent_out )
{
struct rb_node * * link = & fc - > polled_files . rb_node ;
struct rb_node * last = NULL ;
while ( * link ) {
struct fuse_file * ff ;
last = * link ;
ff = rb_entry ( last , struct fuse_file , polled_node ) ;
if ( kh < ff - > kh )
link = & last - > rb_left ;
else if ( kh > ff - > kh )
link = & last - > rb_right ;
else
return link ;
}
if ( parent_out )
* parent_out = last ;
return link ;
}
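fuse_find_polled_node() returns a pointer to the link (the slot holding the child pointer) rather than the node itself, so the same walk serves both lookup and insertion. The sketch below shows the idea on a plain, unbalanced binary search tree; it is not the kernel rbtree API, and `struct sketch_node`, `sketch_find_link()` and `sketch_insert()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node; the kernel uses struct rb_node instead. */
struct sketch_node {
	uint64_t key;
	struct sketch_node *left, *right;
};

/*
 * Walk the tree and return the slot where 'key' lives or would live.
 * If the key is absent, *link is NULL and *parent_out (when requested)
 * is the node that would become its parent.
 */
static struct sketch_node **sketch_find_link(struct sketch_node **root,
					     uint64_t key,
					     struct sketch_node **parent_out)
{
	struct sketch_node **link = root;
	struct sketch_node *last = NULL;

	while (*link) {
		last = *link;
		if (key < last->key)
			link = &last->left;
		else if (key > last->key)
			link = &last->right;
		else
			return link;
	}
	if (parent_out)
		*parent_out = last;
	return link;
}

/* Insert 'n' unless its key is already present. Returns 1 if inserted. */
static int sketch_insert(struct sketch_node **root, struct sketch_node *n)
{
	struct sketch_node **link = sketch_find_link(root, n->key, NULL);

	if (*link)
		return 0;
	n->left = n->right = NULL;
	*link = n; /* no rebalancing here, unlike rb_insert_color() */
	return 1;
}
```

Returning the slot avoids a second traversal on insert, which is why fuse_register_polled_file() can link the new node directly after the search.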
/*
* The file is about to be polled . Make sure it ' s on the polled_files
* RB tree . Note that files once added to the polled_files tree are
* not removed before the file is released . This is because a file
* polled once is likely to be polled again .
*/
static void fuse_register_polled_file ( struct fuse_conn * fc ,
struct fuse_file * ff )
{
spin_lock ( & fc - > lock ) ;
if ( RB_EMPTY_NODE ( & ff - > polled_node ) ) {
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-03 13:09:38 -07:00
struct rb_node * * link , * parent ;
2008-11-26 12:03:55 +01:00
link = fuse_find_polled_node ( fc , ff - > kh , & parent ) ;
BUG_ON ( * link ) ;
rb_link_node ( & ff - > polled_node , parent , link ) ;
rb_insert_color ( & ff - > polled_node , & fc - > polled_files ) ;
}
spin_unlock ( & fc - > lock ) ;
}
2017-07-03 01:02:18 -04:00
__poll_t fuse_file_poll ( struct file * file , poll_table * wait )
2008-11-26 12:03:55 +01:00
{
struct fuse_file * ff = file - > private_data ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2008-11-26 12:03:55 +01:00
struct fuse_poll_in inarg = { . fh = ff - > fh , . kh = ff - > kh } ;
struct fuse_poll_out outarg ;
2014-12-12 09:49:05 +01:00
FUSE_ARGS ( args ) ;
2008-11-26 12:03:55 +01:00
int err ;
2020-05-06 17:44:12 +02:00
if ( fm - > fc - > no_poll )
2008-11-26 12:03:55 +01:00
return DEFAULT_POLLMASK ;
poll_wait ( file , & ff - > poll_wait , wait ) ;
2017-11-29 19:00:41 -05:00
inarg . events = mangle_poll ( poll_requested_events ( wait ) ) ;
2008-11-26 12:03:55 +01:00
/*
* Ask for notification iff there ' s someone waiting for it .
* The client may ignore the flag and always notify .
*/
if ( waitqueue_active ( & ff - > poll_wait ) ) {
inarg . flags | = FUSE_POLL_SCHEDULE_NOTIFY ;
2020-05-06 17:44:12 +02:00
fuse_register_polled_file ( fm - > fc , ff ) ;
2008-11-26 12:03:55 +01:00
}
2019-09-10 15:04:08 +02:00
args . opcode = FUSE_POLL ;
args . nodeid = ff - > nodeid ;
args . in_numargs = 1 ;
args . in_args [ 0 ] . size = sizeof ( inarg ) ;
args . in_args [ 0 ] . value = & inarg ;
args . out_numargs = 1 ;
args . out_args [ 0 ] . size = sizeof ( outarg ) ;
args . out_args [ 0 ] . value = & outarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2008-11-26 12:03:55 +01:00
if ( ! err )
2017-11-29 19:00:41 -05:00
return demangle_poll ( outarg . revents ) ;
2008-11-26 12:03:55 +01:00
if ( err = = - ENOSYS ) {
2020-05-06 17:44:12 +02:00
fm - > fc - > no_poll = 1 ;
2008-11-26 12:03:55 +01:00
return DEFAULT_POLLMASK ;
}
2018-02-11 14:34:03 -08:00
return EPOLLERR ;
2008-11-26 12:03:55 +01:00
}
2009-04-14 10:54:53 +09:00
EXPORT_SYMBOL_GPL ( fuse_file_poll ) ;
2008-11-26 12:03:55 +01:00
/*
* This is called from fuse_handle_notify ( ) on FUSE_NOTIFY_POLL and
* wakes up the poll waiters .
*/
int fuse_notify_poll_wakeup ( struct fuse_conn * fc ,
struct fuse_notify_poll_wakeup_out * outarg )
{
u64 kh = outarg - > kh ;
struct rb_node * * link ;
spin_lock ( & fc - > lock ) ;
link = fuse_find_polled_node ( fc , kh , NULL ) ;
if ( * link ) {
struct fuse_file * ff ;
ff = rb_entry ( * link , struct fuse_file , polled_node ) ;
wake_up_interruptible_sync ( & ff - > poll_wait ) ;
}
spin_unlock ( & fc - > lock ) ;
return 0 ;
}
2012-12-18 14:05:08 +04:00
static void fuse_do_truncate ( struct file * file )
{
struct inode * inode = file - > f_mapping - > host ;
struct iattr attr ;
attr . ia_valid = ATTR_SIZE ;
attr . ia_size = i_size_read ( inode ) ;
attr . ia_file = file ;
attr . ia_valid | = ATTR_FILE ;
2016-05-26 17:12:41 +02:00
fuse_do_setattr ( file_dentry ( file ) , & attr , file ) ;
2012-12-18 14:05:08 +04:00
}
fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 15:37:06 +03:00
static inline loff_t fuse_round_up ( struct fuse_conn * fc , loff_t off )
2013-05-30 16:41:34 +04:00
{
2018-09-06 15:37:06 +03:00
return round_up ( off , fc - > max_pages < < PAGE_SHIFT ) ;
2013-05-30 16:41:34 +04:00
}
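The rounding done by fuse_round_up() and its use in the short-read path of fuse_direct_IO() can be illustrated in plain userspace C. This is only a sketch under stated assumptions: the names sketch_round_up(), sketch_short_read_len() and SKETCH_PAGE_SHIFT are hypothetical, a 4 KiB page size is assumed, and the caller is assumed to have already rejected reads that start at or beyond i_size (as the kernel code does before reaching this point). Plain division is used instead of the kernel's round_up() macro so the helper works for any positive granule.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Round x up to the next multiple of align (align > 0). */
static inline int64_t sketch_round_up(int64_t x, int64_t align)
{
	return ((x + align - 1) / align) * align;
}

/*
 * Length an async direct read would be clamped to when it runs past
 * EOF: the distance to EOF, rounded up to one full request worth of
 * pages, but never more than the caller asked for.
 * Assumes offset < i_size.
 */
static inline int64_t sketch_short_read_len(int64_t i_size, int64_t offset,
					    int64_t count,
					    unsigned int max_pages)
{
	int64_t granule = (int64_t)max_pages << SKETCH_PAGE_SHIFT;
	int64_t limit;

	if (offset + count <= i_size)
		return count;	/* read stays inside the file */

	limit = sketch_round_up(i_size - offset, granule);
	return count < limit ? count : limit;
}
```

With max_pages set to 32, the granule is 128 KiB, so a 1 MiB read at offset 0 of a 1000-byte file would be clamped to a single 128 KiB request in this model.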
2012-02-17 12:46:25 -05:00
static ssize_t
2016-04-07 08:51:58 -07:00
fuse_direct_IO ( struct kiocb * iocb , struct iov_iter * iter )
2012-02-17 12:46:25 -05:00
{
2015-02-02 14:59:43 +01:00
DECLARE_COMPLETION_ONSTACK ( wait ) ;
2012-02-17 12:46:25 -05:00
ssize_t ret = 0 ;
2013-05-01 14:37:21 +02:00
struct file * file = iocb - > ki_filp ;
struct fuse_file * ff = file - > private_data ;
2012-02-17 12:46:25 -05:00
loff_t pos = 0 ;
2012-12-14 19:21:08 +04:00
struct inode * inode ;
loff_t i_size ;
2020-09-17 17:26:56 -04:00
size_t count = iov_iter_count ( iter ) , shortened = 0 ;
2016-04-07 08:51:58 -07:00
loff_t offset = iocb - > ki_pos ;
2012-12-14 19:20:51 +04:00
struct fuse_io_priv * io ;
2012-02-17 12:46:25 -05:00
pos = offset ;
2012-12-14 19:21:08 +04:00
inode = file - > f_mapping - > host ;
i_size = i_size_read ( inode ) ;
2012-02-17 12:46:25 -05:00
2020-09-17 17:26:56 -04:00
if ( ( iov_iter_rw ( iter ) = = READ ) & & ( offset > = i_size ) )
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
it's still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 14:42:22 +00:00
return 0 ;
2012-12-14 19:21:08 +04:00
io = kmalloc ( sizeof ( struct fuse_io_priv ) , GFP_KERNEL ) ;
2012-12-14 19:20:51 +04:00
if ( ! io )
return - ENOMEM ;
2012-12-14 19:21:08 +04:00
spin_lock_init ( & io - > lock ) ;
2016-03-11 10:35:34 -06:00
kref_init ( & io - > refcnt ) ;
2012-12-14 19:21:08 +04:00
io - > reqs = 1 ;
io - > bytes = - 1 ;
io - > size = 0 ;
io - > offset = offset ;
2015-03-16 04:33:52 -07:00
io - > write = ( iov_iter_rw ( iter ) = = WRITE ) ;
2012-12-14 19:21:08 +04:00
io - > err = 0 ;
/*
* By default , we want to optimize all I / Os with async request
2013-05-01 14:37:21 +02:00
* submission to the client filesystem if supported .
2012-12-14 19:21:08 +04:00
*/
2020-10-19 14:28:30 -07:00
io - > async = ff - > fm - > fc - > async_dio ;
2012-12-14 19:21:08 +04:00
io - > iocb = iocb ;
2016-04-07 17:18:11 +05:30
io - > blocking = is_sync_kiocb ( iocb ) ;
2012-12-14 19:21:08 +04:00
2020-09-17 17:26:56 -04:00
/* optimization for short read */
if ( io - > async & & ! io - > write & & offset + count > i_size ) {
2020-10-19 14:28:30 -07:00
iov_iter_truncate ( iter , fuse_round_up ( ff - > fm - > fc , i_size - offset ) ) ;
2020-09-17 17:26:56 -04:00
shortened = count - iov_iter_count ( iter ) ;
count - = shortened ;
}
2012-12-14 19:21:08 +04:00
/*
2016-04-07 17:18:11 +05:30
* We cannot asynchronously extend the size of a file .
* In such case the aio will behave exactly like sync io .
2012-12-14 19:21:08 +04:00
*/
2020-09-17 17:26:56 -04:00
if ( ( offset + count > i_size ) & & io - > write )
2016-04-07 17:18:11 +05:30
io - > blocking = true ;
2012-02-17 12:46:25 -05:00
2016-04-07 17:18:11 +05:30
if ( io - > async & & io - > blocking ) {
2016-03-11 10:35:34 -06:00
/*
* Additional reference to keep io around after
* calling fuse_aio_complete ( )
*/
kref_get ( & io - > refcnt ) ;
2015-02-02 14:59:43 +01:00
io - > done = & wait ;
2016-03-11 10:35:34 -06:00
}
2015-02-02 14:59:43 +01:00
2015-03-16 04:33:52 -07:00
if ( iov_iter_rw ( iter ) = = WRITE ) {
2015-04-07 15:06:19 -04:00
ret = fuse_direct_io ( io , iter , & pos , FUSE_DIO_WRITE ) ;
2021-10-22 17:03:02 +02:00
fuse_invalidate_attr_mask ( inode , FUSE_STATX_MODSIZE ) ;
2015-03-30 22:15:58 -04:00
} else {
2014-03-16 15:50:47 -04:00
ret = __fuse_direct_read ( io , iter , & pos ) ;
2015-03-30 22:15:58 -04:00
}
2020-09-17 17:26:56 -04:00
iov_iter_reexpand ( iter , iov_iter_count ( iter ) + shortened ) ;
2012-12-14 19:20:51 +04:00
2012-12-14 19:21:08 +04:00
if ( io - > async ) {
2018-11-09 14:51:46 +01:00
bool blocking = io - > blocking ;
2012-12-14 19:21:08 +04:00
fuse_aio_complete ( io , ret < 0 ? ret : 0 , - 1 ) ;
/* we have a non-extending, async request, so return */
2018-11-09 14:51:46 +01:00
if ( ! blocking )
2012-12-14 19:21:08 +04:00
return - EIOCBQUEUED ;
2015-02-02 14:59:43 +01:00
wait_for_completion ( & wait ) ;
ret = fuse_get_res_by_io ( io ) ;
2012-12-14 19:21:08 +04:00
}
2016-03-11 10:35:34 -06:00
kref_put ( & io - > refcnt , fuse_io_release ) ;
2015-02-02 14:59:43 +01:00
2015-03-16 04:33:52 -07:00
if ( iov_iter_rw ( iter ) = = WRITE ) {
2021-10-22 17:03:02 +02:00
fuse_write_update_attr ( inode , pos , ret ) ;
fuse: allow non-extending parallel direct writes on the same file
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request. I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to
1) Our guess is some USER space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.
A possible example of an issue with i_size extending writes are write error
cases. Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPC. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-06-17 12:40:27 +05:30
/* For extending writes we already hold exclusive lock */
2021-10-22 17:03:02 +02:00
if ( ret < 0 & & offset + count > i_size )
2012-12-18 14:05:08 +04:00
fuse_do_truncate ( file ) ;
}
2012-02-17 12:46:25 -05:00
return ret ;
}
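The locking rule described in the FOPEN_PARALLEL_DIRECT_WRITES commit message quoted inside fuse_direct_IO() above can be modelled as a small decision function. This is a userspace sketch, not the kernel's implementation: sketch_dio_write_lock() and the SKETCH_ names are hypothetical, the flag value is redefined locally and is believed to mirror the one in <linux/fuse.h>, and details the commit message does not spell out (such as append-mode writes) are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: mirrors FOPEN_PARALLEL_DIRECT_WRITES from <linux/fuse.h>;
 * redefined here so the sketch builds without kernel headers.
 */
#define SKETCH_FOPEN_PARALLEL_DIRECT_WRITES (1u << 6)

enum sketch_lock_mode {
	SKETCH_LOCK_EXCLUSIVE,
	SKETCH_LOCK_SHARED,
};

/* A write extends the file when it ends beyond the current size. */
static bool sketch_write_extends(int64_t offset, int64_t count, int64_t i_size)
{
	return offset + count > i_size;
}

/*
 * Shared lock only if the server opted in via the open flag AND the
 * write does not extend the file; everything else stays serialized.
 */
static enum sketch_lock_mode sketch_dio_write_lock(uint32_t open_flags,
						   int64_t offset,
						   int64_t count,
						   int64_t i_size)
{
	if (!(open_flags & SKETCH_FOPEN_PARALLEL_DIRECT_WRITES))
		return SKETCH_LOCK_EXCLUSIVE;
	if (sketch_write_extends(offset, count, i_size))
		return SKETCH_LOCK_EXCLUSIVE;
	return SKETCH_LOCK_SHARED;
}
```

Keeping the exclusive lock as the default matches the message's point that servers relying on the inode lock for serialization keep working unless they set the flag in their open/create reply.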
2019-05-28 13:22:50 +02:00
static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
{
2021-11-22 17:05:31 +08:00
	int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
2019-05-28 13:22:50 +02:00
	if (!err)
		fuse_sync_writes(inode);
	return err;
}
2012-11-10 16:55:56 +01:00
static long fuse_file_fallocate ( struct file * file , int mode , loff_t offset ,
loff_t length )
2012-04-22 18:45:24 -07:00
{
struct fuse_file * ff = file - > private_data ;
2014-12-12 10:04:51 +01:00
struct inode * inode = file_inode ( file ) ;
2013-09-13 19:20:16 +04:00
struct fuse_inode * fi = get_fuse_inode ( inode ) ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff - > fm ;
2014-12-12 09:49:05 +01:00
FUSE_ARGS ( args ) ;
2012-04-22 18:45:24 -07:00
struct fuse_fallocate_in inarg = {
. fh = ff - > fh ,
. offset = offset ,
. length = length ,
. mode = mode
} ;
int err ;
2022-11-23 09:10:42 +01:00
bool block_faults = FUSE_IS_DAX ( inode ) & &
( ! ( mode & FALLOC_FL_KEEP_SIZE ) | |
( mode & ( FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE ) ) ) ;
virtiofs: serialize truncate/punch_hole and dax fault path
Currently in fuse we don't seem to have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest later (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am taking this semaphore only in the dax fault path and
not the regular fault path because existing code does not have one. Maybe
existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am focusing
only on the DAX path, which is a new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exclusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-19 18:19:54 -04:00
2021-05-12 17:18:48 +01:00
if ( mode & ~ ( FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_ZERO_RANGE ) )
2014-04-28 14:19:21 +02:00
return - EOPNOTSUPP ;
2020-05-06 17:44:12 +02:00
if ( fm - > fc - > no_fallocate )
2012-04-26 10:56:36 +02:00
return - EOPNOTSUPP ;
2022-11-23 09:10:42 +01:00
inode_lock ( inode ) ;
if ( block_faults ) {
filemap_invalidate_lock ( inode - > i_mapping ) ;
err = fuse_dax_break_layouts ( inode , 0 , 0 ) ;
if ( err )
goto out ;
}
2020-08-19 18:19:54 -04:00
2022-11-23 09:10:42 +01:00
if ( mode & ( FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE ) ) {
loff_t endbyte = offset + length - 1 ;
2019-05-28 13:22:50 +02:00
2022-11-23 09:10:42 +01:00
err = fuse_writeback_range ( inode , offset , endbyte ) ;
if ( err )
goto out ;
2013-05-17 09:30:32 -04:00
}
2019-04-18 04:04:41 +08:00
if ( ! ( mode & FALLOC_FL_KEEP_SIZE ) & &
offset + length > i_size_read ( inode ) ) {
err = inode_newsize_ok ( inode , offset + length ) ;
if ( err )
2019-05-27 11:42:07 +02:00
goto out ;
2019-04-18 04:04:41 +08:00
}
2022-10-28 14:25:20 +02:00
err = file_modified ( file ) ;
if ( err )
goto out ;
2013-09-13 19:20:16 +04:00
if ( ! ( mode & FALLOC_FL_KEEP_SIZE ) )
set_bit ( FUSE_I_SIZE_UNSTABLE , & fi - > state ) ;
2019-09-10 15:04:08 +02:00
args . opcode = FUSE_FALLOCATE ;
args . nodeid = ff - > nodeid ;
args . in_numargs = 1 ;
args . in_args [ 0 ] . size = sizeof ( inarg ) ;
args . in_args [ 0 ] . value = & inarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2012-04-26 10:56:36 +02:00
if ( err = = - ENOSYS ) {
2020-05-06 17:44:12 +02:00
fm - > fc - > no_fallocate = 1 ;
2012-04-26 10:56:36 +02:00
err = - EOPNOTSUPP ;
}
2013-05-17 15:27:34 -04:00
if ( err )
goto out ;
/* we could have extended the file */
2013-12-26 19:51:11 +04:00
if ( ! ( mode & FALLOC_FL_KEEP_SIZE ) ) {
2021-10-22 17:03:03 +02:00
if ( fuse_write_update_attr ( inode , offset + length , length ) )
2014-04-28 14:19:22 +02:00
file_update_time ( file ) ;
2013-12-26 19:51:11 +04:00
}
2013-05-17 15:27:34 -04:00
2021-05-12 17:18:48 +01:00
if ( mode & ( FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE ) )
2013-05-17 15:27:34 -04:00
truncate_pagecache_range ( inode , offset , offset + length - 1 ) ;
2021-10-22 17:03:02 +02:00
fuse_invalidate_attr_mask ( inode , FUSE_STATX_MODSIZE ) ;
2013-05-17 15:27:34 -04:00
2013-05-17 09:30:32 -04:00
out :
2013-09-13 19:20:16 +04:00
if ( ! ( mode & FALLOC_FL_KEEP_SIZE ) )
clear_bit ( FUSE_I_SIZE_UNSTABLE , & fi - > state ) ;
2020-08-19 18:19:54 -04:00
if ( block_faults )
2021-04-21 17:18:39 +02:00
filemap_invalidate_unlock ( inode - > i_mapping ) ;
2020-08-19 18:19:54 -04:00
2022-11-23 09:10:42 +01:00
inode_unlock ( inode ) ;
2013-05-17 09:30:32 -04:00
2021-10-22 17:03:01 +02:00
fuse_flush_time_update ( inode ) ;
2012-04-22 18:45:24 -07:00
return err ;
}
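The two mode checks at the top of fuse_file_fallocate() are plain bit tests and can be tried out in isolation. The sketch below is a userspace model under stated assumptions: the SK_ constants are local copies of the FALLOC_FL_* values from <linux/falloc.h>, and sketch_check_mode() and sketch_block_faults() are hypothetical helper names that do not exist in the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: local copies of the FALLOC_FL_* values in <linux/falloc.h>. */
#define SK_FALLOC_FL_KEEP_SIZE	0x01
#define SK_FALLOC_FL_PUNCH_HOLE	0x02
#define SK_FALLOC_FL_ZERO_RANGE	0x10

/* Reject any mode bit outside the three that the function accepts. */
static int sketch_check_mode(int mode)
{
	if (mode & ~(SK_FALLOC_FL_KEEP_SIZE | SK_FALLOC_FL_PUNCH_HOLE |
		     SK_FALLOC_FL_ZERO_RANGE))
		return -EOPNOTSUPP;
	return 0;
}

/*
 * Page faults only need to be blocked on DAX inodes, and only when the
 * operation may change the file size or remove/zero existing data.
 */
static bool sketch_block_faults(bool is_dax, int mode)
{
	return is_dax &&
	       (!(mode & SK_FALLOC_FL_KEEP_SIZE) ||
		(mode & (SK_FALLOC_FL_PUNCH_HOLE | SK_FALLOC_FL_ZERO_RANGE)));
}
```

In this model a pure preallocation with KEEP_SIZE on a DAX inode does not block faults, while a size-extending preallocation or a hole punch does.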
2019-06-05 08:04:47 -07:00
static ssize_t __fuse_copy_file_range ( struct file * file_in , loff_t pos_in ,
struct file * file_out , loff_t pos_out ,
size_t len , unsigned int flags )
2018-08-21 14:36:31 +02:00
{
struct fuse_file * ff_in = file_in - > private_data ;
struct fuse_file * ff_out = file_out - > private_data ;
2019-05-28 13:22:50 +02:00
struct inode * inode_in = file_inode ( file_in ) ;
2018-08-21 14:36:31 +02:00
struct inode * inode_out = file_inode ( file_out ) ;
struct fuse_inode * fi_out = get_fuse_inode ( inode_out ) ;
2020-05-06 17:44:12 +02:00
struct fuse_mount * fm = ff_in - > fm ;
struct fuse_conn * fc = fm - > fc ;
2018-08-21 14:36:31 +02:00
FUSE_ARGS ( args ) ;
struct fuse_copy_file_range_in inarg = {
. fh_in = ff_in - > fh ,
. off_in = pos_in ,
. nodeid_out = ff_out - > nodeid ,
. fh_out = ff_out - > fh ,
. off_out = pos_out ,
. len = len ,
. flags = flags
} ;
struct fuse_write_out outarg ;
ssize_t err ;
/* mark unstable when write-back is not used, and file_out gets
* extended */
bool is_unstable = ( ! fc - > writeback_cache ) & &
( ( pos_out + len ) > inode_out - > i_size ) ;
if ( fc - > no_copy_file_range )
return - EOPNOTSUPP ;
2019-06-05 08:04:50 -07:00
if ( file_inode ( file_in ) - > i_sb ! = file_inode ( file_out ) - > i_sb )
return - EXDEV ;
2020-05-20 11:39:35 +02:00
inode_lock ( inode_in ) ;
err = fuse_writeback_range ( inode_in , pos_in , pos_in + len - 1 ) ;
inode_unlock ( inode_in ) ;
if ( err )
return err ;
2019-05-28 13:22:50 +02:00
2018-08-21 14:36:31 +02:00
inode_lock ( inode_out ) ;
2019-06-05 08:04:51 -07:00
err = file_modified ( file_out ) ;
if ( err )
goto out ;
2020-05-20 11:39:35 +02:00
	/*
	 * Write out dirty pages in the destination file before sending the COPY
	 * request to userspace.  After the request is completed, truncate off
	 * pages (including partial ones) from the cache that have been copied,
	 * since these contain stale data at that point.
	 *
	 * This should be mostly correct, but if the COPY writes to partial
	 * pages (at the start or end) and the parts not covered by the COPY are
	 * written through a memory map after calling fuse_writeback_range(),
	 * then these partial page modifications will be lost on truncation.
	 *
	 * It is unlikely that someone would rely on such mixed style
	 * modifications.  Yet this does give less guarantees than if the
	 * copying was performed with write(2).
	 *
2021-04-21 17:18:39 +02:00
	 * To fix this a mapping->invalidate_lock could be used to prevent new
2020-05-20 11:39:35 +02:00
	 * faults while the copy is ongoing.
	 */
2020-05-20 11:39:35 +02:00
err = fuse_writeback_range ( inode_out , pos_out , pos_out + len - 1 ) ;
if ( err )
goto out ;
2018-08-21 14:36:31 +02:00
if ( is_unstable )
set_bit ( FUSE_I_SIZE_UNSTABLE , & fi_out - > state ) ;
2019-09-10 15:04:08 +02:00
args . opcode = FUSE_COPY_FILE_RANGE ;
args . nodeid = ff_in - > nodeid ;
args . in_numargs = 1 ;
args . in_args [ 0 ] . size = sizeof ( inarg ) ;
args . in_args [ 0 ] . value = & inarg ;
args . out_numargs = 1 ;
args . out_args [ 0 ] . size = sizeof ( outarg ) ;
args . out_args [ 0 ] . value = & outarg ;
2020-05-06 17:44:12 +02:00
err = fuse_simple_request ( fm , & args ) ;
2018-08-21 14:36:31 +02:00
if ( err = = - ENOSYS ) {
fc - > no_copy_file_range = 1 ;
err = - EOPNOTSUPP ;
}
if ( err )
goto out ;
2020-05-20 11:39:35 +02:00
truncate_inode_pages_range ( inode_out - > i_mapping ,
ALIGN_DOWN ( pos_out , PAGE_SIZE ) ,
ALIGN ( pos_out + outarg . size , PAGE_SIZE ) - 1 ) ;
2021-10-22 17:03:03 +02:00
file_update_time ( file_out ) ;
fuse_write_update_attr ( inode_out , pos_out + outarg . size , outarg . size ) ;
2018-08-21 14:36:31 +02:00
err = outarg . size ;
out :
if ( is_unstable )
clear_bit ( FUSE_I_SIZE_UNSTABLE , & fi_out - > state ) ;
inode_unlock ( inode_out ) ;
2019-06-05 08:04:51 -07:00
file_accessed ( file_in ) ;
2018-08-21 14:36:31 +02:00
2021-10-22 17:03:01 +02:00
fuse_flush_time_update ( inode_out ) ;
2018-08-21 14:36:31 +02:00
return err ;
}
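The page-cache range dropped after a successful COPY (the truncate_inode_pages_range() call in __fuse_copy_file_range() above) widens the copied byte range to whole pages, which is why partial pages at either end are discarded too. A minimal userspace sketch of that arithmetic follows; it assumes 4 KiB pages, and struct sketch_range, sketch_copy_invalidate_range() and SKETCH_PAGE_SIZE are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Inclusive byte range [start, end]. */
struct sketch_range {
	int64_t start;
	int64_t end;
};

/*
 * Widen [pos_out, pos_out + copied) to page boundaries: align the
 * start down and the end up, then subtract one to get the last byte.
 */
static struct sketch_range sketch_copy_invalidate_range(int64_t pos_out,
							int64_t copied)
{
	struct sketch_range r;
	int64_t end = pos_out + copied;

	r.start = pos_out - (pos_out % SKETCH_PAGE_SIZE);
	r.end = ((end + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE) *
		SKETCH_PAGE_SIZE - 1;
	return r;
}
```

For a 100-byte copy landing at offset 5000, the whole second page (bytes 4096 through 8191) is dropped in this model, including the bytes the copy did not touch; that is the case the comment warns about for concurrent mmap writes.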
2019-06-05 08:04:47 -07:00
static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
				    struct file *dst_file, loff_t dst_off,
				    size_t len, unsigned int flags)
{
	ssize_t ret;

	ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
				     len, flags);
2019-06-05 08:04:50 -07:00
	if (ret == -EOPNOTSUPP || ret == -EXDEV)
2019-06-05 08:04:47 -07:00
		ret = generic_copy_file_range(src_file, src_off, dst_file,
					      dst_off, len, flags);
	return ret;
}
2006-03-28 01:56:42 -08:00
static const struct file_operations fuse_file_operations = {
2008-04-30 00:54:45 -07:00
. llseek = fuse_file_llseek ,
2014-04-02 14:47:09 -04:00
. read_iter = fuse_file_read_iter ,
2014-04-03 14:33:23 -04:00
. write_iter = fuse_file_write_iter ,
2005-09-09 13:10:30 -07:00
. mmap = fuse_file_mmap ,
. open = fuse_open ,
. flush = fuse_flush ,
. release = fuse_release ,
. fsync = fuse_fsync ,
2006-06-25 05:48:52 -07:00
. lock = fuse_file_lock ,
2020-08-19 18:19:52 -04:00
. get_unmapped_area = thp_get_unmapped_area ,
2007-10-18 03:07:02 -07:00
. flock = fuse_file_flock ,
2023-05-22 14:50:15 +01:00
. splice_read = filemap_splice_read ,
2019-01-24 10:40:17 +01:00
. splice_write = iter_file_splice_write ,
2008-11-26 12:03:55 +01:00
	.unlocked_ioctl	= fuse_file_ioctl,
	.compat_ioctl	= fuse_file_compat_ioctl,
2008-11-26 12:03:55 +01:00
	.poll		= fuse_file_poll,
2012-04-22 18:45:24 -07:00
	.fallocate	= fuse_file_fallocate,
2019-01-24 10:40:17 +01:00
	.copy_file_range = fuse_copy_file_range,
2005-09-09 13:10:35 -07:00
};
2006-06-28 04:26:44 -07:00
static const struct address_space_operations fuse_file_aops = {
2022-04-29 11:12:16 -04:00
	.read_folio	= fuse_read_folio,
2020-06-01 21:47:31 -07:00
	.readahead	= fuse_readahead,
fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure there can only be one temporary page used at a
time for one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 00:54:41 -07:00
	.writepage	= fuse_writepage,
2013-06-29 21:45:29 +04:00
	.writepages	= fuse_writepages,
2022-02-09 20:21:56 +00:00
	.launder_folio	= fuse_launder_folio,
2022-02-09 20:22:03 +00:00
	.dirty_folio	= filemap_dirty_folio,
2006-12-06 20:35:51 -08:00
	.bmap		= fuse_bmap,
2012-02-17 12:46:25 -05:00
	.direct_IO	= fuse_direct_IO,
2013-10-10 17:11:43 +04:00
	.write_begin	= fuse_write_begin,
	.write_end	= fuse_write_end,
2005-09-09 13:10:30 -07:00
};
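The "fuse: support writable mmap" message attached to the writepage entry describes a temp-page scheme: writeback copies the cached page into a temporary page, the cached page is treated as clean immediately, and re-dirtying waits until the in-flight write completes. The following is a loose userspace model of that state machine, not the kernel implementation. All names (`cached_page`, `model_writepage`, `model_write_done`, `model_dirty`) are hypothetical, and where the real code waits, the model simply reports that it would block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model of one cached page and its optional in-flight copy. */
struct cached_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	unsigned char *temp;	/* in-flight snapshot, NULL when no write is pending */
};

/*
 * Start writeback: snapshot the page into a temporary buffer and mark the
 * cached page clean right away. Returns 0 on success, -1 if a write is
 * already in flight or the allocation fails.
 */
static int model_writepage(struct cached_page *pg)
{
	if (pg->temp)
		return -1;
	pg->temp = malloc(MODEL_PAGE_SIZE);
	if (!pg->temp)
		return -1;
	memcpy(pg->temp, pg->data, MODEL_PAGE_SIZE);
	pg->dirty = false;
	return 0;
}

/* The (modelled) userspace filesystem finished the WRITE request. */
static void model_write_done(struct cached_page *pg)
{
	free(pg->temp);
	pg->temp = NULL;
}

/*
 * Dirty the page. Returns false when a previous write is still in flight;
 * the design described above would wait at this point instead.
 */
static bool model_dirty(struct cached_page *pg, unsigned char byte)
{
	if (pg->temp)
		return false;
	pg->data[0] = byte;
	pg->dirty = true;
	return true;
}
```

Because `model_dirty` refuses to proceed while `temp` is set, at most one temporary copy exists per cached page at any time, which is the invariant the commit message calls out.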
fuse: enable per inode DAX
DAX may be limited in some specific situations. When the number of usable
DAX windows is under the watermark, the reclaim routine will be triggered to
reclaim some DAX windows. This may have a negative impact on
performance, since some processes may need to wait for DAX windows to be
reclaimed and reused. To mitigate the performance degradation, the
overall DAX window needs to be expanded.
However, simply expanding the DAX window may not be a good deal in some
scenarios. To maintain one DAX window chunk (i.e., 2MB in size), a 32KB
(512 * 64 bytes) memory footprint will be consumed for page descriptors
inside the guest, which is greater than the memory footprint if it uses
the guest page cache when DAX is disabled. Thus it is better to disable DAX for
those files smaller than 32KB, to reduce the demand for DAX windows and
thus avoid the unworthy memory overhead.
Per inode DAX feature is introduced to address this issue, by offering a
finer grained control for dax to users, trying to achieve a balance
between performance and memory overhead.
The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether
DAX should be enabled or not for corresponding file. Currently the state
whether DAX is enabled or not for the file is initialized only when
inode is instantiated.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25 15:05:27 +08:00
void fuse_init_file_inode(struct inode *inode, unsigned int flags)
2005-09-09 13:10:30 -07:00
{
2018-10-01 10:07:05 +02:00
	struct fuse_inode *fi = get_fuse_inode(inode);
2005-09-09 13:10:37 -07:00
	inode->i_fop = &fuse_file_operations;
	inode->i_data.a_ops = &fuse_file_aops;
2018-10-01 10:07:05 +02:00
	INIT_LIST_HEAD(&fi->write_files);
	INIT_LIST_HEAD(&fi->queued_writes);
	fi->writectr = 0;
	init_waitqueue_head(&fi->page_waitq);
2019-09-19 17:11:20 +03:00
	fi->writepages = RB_ROOT;
2020-08-19 18:19:51 -04:00
	if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse: enable per inode DAX
DAX may be limited in some specific situations. When the number of usable
DAX windows is under the watermark, the reclaim routine will be triggered to
reclaim some DAX windows. This may have a negative impact on
performance, since some processes may need to wait for DAX windows to be
reclaimed and reused. To mitigate the performance degradation, the
overall DAX window needs to be expanded.
However, simply expanding the DAX window may not be a good deal in some
scenarios. To maintain one DAX window chunk (i.e., 2MB in size), a 32KB
(512 * 64 bytes) memory footprint will be consumed for page descriptors
inside the guest, which is greater than the memory footprint if it uses
the guest page cache when DAX is disabled. Thus it is better to disable DAX for
those files smaller than 32KB, to reduce the demand for DAX windows and
thus avoid the unworthy memory overhead.
Per inode DAX feature is introduced to address this issue, by offering a
finer grained control for dax to users, trying to achieve a balance
between performance and memory overhead.
The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether
DAX should be enabled or not for corresponding file. Currently the state
whether DAX is enabled or not for the file is initialized only when
inode is instantiated.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25 15:05:27 +08:00
		fuse_dax_inode_init(inode, flags);
2005-09-09 13:10:30 -07:00
}
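The "fuse: enable per inode DAX" message quotes a 32KB descriptor cost per 2MB DAX window chunk, computed as 512 * 64 bytes. The arithmetic can be checked with the small helper below. The function name is hypothetical, and the 4096-byte page size is an assumption inferred from the figure of 512 pages per 2MB chunk; the 64-byte descriptor size is the value stated in the message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes of page descriptors needed to map a window of
 * window_bytes, given the page size and the per-page descriptor size.
 */
static size_t dax_descriptor_bytes(size_t window_bytes, size_t page_bytes,
				   size_t desc_bytes)
{
	assert(page_bytes != 0);
	return (window_bytes / page_bytes) * desc_bytes;
}
```

With a 2MB chunk, assumed 4096-byte pages, and 64-byte descriptors, the helper yields 512 pages and 32768 bytes, matching the 32KB threshold below which the message suggests disabling DAX for a file.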