linux

iv/linux

Author	SHA1	Message	Date
Kirill A. Shutemov	7caef26767	truncate: drop 'oldsize' truncate_pagecache() parameter truncate_pagecache() doesn't care about old size since commit `cedabed49b` ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Maxim Patlasov	5a53748568	mm/page-writeback.c: add strictlimit feature The feature prevents mistrusted filesystems (ie: FUSE mounts created by unprivileged users) to grow a large number of dirty pages before throttling. For such filesystems balance_dirty_pages always check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under "freerun", it's not allowed to skip bdi checks. The only use case for now is fuse: it sets bdi max_ratio to 1% by default and system administrators are supposed to expect that this limit won't be exceeded. The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A filesystem may set the flag when it initializes its BDI. The problematic scenario comes from the fact that nobody pays attention to the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse writeback). The implementation of fuse writeback releases original page (by calling end_page_writeback) almost immediately. A fuse request queued for real processing bears a copy of original page. Hence, if userspace fuse daemon doesn't finalize write requests in timely manner, an aggressive mmap writer can pollute virtually all memory by those temporary fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but nobody cares. To make further explanations shorter, let me use "NR_WRITEBACK_TEMP problem" as a shortcut for "a possibility of uncontrolled grow of amount of RAM consumed by temporary pages allocated by kernel fuse to process writeback". The problem was very easy to reproduce. There is a trivial example filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled and mounted it. Then created a huge file on the mount point and run a simple program which mmap-ed the file to a memory region, then wrote a data to the region. An hour later I observed almost all RAM consumed by fuse writeback. Since then some unrelated changes in kernel fuse made it more difficult to reproduce, but it is still possible now. Putting this theoretical happens-in-the-lab thing aside, there is another thing that really hurts real world (FUSE) users. This is write-through page cache policy FUSE currently uses. I.e. handling write(2), kernel fuse populates page cache and flushes user data to the server synchronously. This is excessively suboptimal. Pavel Emelyanov's patches ("writeback cache policy") solve the problem, but they also make resolving NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying a huge file to a fuse mount would result in memory starvation. Miklos, the maintainer of FUSE, believes strictlimit feature the way to go. And eventually putting FUSE topics aside, there is one more use-case for strictlimit feature. Using a slow USB stick (mass storage) in a machine with huge amount of RAM installed is a well-known pain. Let's make simple computations. Assuming 64GB of RAM installed, existing implementation of balance_dirty_pages will start throttling only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but subsequent "umount /media/my-usb-storage/" will take more than two hours if effective throughput of the storage is, to say, 1MB/sec. After inclusion of strictlimit feature, it will be trivial to add a knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand. Manually or via udev rule. May be I'm wrong, but it seems to be quite a natural desire to limit the amount of dirty memory for some devices we are not fully trust (in the sense of sustainable throughput). [akpm@linux-foundation.org: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:58:04 -07:00
Linus Torvalds	16d70e1529	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse bugfixes from Miklos Szeredi: "Just a bunch of bugfixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use list_for_each_entry() for list traversing fuse: readdir: check for slash in names fuse: hotfix truncate_pagecache() issue fuse: invalidate inode attributes on xattr modification fuse: postpone end_page_writeback() in fuse_writepage_locked()	2013-09-09 09:18:23 -07:00
Anand Avati	46ea1562da	fuse: drop dentry on failed revalidate Drop a subtree when we find that it has moved or been delated. This can be done as long as there are no submounts under this location. If the directory was moved and we come across the same directory in a future lookup it will be reconnected by d_materialise_unique(). Signed-off-by: Anand Avati <avati@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:54 -04:00
Miklos Szeredi	e2a6b95236	fuse: clean up return in fuse_dentry_revalidate() On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the error itself from ->d_revalidate() insted of returning zero (invalid). Also make a common label for invalidating the dentry. This will be used by the next patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:54 -04:00
Miklos Szeredi	5835f3390e	fuse: use d_materialise_unique() Use d_materialise_unique() instead of d_splice_alias(). This allows dentry subtrees to be moved to a new place if there moved, even if something is referencing a dentry in the subtree (open fd, cwd, etc..). This will also allow us to drop a subtree if it is found to be replaced by something else. In this case the disconnected subtree can later be reconnected to its new location. d_materialise_unique() ensures that a directory entry only ever has one alias. We keep fc->inst_mutex around the calls for d_materialise_unique() on directories to prevent a race with mkdir "stealing" the inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-05 16:23:53 -04:00
Dong Fang	05726acabe	fuse: use list_for_each_entry() for list traversing Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-09-04 17:42:42 +02:00
Miklos Szeredi	efeb9e60d4	fuse: readdir: check for slash in names Userspace can add names containing a slash character to the directory listing. Don't allow this as it could cause all sorts of trouble. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 14:28:38 +02:00
Maxim Patlasov	06a7c3c278	fuse: hotfix truncate_pagecache() issue The way how fuse calls truncate_pagecache() from fuse_change_attributes() is completely wrong. Because, w/o i_mutex held, we never sure whether 'oldsize' and 'attr->size' are valid by the time of execution of truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we released fc->lock in the middle of fuse_change_attributes(), we completely loose control of actions which may happen with given inode until we reach truncate_pagecache. The list of potentially dangerous actions includes mmap-ed reads and writes, ftruncate(2) and write(2) extending file size. The typical outcome of doing truncate_pagecache() with outdated arguments is data corruption from user point of view. This is (in some sense) acceptable in cases when the issue is triggered by a change of the file on the server (i.e. externally wrt fuse operation), but it is absolutely intolerable in scenarios when a single fuse client modifies a file without any external intervention. A real life case I discovered by fsx-linux looked like this: 1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends FUSE_SETATTR to the server synchronously, but before getting fc->lock ... 2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP to the server synchronously, then calls fuse_change_attributes(). The latter updates i_size, releases fc->lock, but before comparing oldsize vs attr->size.. 3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and updating attributes and i_size, but now oldsize is equal to outarg.attr.size because i_size has just been updated (step 2). Hence, fuse_do_setattr() returns w/o calling truncate_pagecache(). 4. As soon as ftruncate(2) completes, the user extends file size by write(2) making a hole in the middle of file, then reads data from the hole either by read(2) or mmap-ed read. The user expects to get zero data from the hole, but gets stale data because truncate_pagecache() is not executed yet. The scenario above illustrates one side of the problem: not truncating the page cache even though we should. Another side corresponds to truncating page cache too late, when the state of inode changed significantly. Theoretically, the following is possible: 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size changed (due to our own fuse_do_setattr()) and is going to call truncate_pagecache() for some 'new_size' it believes valid right now. But by the time that particular truncate_pagecache() is called ... 2. fuse_do_setattr() returns (either having called truncate_pagecache() or not -- it doesn't matter). 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). 4. mmap-ed write makes a page in the extended region dirty. The result will be the lost of data user wrote on the fourth step. The patch is a hotfix resolving the issue in a simplistic way: let's skip dangerous i_size update and truncate_pagecache if an operation changing file size is in progress. This simplistic approach looks correct for the cases w/o external changes. And to handle them properly, more sophisticated and intrusive techniques (e.g. NFS-like one) would be required. I'd like to postpone it until the issue is well discussed on the mailing list(s). Changed in v2: - improved patch description to cover both sides of the issue. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:58 +02:00
Anand Avati	d331a415ae	fuse: invalidate inode attributes on xattr modification Calls like setxattr and removexattr result in updation of ctime. Therefore invalidate inode attributes to force a refresh. Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:58 +02:00
Maxim Patlasov	4a4ac4eba1	fuse: postpone end_page_writeback() in fuse_writepage_locked() The patch fixes a race between ftruncate(2), mmap-ed write and write(2): 1) An user makes a page dirty via mmap-ed write. 2) The user performs shrinking truncate(2) intended to purge the page. 3) Before fuse_do_setattr calls truncate_pagecache, the page goes to writeback. fuse_writepage_locked fills FUSE_WRITE request and releases the original page by end_page_writeback. 4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex is free. 5) Ordinary write(2) extends i_size back to cover the page. Note that fuse_send_write_pages do wait for fuse writeback, but for another page->index. 6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request. fuse_send_writepage is supposed to crop inarg->size of the request, but it doesn't because i_size has already been extended back. Moving end_page_writeback to the end of fuse_writepage_locked fixes the race because now the fact that truncate_pagecache is successfully returned infers that fuse_writepage_locked has already called end_page_writeback. And this, in turn, infers that fuse_flush_writepages has already called fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2) could not extend it because of i_mutex held by ftruncate(2). Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-03 13:41:57 +02:00
Greg Kroah-Hartman	b78b6b3a9a	Merge 3.11-rc3 into driver-core-next We want these fixes in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-07-29 12:30:13 -07:00
Greg Kroah-Hartman	4183fb9503	cuse: convert class code to use dev_groups The dev_attrs field of struct class is going away soon, dev_groups should be used instead. This converts the cuse class code to use the correct field. Acked-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-07-26 18:05:18 -07:00
Miklos Szeredi	c7263bcdc4	fuse: readdirplus: cleanup Niels noted that we don't need the 'dentry = NULL' line. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Niels de Vos <ndevos@redhat.com>	2013-07-17 14:53:54 +02:00
Miklos Szeredi	fa2b721360	fuse: readdirplus: change attributes once If we got the inode through fuse_iget() then the attributes are already up-to-date. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-07-17 14:53:53 +02:00
Miklos Szeredi	2914941e31	fuse: readdirplus: fix instantiate Fuse does instantiation slightly differently from NFS/CIFS which use d_materialise_unique(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org	2013-07-17 14:53:53 +02:00
Miklos Szeredi	a28ef45cbb	fuse: readdirplus: sanity checks Add sanity checks before adding or updating an entry with data received from readdirplus. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org	2013-07-17 14:53:53 +02:00
Niels de Vos	53ce9a3364	fuse: readdirplus: fix dentry leak In case d_lookup() returns a dentry with d_inode == NULL, the dentry is not returned with dput(). This results in triggering a BUG() in shrink_dcache_for_umount_subtree(): BUG: Dentry ...{i=0,n=...} still in use (1) [unmount of fuse fuse] [SzM: need to d_drop() as well] Reported-by: Justin Clift <jclift@redhat.com> Signed-off-by: Niels de Vos <ndevos@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Brian Foster <bfoster@redhat.com> Tested-by: Niels de Vos <ndevos@redhat.com> CC: stable@vger.kernel.org	2013-07-17 14:53:53 +02:00
Jiang Liu	0ed5fd1385	mm: use totalram_pages instead of num_physpages at runtime The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-07-03 16:07:35 -07:00
Linus Torvalds	790eac5640	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...	2013-07-03 09:10:19 -07:00
Linus Torvalds	63580e51bb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS patches (part 1) from Al Viro: "The major change in this pile is ->readdir() replacement with ->iterate(), dealing with ->f_pos races in ->readdir() instances for good. There's a lot more, but I'd prefer to split the pull request into several stages and this is the first obvious cutoff point." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits) [readdir] constify ->actor [readdir] ->readdir() is gone [readdir] convert ecryptfs [readdir] convert coda [readdir] convert ocfs2 [readdir] convert fatfs [readdir] convert xfs [readdir] convert btrfs [readdir] convert hostfs [readdir] convert afs [readdir] convert ncpfs [readdir] convert hfsplus [readdir] convert hfs [readdir] convert befs [readdir] convert cifs [readdir] convert freevxfs [readdir] convert fuse [readdir] convert hpfs reiserfs: switch reiserfs_readdir_dentry to inode reiserfs: is_privroot_deh() needs only directory inode, actually ...	2013-07-02 09:28:37 -07:00
Al Viro	cb5e05d1a6	fuse: another open-coded file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:57:24 +04:00
Al Viro	8d3af7f333	[readdir] convert fuse Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:56:52 +04:00
Maxim Patlasov	14c14414d1	fuse: hold i_mutex in fuse_file_fallocate() Changing size of a file on server and local update (fuse_write_update_size) should be always protected by inode->i_mutex. Otherwise a race like this is possible: 1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE). fuse_file_fallocate() sends FUSE_FALLOCATE request to the server. 2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr() sends shrinking FUSE_SETATTR request to the server and updates local i_size by i_size_write(inode, outarg.attr.size). 3. Process 'A' resumes execution of fuse_file_fallocate() and calls fuse_write_update_size(inode, offset + length). But 'offset + length' was obsoleted by ftruncate from previous step. Changed in v2 (thanks Brian and Anand for suggestions): - made relation between mutex_lock() and fuse_set_nowrite(inode) more explicit and clear. - updated patch description to use ftruncate(2) in example Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-06-18 01:39:03 +02:00
Maxim Patlasov	e5c5f05dca	fuse: fix alignment in short read optimization for async_dio The bug was introduced with async_dio feature: trying to optimize short reads, we cut number-of-bytes-to-read to i_size boundary. Hence the following example: truncate --size=300 /mnt/file dd if=/mnt/file of=/dev/null iflag=direct led to FUSE_READ request of 300 bytes size. This turned out to be problem for userspace fuse implementations who rely on assumption that kernel fuse does not change alignment of request from client FS. The patch turns off the optimization if async_dio is disabled. And, if it's enabled, the patch fixes adjustment of number-of-bytes-to-read to preserve alignment. Note, that we cannot throw out short read optimization entirely because otherwise a direct read of a huge size issued on a tiny file would generate a huge amount of fuse requests and most of them would be ACKed by userspace with zero bytes read. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-06-03 15:15:42 +02:00
Brian Foster	c9ecf989cc	fuse: return -EIOCBQUEUED from fuse_direct_IO() for all async requests If request submission fails for an async request (i.e., get_user_pages() returns -ERESTARTSYS), we currently skip the -EIOCBQUEUED return and drop into wait_for_sync_kiocb() forever. Avoid this by always returning -EIOCBQUEUED for async requests. If an error occurs, the error is passed into fuse_aio_complete(), returned via aio_complete() and thus propagated to userspace via io_getevents(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-06-03 15:15:42 +02:00
Miklos Szeredi	28420dad23	fuse: fix readdirplus Oops in fuse_dentry_revalidate Fix bug introduced by commit `4582a4ab2a` "FUSE: Adapt readdirplus to application usage patterns". We need to check for a positive dentry; negative dentries are not added by readdirplus. Secondly we need to advise the use of readdirplus on the parent, otherwise the whole thing is useless. Thirdly all this is only relevant if "readdirplus_auto" mode is selected by the filesystem. We advise the use of readdirplus only if the dentry was still valid. If we had to redo the lookup then there was no use in doing the -plus version. Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Feng Shuo <steve.shuo.feng@gmail.com> CC: stable@vger.kernel.org	2013-06-03 14:40:22 +02:00
Brian Foster	bee6c30780	fuse: update inode size and invalidate attributes on fallocate An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the size of a file. Update the inode size after a successful fallocate. Also invalidate the inode attributes after a successful fallocate to ensure we pick up the latest attribute values (i.e., i_blocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-05-20 16:58:58 +02:00
Brian Foster	3634a63278	fuse: truncate pagecache range on hole punch fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE interface. When a hole punch is passed through, the page cache is not cleared and thus allows reading stale data from the cache. This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a smallish random data file into cache, punching a hole and creating a copy of the file. Drop caches or remount and observe that the original file no longer matches the file copied after the hole punch. The original file contains a zeroed range and the latter file contains stale data. Protect against writepage requests in progress and punch out the associated page cache range after a successful client fs hole punch. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-05-20 16:58:57 +02:00
Brian Foster	de82b92301	fuse: allocate for_background dio requests based on io->async state Commit `8b41e671` introduced explicit background checking for fuse_req structures with BUG_ON() checks for the appropriate type of request in in the associated send functions. Commit `bcba24cc` introduced the ability to send dio requests as background requests but does not update the request allocation based on the type of I/O request. As a result, a BUG_ON() triggers in the fuse_request_send_background() background path if an async I/O is sent. Allocate a request based on the async state of the fuse_io_priv to avoid the BUG. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-05-14 18:25:44 +02:00
Linus Torvalds	5af43c24ca	Merge branch 'akpm' (incoming from Andrew) Merge more incoming from Andrew Morton: - Various fixes which were stalled or which I picked up recently - A large rotorooting of the AIO code. Allegedly to improve performance but I don't really have good performance numbers (I might have lost the email) and I can't raise Kent today. I held this out of 3.9 and we could give it another cycle if it's all too late/scary. I ended up taking only the first two thirds of the AIO rotorooting. I left the percpu parts and the batch completion for later. - Linus * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits) aio: don't include aio.h in sched.h aio: kill ki_retry aio: kill ki_key aio: give shared kioctx fields their own cachelines aio: kill struct aio_ring_info aio: kill batch allocation aio: change reqs_active to include unreaped completions aio: use cancellation list lazily aio: use flush_dcache_page() aio: make aio_read_evt() more efficient, convert to hrtimers wait: add wait_event_hrtimeout() aio: refcounting cleanup aio: make aio_put_req() lockless aio: do fget() after aio_get_req() aio: dprintk() -> pr_debug() aio: move private stuff out of aio.h aio: add kiocb_cancel() aio: kill return value of aio_complete() char: add aio_{read,write} to /dev/{null,zero} aio: remove retry-based AIO ...	2013-05-07 20:49:51 -07:00
Kent Overstreet	a27bb332c0	aio: don't include aio.h in sched.h Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-05-07 20:16:25 -07:00
Linus Torvalds	a26ea93a3d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This contains two patchsets from Maxim Patlasov. The first reworks the request throttling so that only async requests are throttled. Wakeup of waiting async requests is also optimized. The second series adds support for async processing of direct IO which optimizes direct IO and enables the use of the AIO userspace interface." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add flag to turn on async direct IO fuse: truncate file if async dio failed fuse: optimize short direct reads fuse: enable asynchronous processing direct IO fuse: make fuse_direct_io() aware about AIO fuse: add support of async IO fuse: move fuse_release_user_pages() up fuse: optimize wake_up fuse: implement exclusive wakeup for blocked_waitq fuse: skip blocking on allocations of synchronous requests fuse: add flag fc->initialized fuse: make request allocations for background processing explicit	2013-05-07 10:12:32 -07:00
Miklos Szeredi	60b9df7a54	fuse: add flag to turn on async direct IO Without async DIO write requests to a single file were always serialized. With async DIO that's no longer the case. So don't turn on async DIO by default for fear of breaking backward compatibility. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-05-01 14:37:21 +02:00
Maxim Patlasov	efb9fa9e91	fuse: truncate file if async dio failed The patch improves error handling in fuse_direct_IO(): if we successfully submitted several fuse requests on behalf of synchronous direct write extending file and some of them failed, let's try to do our best to clean-up. Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-18 10:55:24 +02:00
Maxim Patlasov	439ee5f0c5	fuse: optimize short direct reads If user requested direct read beyond EOF, we can skip sending fuse requests for positions beyond EOF because userspace would ACK them with zero bytes read anyway. We can trust to i_size in fuse_direct_IO for such cases because it's called from fuse_file_aio_read() and the latter updates fuse attributes including i_size. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:59 +02:00
Maxim Patlasov	bcba24ccdc	fuse: enable asynchronous processing direct IO In case of synchronous DIO request (i.e. read(2) or write(2) for a file opened with O_DIRECT), the patch submits fuse requests asynchronously, but waits for their completions before return from fuse_direct_IO(). In case of asynchronous DIO request (i.e. libaio io_submit() or a file opened with O_DIRECT), the patch submits fuse requests asynchronously and return -EIOCBQUEUED immediately. The only special case is async DIO extending file. Here the patch falls back to old behaviour because we can't return -EIOCBQUEUED and update i_size later, without i_mutex hold. And we have no method to wait on real async I/O requests. The patch also clean __fuse_direct_write() up: it's better to update i_size in its callers. Thanks Brian for suggestion. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:59 +02:00
Maxim Patlasov	36cf66ed9f	fuse: make fuse_direct_io() aware about AIO The patch implements passing "struct fuse_io_priv *io" down the stack up to fuse_send_read/write where it is used to submit request asynchronously. io->async==0 designates synchronous processing. Non-trivial part of the patch is changes in fuse_direct_io(): resources like fuse requests and user pages cannot be released immediately in async case. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:59 +02:00
Maxim Patlasov	01e9d11a3e	fuse: add support of async IO The patch implements a framework to process an IO request asynchronously. The idea is to associate several fuse requests with a single kiocb by means of fuse_io_priv structure. The structure plays the same role for FUSE as 'struct dio' for direct-io.c. The framework is supposed to be used like this: - someone (who wants to process an IO asynchronously) allocates fuse_io_priv and initializes it setting 'async' field to non-zero value. - as soon as fuse request is filled, it can be submitted (in non-blocking way) by fuse_async_req_send() - when all submitted requests are ACKed by userspace, io->reqs drops to zero triggering aio_complete() In case of IO initiated by libaio, aio_complete() will finish processing the same way as in case of dio_complete() calling aio_complete(). But the framework may be also used for internal FUSE use when initial IO request was synchronous (from user perspective), but it's beneficial to process it asynchronously. Then the caller should wait on kiocb explicitly and aio_complete() will wake the caller up. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:59 +02:00
Maxim Patlasov	187c5c3633	fuse: move fuse_release_user_pages() up fuse_release_user_pages() will be indirectly used by fuse_send_read/write in future patches. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:58 +02:00
Miklos Szeredi	3c18ef8117	fuse: optimize wake_up Normally blocked_waitq will be inactive, so optimize this case. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 21:50:58 +02:00
Maxim Patlasov	722d2bea8c	fuse: implement exclusive wakeup for blocked_waitq The patch solves thundering herd problem. So far as previous patches ensured that only allocations for background may block, it's safe to wake up one waiter. Whoever it is, it will wake up another one in request_end() afterwards. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 12:31:45 +02:00
Maxim Patlasov	0aada88476	fuse: skip blocking on allocations of synchronous requests A task may have at most one synchronous request allocated. So these requests need not be otherwise limited. The patch re-works fuse_get_req() to follow this idea. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 12:31:45 +02:00
Maxim Patlasov	796523fb24	fuse: add flag fc->initialized Existing flag fc->blocked is used to suspend request allocation both in case of many background request submitted and period of time before init_reply arrives from userspace. Next patch will skip blocking allocations of synchronous request (disregarding fc->blocked). This is mostly OK, but we still need to suspend allocations if init_reply is not arrived yet. The patch introduces flag fc->initialized which will serve this purpose. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 12:31:44 +02:00
Maxim Patlasov	8b41e6715e	fuse: make request allocations for background processing explicit There are two types of processing requests in FUSE: synchronous (via fuse_request_send()) and asynchronous (via adding to fc->bg_queue). Fortunately, the type of processing is always known in advance, at the time of request allocation. This preparatory patch utilizes this fact making fuse_get_req() aware about the type. Next patches will use it. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-04-17 12:31:44 +02:00
Al Viro	6447a3cf19	get rid of pipe->inode it's used only as a flag to distinguish normal pipes/FIFOs from the internal per-task one used by file-to-file splice. And pipe->files would work just as well for that purpose... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:13:01 -04:00
Al Viro	8d71db4f08	lift sb_start_write/sb_end_write out of ->aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:12:55 -04:00
Eric W. Biederman	7f78e03513	fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-03-03 19:36:31 -08:00
Al Viro	6131ffaa1f	more file_inode() open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-27 16:59:05 -05:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00

1 2 3 4 5 ...

511 Commits