linux

Author	SHA1	Message	Date
Al Viro	ff01bb4832	fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:07 -05:00
Hugh Dickins	708e3508c2	tmpfs: clone shmem_file_splice_read() Copy __generic_file_splice_read() and generic_file_splice_read() from fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make page_cache_pipe_buf_ops and spd_release_page() accessible to it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:11 -07:00
Namhyung Kim	825cdcb1a5	splice: add wakeup_pipe_readers() Add and use wakeup_pipe_readers() to consolidate duplicated code. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 19:58:53 +02:00
Linus Torvalds	275220f0fc	Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...	2011-01-13 10:45:01 -08:00
Michał Mirosław	a8adbe378b	fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors This patch pulls calls to buf->ops->confirm() from all actors passed (also indirectly) to splice_from_pipe_feed(). Is avoiding the call to buf->ops->confirm() while splice()ing to /dev/null an intentional optimization? No other user does that, and this patch removes the special case. Against current linux.git `6313e3c217`. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-12-17 08:56:44 +01:00
Linus Torvalds	c66fb34794	Export 'get_pipe_info()' to other users And in particular, use it in 'pipe_fcntl()'. The other pipe functions do not need to use the 'careful' version, since they are only ever called for things that are already known to be pipes. The normal read/write/ioctl functions are called through the file operations structures, so if a file isn't a pipe, they'd never get called. But pipe_fcntl() is special, and called directly from the generic fcntl code, and needs to use the same careful function that the splice code is using. Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-11-28 14:09:57 -08:00
Linus Torvalds	71993e62a4	Rename 'pipe_info()' to 'get_pipe_info()' .. and change it to take the 'file' pointer instead of an inode, since that's what all users want anyway. The renaming is preparatory to exporting it to other users. The old 'pipe_info()' name was too generic and is already used elsewhere, so before making the function public we need to use a more specific name. Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-11-28 13:56:09 -08:00
Miklos Szeredi	6965031d33	splice: fix misuse of SPLICE_F_NONBLOCK SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the pipe. In __generic_file_splice_read(), however, it causes an EAGAIN if the page is currently being read. This makes it impossible to write an application that only wants failure if the pipe is full. For example if the same process is handling both ends of a pipe and isn't otherwise able to determine whether a splice to the pipe will fill it or not. We could make the read non-blocking on O_NONBLOCK or some other splice flag, but for now this is the simplest fix. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:52:56 +02:00
Andi Kleen	1676effca4	gcc-4.6: fs: fix unused but set warnings No real bugs I believe, just some dead code, and some shut up code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:12 +02:00
Changli Gao	19c9a49b43	splice: check f_mode for seekable file A seekable file is allowed without an llseek function, so the old check no longer works; check f_mode instead. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> ---- fs/splice.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-30 08:12:37 +02:00
Changli Gao	2cb4b05e76	splice: direct_splice_actor() should not use pos in sd direct_splice_actor() shouldn't use sd->pos, as sd->pos is for file reading, file->f_pos should be used instead. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> ---- fs/splice.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-30 08:12:37 +02:00
Nick Piggin	0ae0b5d055	fs/splice.c: fix mapping_gfp_mask usage mapping_gfp_mask() is not supposed to store allocation context details, only page location details. So mapping_gfp_mask should be applied to the pagecache page allocation, whereas normal (kernel mapped) memory should be used for surrounding allocations such as radix-tree nodes allocated by add_to_page_cache. Context modifiers should be applied on a per-callsite basis. So change splice to follow this convention (which is followed in similar code patterns in core code). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-25 10:25:26 +02:00
Jens Axboe	35f3d14dbb	pipe: add support for shrinking and growing pipes This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for growing and shrinking the size of a pipe and adjusts pipe.c and splice.c (and relay and network splice) usage to work with these larger (or smaller) pipes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:12:40 +02:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Changli Gao	cc56f7de7f	sendfile(): check f_op.splice_write() rather than f_op.sendpage() sendfile(2) was reworked with the splice infrastructure, but it still checks f_op.sendpage() instead of f_op.splice_write() wrongly. Although if f_op.sendpage() exists, f_op.splice_write() always exists at the same time currently, the assumption will be broken in future silently. This patch also brings a side effect: sendfile(2) can work with any output file. Some security checks related to f_op are added too. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-04 09:09:52 +01:00
Linus Torvalds	355bbd8cb8	Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...	2009-09-14 17:55:15 -07:00
Jan Kara	148f948ba8	vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2009-09-14 17:08:15 +02:00
Miklos Szeredi	723590ed52	splice: update mtime and atime on files Splice should update the modification and access times on regular files just like read and write. Not updating mtime will confuse backup tools, etc... This patch only adds the time updates for regular files. For pipes and other special files that splice touches the need for updating the times is less clear. Let's discuss and fix that separately. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:34:33 +02:00
Miklos Szeredi	b2858d7d16	splice: fix kmaps in default_file_splice_write() Unfortunately multiple kmap() within a single thread are deadlockable, so writing out multiple buffers with writev() isn't possible. Change the implementation so that it does a separate write() for each buffer. This actually simplifies the code a lot since the splice_from_pipe() helper can be used. This limitation is caused by HIGHMEM pages, and so only affects a subset of architectures and configurations. In the future it may be worth to implement default_file_splice_write() in a more efficient way on configs that allow it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 11:37:46 +02:00
Andrew Morton	77f6bf57ba	splice: fix error return code fs/splice.c: In function 'default_file_splice_read': fs/splice.c:566: warning: 'error' may be used uninitialized in this function which is sort-of true. The code will in fact return -ENOMEM instead of the kernel_readv() return value. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-14 09:49:44 +02:00
Jens Axboe	4f23122858	splice: fix repeated kmap()'s in default_file_splice_read() We cannot reliably map more than one page at a time, or we risk deadlocking. Just allocate the pages from low mem instead. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-13 08:35:35 +02:00
Miklos Szeredi	0b0a47f5c4	splice: implement default splice_write method If f_op->splice_write() is not implemented, fall back to a plain write. Use vfs_writev() to write from the pipe buffers. This will allow splice on all filesystems and file types. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:10 +02:00
Miklos Szeredi	6818173bd6	splice: implement default splice_read method If f_op->splice_read() is not implemented, fall back to a plain read. Use vfs_readv() to read into previously allocated pages. This will allow splice and functions using splice, such as the loop device, to work on all filesystems. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:10 +02:00
Miklos Szeredi	7c77f0b3f9	splice: implement pipe to pipe splicing Allow splice(2) to work when both the input and the output are pipes. Based on the implementation of the tee(2) syscall, but instead of duplicating the buffer references move the buffers from the input pipe to the output pipe. Moving the whole buffer only succeeds if the full length of the buffer is spliced. Otherwise duplicate the buffer, just like tee(2), set the length of the output buffer and advance the offset on the input buffer. Since splice is operating on two pipes, special care needs to be taken with locking to prevent an ABBA deadlock. Again this is done similarly to the tee(2) syscall, first preparing the input and output pipes so there's data to consume and space for that data, and then doing the move operation while holding both locks. If other processes are doing I/O on the same pipes parallel to the splice, then by the time both inodes are locked there might be no buffers left to move, or no space to move them to. In this case retry the whole operation, including the preparation phase. This could lead to starvation, but I'm not sure if that's serious enough to worry about. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:09 +02:00
Randy Dunlap	b80901bbf5	splice: fix new kernel-doc warnings splice: fix kernel-doc warnings Warning(fs/splice.c:617): bad line: Warning(fs/splice.c:722): No description found for parameter 'sd' Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-17 07:38:07 -07:00
Miklos Szeredi	61e0d47c33	splice: add helpers for locking pipe inode There are lots of sequences like this, especially in splice code: if (pipe->inode) mutex_lock(&pipe->inode->i_mutex); /* do something */ if (pipe->inode) mutex_unlock(&pipe->inode->i_mutex); so introduce helpers which do the conditional locking and unlocking. Also replace the inode_double_lock() call with a pipe_double_lock() helper to avoid spreading the use of this functionality beyond the pipe code. This patch is just a cleanup, and should cause no behavioral changes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	f8cc774ce4	splice: remove generic_file_splice_write_nolock() Remove the now unused generic_file_splice_write_nolock() function. It's conceptually broken anyway, because splice may need to wait for pipe events so holding locks across the whole operation is wrong. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	328eaaba4e	ocfs2: fix i_mutex locking in ocfs2_splice_to_file() Rearrange locking of i_mutex on destination and call to ocfs2_rw_lock() so locks are only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	eb443e5a25	splice: fix i_mutex locking in generic_splice_write() Rearrange locking of i_mutex on destination so it's only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Miklos Szeredi	2933970b96	splice: remove i_mutex locking in splice_from_pipe() splice_from_pipe() is only called from two places: - generic_splice_sendpage() - splice_write_null() Neither of these require i_mutex to be taken on the destination inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Miklos Szeredi	b3c2d2ddd6	splice: split up __splice_from_pipe() Split up __splice_from_pipe() into four helper functions: splice_from_pipe_begin() splice_from_pipe_next() splice_from_pipe_feed() splice_from_pipe_end() splice_from_pipe_next() will wait (if necessary) for more buffers to be added to the pipe. splice_from_pipe_feed() will feed the buffers to the supplied actor and return when there's no more data available (or if all of the requested data has been copied). This is necessary so that implementations can do locking around the non-waiting splice_from_pipe_feed(). This patch should not cause any change in behavior. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Miklos Szeredi	7bfac9ecf0	splice: fix deadlock in splicing to file There's a possible deadlock in generic_file_splice_write(), splice_from_pipe() and ocfs2_file_splice_write(): - task A calls generic_file_splice_write() - this calls inode_double_lock(), which locks i_mutex on both pipe->inode and target inode - ordering depends on inode pointers, can happen that pipe->inode is locked first - __splice_from_pipe() needs more data, calls pipe_wait() - this releases lock on pipe->inode, goes to interruptible sleep - task B calls generic_file_splice_write(), similarly to the first - this locks pipe->inode, then tries to lock inode, but that is already held by task A - task A is interrupted, it tries to lock pipe->inode, but fails, as it is already held by task B - ABBA deadlock Fix this by explicitly ordering locks: the outer lock must be on target inode and the inner lock (which is later unlocked and relocked) must be on pipe->inode. This is OK, pipe inodes and target inodes form two nonoverlapping sets, generic_file_splice_write() and friends are not called with a target which is a pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:34:46 -07:00
David Howells	266cf658ef	FS-Cache: Recruit a page flags for cache management Recruit a page flag to aid in cache management. The following extra flag is defined: (1) PG_fscache (PG_private_2) The marked page is backed by a local cache and is pinning resources in the cache driver. If PG_fscache is set, then things that checked for PG_private will now also check for that. This includes things like truncation and page invalidation. The function page_has_private() had been added to make the checks for both PG_private and PG_private_2 at the same time. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>	2009-04-03 16:42:36 +01:00
Heiko Carstens	836f92adf1	[CVE-2009-0029] System call wrappers part 31 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
KAMEZAWA Hiroyuki	08e552c69c	memcg: synchronized LRU A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's its own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup is one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:05 -08:00
Nick Piggin	4e02ed4b4a	fs: remove prepare_write/commit_write Nothing uses prepare_write or commit_write. Remove them from the tree completely. [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-30 11:38:45 -07:00
Linus Torvalds	efc968d450	Don't allow splice() to files opened with O_APPEND This is debatable, but while we're debating it, let's disallow the combination of splice and an O_APPEND destination. It's not entirely clear what the semantics of O_APPEND should be, and POSIX apparently expects pwrite() to ignore O_APPEND, for example. So we could make up any semantics we want, including the old ones. But Miklos convinced me that we should at least give it some thought, and that accepting writes at arbitrary offsets is wrong at least for IS_APPEND() files (which always have O_APPEND set, even if the reverse isn't true: you can obviously have O_APPEND set on a regular file). So disallow O_APPEND entirely for now. I doubt anybody cares, and this way we have one less gray area to worry about. Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Jens Axboe <ens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-09 14:26:38 -07:00
Nick Piggin	529ae9aaa0	mm: rename page trylock Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-04 21:31:34 -07:00
Miklos Szeredi	2f1936b877	[patch 3/5] vfs: change remove_suid() to file_remove_suid() All calls to remove_suid() are made with a file pointer, because (similarly to file_update_time) it is called when the file is written. Clean up callers by passing in a file instead of a dentry. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-07-26 20:53:16 -04:00
Nick Piggin	bc40d73c95	splice: use get_user_pages_fast Use get_user_pages_fast in splice. This reverts some mmap_sem batching there, however the biggest problem with mmap_sem tends to be hold times blocking out other threads rather than cacheline bouncing. Further: on architectures that implement get_user_pages_fast without locks, mmap_sem can be avoided completely anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:06 -07:00
Miklos Szeredi	32502b8413	splice: fix generic_file_splice_read() race with page invalidation If a page was invalidated during splicing from file to a pipe, then generic_file_splice_read() could return a short or zero count. This manifested itself in rare I/O errors seen on nfs exported fuse filesystems. This is because nfsd uses splice_direct_to_actor() to read files, and fuse uses invalidate_inode_pages2() to invalidate stale data on open. Fix by redoing the page find/create if it was found to be truncated (invalidated). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-07-04 09:52:14 +02:00
Jens Axboe	ca39d651d1	splice: handle try_to_release_page() failure splice currently assumes that try_to_release_page() always succeeds, but it can return failure. If it does, we cannot steal the page. Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-05-28 14:49:27 +02:00
Tom Zanussi	a82c53a0e3	splice: fix sendfile() issue with relay Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-05-28 14:49:27 +02:00
Jens Axboe	75065ff619	Revert "relay: fix splice problem" This reverts commit `c3270e577c`.	2008-05-08 14:06:19 +02:00
Miklos Szeredi	7f3d4ee108	vfs: splice remove_suid() cleanup generic_file_splice_write() duplicates remove_suid() just because it doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe() anyway, so this is rather pointless. Move locking to generic_file_splice_write() and call remove_suid() and __splice_from_pipe() instead. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-05-07 09:29:00 +02:00
Tom Zanussi	c3270e577c	relay: fix splice problem Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 09:48:15 +02:00
Jens Axboe	8191ecd1d1	splice: fix infinite loop in generic_file_splice_read() There's a quirky loop in generic_file_splice_read() that could go on indefinitely, if the file splice returns 0 permanently (and not just as a temporary condition). Get rid of the loop and pass back -EAGAIN correctly from __generic_file_splice_read(), so we handle that condition properly as well. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-10 08:24:25 +02:00
Hugh Dickins	4cd1350465	splice: use mapping_gfp_mask The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its mapping_gfp_mask, to avoid hangs under memory pressure. But nowadays it uses splice, usually going through __generic_file_splice_read. That must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-03 15:39:49 -07:00
Jens Axboe	02cf01aea5	splice: only return -EAGAIN if there's hope of more data sys_tee() currently is a bit eager in returning -EAGAIN, it may do so even if we don't have a chance of any more data becoming available. So improve the logic and only return -EAGAIN if we have an attached writer to the input pipe. Reported by Johann Felix Soden <johfel@gmx.de> and Patrick McManus <mcmanus@ducksong.com>. Tested-by: Johann Felix Soden <johfel@users.sourceforge.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-03-04 11:14:39 +01:00
Bastian Blank	712a30e63c	splice: fix user pointer access in get_iovec_page_array() Commit `8811930dc7` ("splice: missing user pointer access verification") added the proper access_ok() calls to copy_from_user_mmap_sem() which ensures we can copy the struct iovecs from userspace to the kernel. But we also must check whether we can access the actual memory region pointed to by the struct iovec to fix the access checks properly. Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Oliver Pinter <oliver.pntr@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-10 10:27:21 -08:00
Jens Axboe	8811930dc7	splice: missing user pointer access verification vmsplice_to_user() must always check the user pointer and length with access_ok() before copying. Likewise, for the slow path of copy_from_user_mmap_sem() we need to check that we may read from the user region. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Cc: Wojciech Purczynski <cliph@research.coseinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:25:01 -08:00
Jens Axboe	8084870854	splice: always updated atime in direct splice Andre Majorel <aym-xunil@teaser.fr> points out that if we only updated the atime when we transfer some data, we deviate from the standard of always updating the atime. So change splice to always call file_accessed() even if splice_direct_to_actor() didn't transfer any data. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-02-01 09:26:32 +01:00
Jens Axboe	9e97198dbf	splice: fix problem with atime not being updated A bug report on nfsd that states that since it was switched to use splice instead of sendfile, the atime was no longer being updated on the input file. do_generic_mapping_read() does this when accessing the file, make splice do it for the direct splice handler. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-01-29 21:55:20 +01:00
Jens Axboe	bbdfc2f706	[SPLICE]: Don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:53:30 -08:00
James Morris	c43e259cc7	security: call security_file_permission from rw_verify_area All instances of rw_verify_area() are followed by a call to security_file_permission(), so just call the latter from the former. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2008-01-25 11:29:52 +11:00
Serge E. Hallyn	b53767719b	Implement file posix capabilities Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. 
One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fscaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Linus Torvalds	92d15c2ccb	Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block * 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits) Fix memory leak in dm-crypt SPARC64: sg chaining support SPARC: sg chaining support PPC: sg chaining support PS3: sg chaining support IA64: sg chaining support x86-64: enable sg chaining x86-64: update pci-gart iommu to sg helpers x86-64: update nommu to sg helpers x86-64: update calgary iommu to sg helpers swiotlb: sg chaining support i386: enable sg chaining i386 dma_map_sg: convert to using sg helpers mmc: need to zero sglist on init Panic in blk_rq_map_sg() from CCISS driver remove sglist_len remove blk_queue_max_phys_segments in libata revert sg segment size ifdefs Fixup u14-34f ENABLE_SG_CHAINING qla1280: enable use_sg_chaining option ...	2007-10-16 10:09:16 -07:00
Nick Piggin	afddba49d1	fs: introduce write_begin, write_end, and perform_write aops These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). [mark.fasheh@oracle.com: API design contributions, code review and fixes] [akpm@linux-foundation.org: various fixes] [dmonakhov@sw.ru: new aop block_write_begin fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:55 -07:00
Fengguang Wu	f4e6b498d6	readahead: combine file_ra_state.prev_index/prev_offset into prev_pos Combine the file_ra_state members unsigned long prev_index unsigned int prev_offset into loff_t prev_pos It is more consistent and better supports huge files. Thanks to Peter for the nice proposal! [akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:52 -07:00
Jens Axboe	6866bef40d	splice: fix double kunmap() in vmsplice copy path The out label should not include the unmap, the only way to jump there already has unmapped the source. 00002000 f7c21a00 00000000 00000000 c0489036 00018e32 00000002 00000000 00001000 Call Trace: [<c0487dd9>] pipe_to_user+0xca/0xd3 [<c0488233>] __splice_from_pipe+0x53/0x1bd [<c0454947>] ------------[ cut here ]------------ filemap_fault+0x221/0x380 [<c0487d0f>] pipe_to_user+0x0/0xd3 [<c0489036>] sys_vmsplice+0x3b7/0x422 [<c045ec3f>] kernel BUG at mm/highmem.c:206! handle_mm_fault+0x4d5/0x8eb [<c041ed5b>] kmap_atomic+0x1c/0x20 [<c045d33d>] unmap_vmas+0x3d1/0x584 [<c045f717>] free_pgtables+0x90/0xa0 [<c041d84b>] pgd_dtor+0x0/0x1 [<c044d665>] audit_syscall_exit+0x2aa/0x2c6 [<c0407817>] do_syscall_trace+0x124/0x169 [<c0404df2>] syscall_call+0x7/0xb ======================= Code: 2d 00 d0 5b 00 25 00 00 e0 ff 29 invalid opcode: 0000 [#1] c2 89 d0 c1 e8 0c 8b 14 85 a0 6c 7c c0 4a 85 d2 89 14 85 a0 6c 7c c0 74 07 31 c9 4a 75 15 eb 04 <0f> 0b eb fe 31 c9 81 3d 78 38 6d c0 78 38 6d c0 0f 95 c1 b0 01 EIP: [<c045bbc3>] kunmap_high+0x51/0x8e SS:ESP 0068:f5960df0 SMP Modules linked in: netconsole autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cmib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video output sbs batteryac parport_pc lp parport sg i2c_piix4 i2c_core floppy cfi_probe gen_probe scb2_flash mtd chipreg tg3 e1000 button ide_cd serio_raw cdrom aic7xxx scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd CPU: 3 EIP: 0060:[<c045bbc3>] Not tainted VLI EFLAGS: 00010246 (2.6.23 #1) EIP is at kunmap_high+0x51/0x8e Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-16 10:01:29 +02:00
Linus Torvalds	7572395767	Fix possible splice() mmap_sem deadlock Nick Piggin points out that splice isn't being good about the mmap semaphore: while two readers can nest inside each others, it does leave a possible deadlock if a writer (ie a new mmap()) comes in during that nesting. Original "just move the locking" patch by Nick, replaced by one by me based on an optimistic pagefault_disable(). And then Jens tested and updated that patch. Reported-by: Nick Piggin <npiggin@suse.de> Tested-by: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-01 13:17:28 -07:00
Randy Dunlap	79685b8dee	docbook: add pipes, other fixes Fix some typos in pipe.c and splice.c. Add pipes API to kernel-api.tmpl. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-27 08:08:51 +02:00
Jens Axboe	6a860c979b	splice: fix bad unlock_page() in error case If add_to_page_cache_lru() fails, the page will not be locked. But splice jumps to an error path that does a page release and unlock, causing a BUG() in unlock_page(). Fix this by adding one more label that just releases the page. This bug was actually triggered on EL5 by gurudas pai <gurudas.pai@oracle.com> using fio. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-20 09:07:01 -07:00
Rusty Russell	cf914a7d65	readahead: split ondemand readahead interface into two functions Split ondemand readahead interface into two functions. I think this makes it a little clearer for non-readahead experts (like Rusty). Internally they both call ondemand_readahead(), but the page argument is changed to an obvious boolean flag. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	d8983910a4	readahead: pass real splice size Pass real splice size to page_cache_readahead_ondemand(). The splice code works in chunks of 16 pages internally. The readahead code should be told of the overall splice size, instead of the internal chunk size. Otherwise bad things may happen. Imagine some 17-page random splice reads. The code before this patch will result in two readahead calls: readahead(16); readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O and 31 readahead miss pages. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	431a4820bf	readahead: move synchronous readahead call out of splice loop Move synchronous page_cache_readahead_ondemand() call out of splice loop. This avoids one pointless page allocation/insertion in case of non-zero ra_pages, or many pointless readahead calls in case of zero ra_pages. Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will not get expected readahead behavior anyway. The splice code works in batches of 16 pages, which can be taken as another form of synchronous readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	a08a166fe7	readahead: convert splice invocations Convert splice reads to use on-demand readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Jens Axboe <axboe@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Jens Axboe	bcd4f3acba	splice: direct splicing updates ppos twice OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> reported that he's noticed nfsd read corruption in recent kernels, and did the hard work of discovering that it's due to splice updating the file position twice. This means that the next operation would start further ahead than it should. nfsd_vfs_read() splice_direct_to_actor() while(len) { do_splice_to() [update sd->pos] -> generic_file_splice_read() [read from sd->pos] nfsd_direct_splice_actor() -> __splice_from_pipe() [update sd->pos] There's nothing wrong with the core splice code, but the direct splicing is an addon that calls both input and output paths. So it has to take care in locally caching offset so it remains correct. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 15:02:48 +02:00
Jens Axboe	51a92c0f6c	splice: fix offset mangling with direct splicing (sendfile) If the output actor doesn't transfer the full amount of data, we will increment ppos too much. Two related bugs in there: - We need to break out and return actor() retval if it is shorter than what we spliced into the pipe. - Adjust ppos only according to actor() return. Also fix loop problem in generic_file_splice_read(), it should not keep going when data has already been transferred. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-13 14:14:31 +02:00
James Morris	29ce20586b	security: revalidate rw permissions for sys_splice and sys_vmsplice Revalidate read/write permissions for splice(2) and vmsplice(2), in case security policy has changed since the files were opened. Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-13 14:14:29 +02:00
Jens Axboe	0845718daf	pipe: add documentation and comments As per Andrew Morton's request, here's a set of documentation for the generic pipe_buf_operations hooks, the pipe, and pipe_buffer structures. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:16 +02:00
Jens Axboe	cac36bb06e	pipe: change the ->pin() operation to ->confirm() The name 'pin' was badly chosen, it doesn't pin a pipe buffer in the most commonly used sense in the kernel. So change the name to 'confirm', after debating this issue with Hugh Dickins a bit. A good return from ->confirm() means that the buffer is really there, and that the contents are good. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:15 +02:00
Jens Axboe	932cc6d4f7	splice: completely document external interface with kerneldoc Also add fs/splice.c as a kerneldoc target with a smaller blurb that should be expanded to better explain the overview of splice. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:15 +02:00
Jens Axboe	497f9625c2	pipe: allow passing around of ops private pointer relay needs this for proper consumption handling, and the network receive support needs it as well to lookup the sk_buff on pipe release. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:14 +02:00
Jens Axboe	d6b29d7cee	splice: divorce the splice structure/function definitions from the pipe header We need to move even more stuff into the header so that folks can use the splice_to_pipe() implementation instead of open-coding a lot of pipe knowledge (see relay implementation), so move to our own header file finally. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:14 +02:00
Jens Axboe	6a14b90bb6	vmsplice: add vmsplice-to-user support A bit of a cheat, it actually just copies the data to userspace. But this makes the interface nice and symmetric and enables people to build on splice, with room for future improvement in performance. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:12 +02:00
Jens Axboe	c66ab6fa70	splice: abstract out actor data For direct splicing (or private splicing), the output may not be a file. So abstract out the handling into a specified actor function and put the data in the splice_desc structure earlier, so we can build on top of that. This is the first step in better splice handling for drivers, and also for implementing vmsplice _to_ user memory. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:12 +02:00
Jens Axboe	02676e5aee	splice: only check do_wakeup in splice_to_pipe() for a real pipe We only ever set do_wakeup to non-zero if the pipe has an inode backing, so it's pointless to check outside the pipe->inode check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-15 13:16:13 +02:00
Jens Axboe	00de00bdad	splice: fix leak of pages on short splice to pipe If the destination pipe is full and we already transferred data, we break out instead of waiting for more pipe room. The exit logic looks at spd->nr_pages to see if we moved everything inside the spd container, but we decrement that variable in the loop to decide when spd has emptied. Instead we want to compare to the original page count in the spd, so cache that in a local variable. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-15 13:14:22 +02:00
Jens Axboe	17ee4f49ab	splice: adjust balance_dirty_pages_ratelimited() call As we have potentially dirtied more than 1 page, we should indicate as such to the dirty page balancing. So call balance_dirty_pages_ratelimited_nr() and pass in the approximate number of pages we dirtied. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-15 13:10:37 +02:00
Jens Axboe	620a324b74	splice: __generic_file_splice_read: fix read/truncate race Original patch and description from Neil Brown <neilb@suse.de>, merged and adapted to splice branch by me. Neil's text follows: __generic_file_splice_read() currently samples the i_size at the start and doesn't do so again unless it needs to call ->readpage to load a page. After ->readpage it has to re-sample i_size as a truncate may have caused that page to be filled with zeros, and the read() call should not see these. However there are other activities that might cause ->readpage to be called on a page between the time that __generic_file_splice_read() samples i_size and when it finds that it has an uptodate page. These include at least read-ahead and possibly another thread performing a read. So we must sample i_size after it has an uptodate page. Thus the current sampling at the start and after a read can be replaced with a sampling before page addition into spd. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-08 08:34:11 +02:00
Hugh Dickins	475ecade68	splice: __generic_file_splice_read: fix i_size_read() length checks __generic_file_splice_read's partial page check, at eof after readpage, not only got its calculations wrong, but also reused the loff variable: causing data corruption when splicing from a non-0 offset in the file's last page (revealed by ext2 -b 1024 testing on a loop of a tmpfs file). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-08 08:34:05 +02:00
Jens Axboe	20d698db67	splice: move balance_dirty_pages_ratelimited() outside of splice actor I've seen inode related deadlocks, so move this call outside of the actor itself, which may hold the inode lock. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-08 08:33:59 +02:00
Jens Axboe	267adc3e66	splice: remove do_splice_direct() symbol export It's only supposed to be used by do_sendfile(), which is never modular. So kill the export. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-08 08:33:41 +02:00
Jens Axboe	d366d39885	splice: move inode size check into generic_file_splice_read() Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-06-08 08:32:38 +02:00
Jens Axboe	86aa5ac53e	[PATCH] splice: always call into page_cache_readahead() Don't try to guess what the read-ahead logic will do, allow it to make its own decisions. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-05-08 08:46:19 +02:00
Fengguang Wu	9ae9d68cbf	[PATCH] splice(): fix interaction with readahead Eric Dumazet, thank you for disclosing this bug. Readahead logic somehow fails to populate the page range with data. It can be because 1) the readahead routine is not always called in the following lines of fs/splice.c: if (!loff || nr_pages > 1) page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages); 2) even called, page_cache_readahead() won't guarantee the pages are there. It won't submit readahead I/O for pages already in the radix tree, or when (ra_pages == 0), or after 256 cache hits. In your case, it should be because of the retried reads, which lead to excessive cache hits, and disables readahead at some time. And that _one_ failure of readahead blocks the whole read process. The application receives EAGAIN and retries the read, but __generic_file_splice_read() refuses to make progress: - in the previous invocation, it has allocated a blank page and inserted it into the radix tree, but never has the chance to start I/O for it: the test of SPLICE_F_NONBLOCK goes before that. - in the retried invocation, the readahead code will neither get out of the cache hit mode, nor will it submit I/O for an already existing page. Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-05-08 08:44:36 +02:00
Dmitriy Monakhov	d9993c37ef	[PATCH] splice: partial write fix Currently if partial write has happened while ->commit_write() then page wasn't marked as accessed and rebalanced. Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-03-29 14:26:42 +02:00
Mark Fasheh	40bee44eae	Export __splice_from_pipe() Ocfs2 wants to implement its own splice write actor so that it can better manage cluster / page locks. This lets us re-use the rest of splice write while only providing our own code where it's actually important. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-03-27 08:55:47 +02:00
Nick Piggin	08c7259163	2/2 splice: dont readpage Splice does not need to readpage to bring the page uptodate before writing to it, because prepare_write will take care of that for us. Splice is also wrong to SetPageUptodate before the page is actually uptodate. This results in the old uninitialised memory leak. This gets fixed as a matter of course when removing the readpage logic. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-03-27 08:55:39 +02:00
Nick Piggin	485ddb4b97	1/2 splice: dont steal Stealing pages with splice is problematic because we cannot just insert an uptodate page into the pagecache and hope the filesystem can take care of it later. We also cannot just ClearPageUptodate, then hope prepare_write does not write anything into the page, because I don't think prepare_write gives that guarantee. Remove support for SPLICE_F_MOVE for now. If we really want to bring it back, we might be able to do so with the new filesystem buffered write aops APIs I'm working on. If we really don't want to bring it back, then we should decide that sooner rather than later, and remove the flag and all the stealing infrastructure before anybody starts using it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-03-27 08:55:08 +02:00
Eric Dumazet	d4c3cca941	[PATCH] constify pipe_buf_operations - pipe/splice should use const pipe_buf_operations and file_operations - struct pipe_inode_info has an unused field "start" : get rid of it. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-13 09:05:47 -08:00
Josef "Jeff" Sipek	0f7fc9e4d0	[PATCH] VFS: change struct file to use struct path This patch changes struct file to use struct path instead of having independent pointers to struct dentry and struct vfsmount, and converts all users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}. Additionally, it adds two #define's to make the transition easier for users of the f_dentry and f_vfsmnt. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:41 -08:00
Jens Axboe	ddac0d39cf	[PATCH] splice: fix problem introduced with inode diet After the inode slimming patch that unionised i_pipe/i_bdev/i_cdev, it's no longer enough to check for existence of ->i_pipe to verify that this is a pipe. Original patch from Eric Dumazet <dada1@cosmosbay.com> Final solution suggested by Linus. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-11-04 08:45:39 -08:00
Nick Piggin	2ae88149a2	[PATCH] mm: clean up pagecache allocation - Consolidate page_cache_alloc - Fix splice: only the pagecache pages and filesystem data need to use mapping_gfp_mask. - Fix grab_cache_page_nowait: same as splice, also honour NUMA placement. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-10-28 11:30:50 -07:00
Jens Axboe	8c34e2d632	[PATCH] Remove SUID when splicing into an inode Originally from Mark Fasheh <mark.fasheh@oracle.com> generic_file_splice_write() does not remove S_ISUID or S_ISGID. This is inconsistent with the way we generally write to files. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2006-10-19 20:53:09 +02:00
Mark Fasheh	6da6180982	[PATCH] Introduce generic_file_splice_write_nolock() This allows file systems to manage their own i_mutex locking while still re-using the generic_file_splice_write() logic. OCFS2 in particular wants this so that it can order cluster locks within i_mutex. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2006-10-19 20:53:08 +02:00
Mark Fasheh	62752ee198	[PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2006-10-19 20:53:08 +02:00
Jens Axboe	e6e80f294c	[PATCH] splice: fix pipe_to_file() ->prepare_write() error path Don't jump to the unlock+release path, we already did that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2006-10-12 15:08:51 +02:00
Jens Axboe	0fe2347957	[PATCH] Update axboe@suse.de email address As people often look for the copyright in files to see who to mail, update the link to a neutral one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2006-09-30 20:52:34 +02:00

204 Commits