2005-04-16 15:20:36 -07:00
/*
 *  linux/fs/namespace.c
 *
 * (C) Copyright Al Viro 2000, 2001
 *	Released under GPL v2.
 *
 * Based on code from fs/super.c, copyright Linus Torvalds and others.
 * Heavily rewritten.
 */
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/sched.h>
fs: brlock vfsmount_lock

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.

The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability will not be
much improved in common cases yet, due to other locks (i.e. dcache_lock) getting
in the way. However, path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).

The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times) took 6.8s before this
patch and 7.1s afterwards, about 5% slower.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 04:37:39 +10:00
#include <linux/spinlock.h>
#include <linux/percpu.h>
2005-04-16 15:20:36 -07:00
#include <linux/init.h>
2006-09-29 01:58:57 -07:00
#include <linux/kernel.h>
2005-04-16 15:20:36 -07:00
#include <linux/acct.h>
2006-01-11 12:17:46 -08:00
#include <linux/capability.h>
[PATCH] r/o bind mounts: track numbers of writers to mounts

This is the real meat of the entire series. It actually
implements the tracking of the number of writers to a mount.
However, it causes scalability problems because there can be
hundreds of cpus doing open()/close() on files on the same mnt at
the same time. Even an atomic_t in the mnt has massive scaling
problems because the cacheline gets so terribly contended.

This uses a statically-allocated percpu variable. All want/drop
operations are local to a cpu as long as that cpu operates on the same
mount, and there are no writer count imbalances. Writer count
imbalances happen when a write is taken on one cpu, and released
on another, like when an open/close pair is performed on two
different cpus.

Upon a remount,ro request, all of the data from the percpu
variables is collected (expensive, but very rare) and we determine
if there are any outstanding writers to the mount.

I've written a little benchmark to sit in a loop for a couple of
seconds in several cpus in parallel doing open/write/close loops.
http://sr71.net/~dave/linux/openbench.c

The code in here is a worst-possible case for this patch. It
does opens on a _pair_ of files in two different mounts in parallel.
This should cause my code to lose its "operate on the same mount"
optimization completely. This worst-case scenario causes a 3%
degradation in the benchmark.

I could probably get rid of even this 3%, but it would be more
complex than what I have here, and I think this is getting into
acceptable territory. In practice, I expect writing more than 3
bytes to a file, as well as disk I/O, to mask any effects that this
has.

(To get rid of that 3%, we could have a #defined number of mounts
in the percpu variable. So, instead of a CPU getting to operate only
on percpu data when it accesses only one mount, it could stay on
percpu data when it only accesses N or fewer mounts.)

[AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-15 14:37:59 -08:00
#include <linux/cpumask.h>
2005-04-16 15:20:36 -07:00
#include <linux/module.h>
2006-08-14 22:43:23 -07:00
#include <linux/sysfs.h>
2005-04-16 15:20:36 -07:00
#include <linux/seq_file.h>
2006-12-08 02:37:56 -08:00
#include <linux/mnt_namespace.h>
2005-04-16 15:20:36 -07:00
#include <linux/namei.h>
2009-07-08 01:54:37 +04:00
#include <linux/nsproxy.h>
2005-04-16 15:20:36 -07:00
#include <linux/security.h>
#include <linux/mount.h>
2006-09-30 20:52:18 +02:00
#include <linux/ramfs.h>
2008-02-06 01:37:57 -08:00
#include <linux/log2.h>
2008-03-26 22:11:34 +01:00
#include <linux/idr.h>
2009-03-29 19:50:06 -04:00
#include <linux/fs_struct.h>
2009-12-17 21:24:27 -05:00
#include <linux/fsnotify.h>
2005-04-16 15:20:36 -07:00
#include <asm/uaccess.h>
#include <asm/unistd.h>
2005-11-07 17:19:07 -05:00
#include "pnode.h"
2007-07-15 23:41:25 -07:00
#include "internal.h"
2005-04-16 15:20:36 -07:00
2008-02-06 01:37:57 -08:00
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
2005-11-07 17:15:49 -05:00
static int event;
2008-03-26 22:11:34 +01:00
static DEFINE_IDA(mnt_id_ida);
2008-03-27 13:06:23 +01:00
static DEFINE_IDA(mnt_group_ida);
2010-08-18 04:37:39 +10:00
static DEFINE_SPINLOCK(mnt_id_lock);
2009-06-24 03:12:00 -04:00
static int mnt_id_start = 0;
static int mnt_group_start = 1;
2005-04-16 15:20:36 -07:00
2006-03-26 01:37:24 -08:00
static struct list_head *mount_hashtable __read_mostly;
2006-12-06 20:33:20 -08:00
static struct kmem_cache *mnt_cache __read_mostly;
2005-11-07 17:17:51 -05:00
static struct rw_semaphore namespace_sem;
2005-04-16 15:20:36 -07:00
2006-01-16 22:14:23 -08:00
/* /sys/fs */
2007-10-29 14:17:23 -06:00
struct kobject *fs_kobj;
EXPORT_SYMBOL_GPL(fs_kobj);
2006-01-16 22:14:23 -08:00
2010-08-18 04:37:39 +10:00
/*
 * vfsmount lock may be taken for read to prevent changes to the
 * vfsmount hash, ie. during mountpoint lookups or walking back
 * up the tree.
 *
 * It should be taken for write in all cases where the vfsmount
 * tree or hash is modified or when a vfsmount structure is modified.
 */
DEFINE_BRLOCK(vfsmount_lock);
2005-04-16 15:20:36 -07:00
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
2005-11-07 17:16:09 -05:00
	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
	tmp += ((unsigned long)dentry / L1_CACHE_BYTES);
2008-02-06 01:37:57 -08:00
	tmp = tmp + (tmp >> HASH_SHIFT);
	return tmp & (HASH_SIZE - 1);
2005-04-16 15:20:36 -07:00
}
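
The bucket computation in hash() above can be tried out in userspace. The following is a minimal sketch, not kernel code: the `SKETCH_` constants are assumptions (a 64-byte cache line, and a shift of 8, which is what ilog2(PAGE_SIZE / sizeof(struct list_head)) would give with 4 KiB pages and a 16-byte list_head), and plain integers stand in for the vfsmount and dentry pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the kernel derives these per arch. */
#define SKETCH_L1_CACHE_BYTES 64UL
#define SKETCH_HASH_SHIFT     8
#define SKETCH_HASH_SIZE      (1UL << SKETCH_HASH_SHIFT)

/*
 * Same arithmetic as hash(): drop the low bits that are identical for
 * objects sharing a cache line, add the two addresses, fold the high
 * bits down, and mask to the table size.
 */
static unsigned long sketch_hash(uintptr_t mnt, uintptr_t dentry)
{
	unsigned long tmp = (unsigned long)mnt / SKETCH_L1_CACHE_BYTES;

	tmp += (unsigned long)dentry / SKETCH_L1_CACHE_BYTES;
	tmp = tmp + (tmp >> SKETCH_HASH_SHIFT);
	return tmp & (SKETCH_HASH_SIZE - 1);
}
```

Because of the final mask, the result is always a valid index into a table of `SKETCH_HASH_SIZE` buckets, whatever the input addresses are.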
2008-02-15 14:37:59 -08:00
#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
2010-08-18 04:37:39 +10:00
/*
 * allocation is serialized by namespace_sem, but we need the spinlock to
 * serialize with freeing.
 */
2008-03-26 22:11:34 +01:00
static int mnt_alloc_id(struct vfsmount *mnt)
{
	int res;
retry:
	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
2010-08-18 04:37:39 +10:00
	spin_lock(&mnt_id_lock);
2009-06-24 03:12:00 -04:00
	res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
	if (!res)
		mnt_id_start = mnt->mnt_id + 1;
2010-08-18 04:37:39 +10:00
	spin_unlock(&mnt_id_lock);
2008-03-26 22:11:34 +01:00
	if (res == -EAGAIN)
		goto retry;
	return res;
}
static void mnt_free_id(struct vfsmount *mnt)
{
2009-06-24 03:12:00 -04:00
	int id = mnt->mnt_id;
2010-08-18 04:37:39 +10:00
	spin_lock(&mnt_id_lock);
2009-06-24 03:12:00 -04:00
	ida_remove(&mnt_id_ida, id);
	if (mnt_id_start > id)
		mnt_id_start = id;
2010-08-18 04:37:39 +10:00
	spin_unlock(&mnt_id_lock);
2008-03-26 22:11:34 +01:00
}
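
The mnt_id_start bookkeeping in mnt_alloc_id() and mnt_free_id() acts as a "lowest possibly free ID" hint: allocation searches upward from the hint and then moves it past the new ID, while freeing pulls it back down if the released ID is lower. A minimal userspace sketch of that idea follows. It assumes a fixed-size boolean array in place of the kernel IDA, omits all locking, and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_IDS 64

static bool sketch_used[SKETCH_MAX_IDS];	/* stand-in for the IDA */
static int sketch_id_start;			/* lowest ID that might be free */

/* Return the lowest free ID at or above the hint, or -1 if none is left. */
static int sketch_alloc_id(void)
{
	for (int id = sketch_id_start; id < SKETCH_MAX_IDS; id++) {
		if (!sketch_used[id]) {
			sketch_used[id] = true;
			sketch_id_start = id + 1;	/* next search starts past it */
			return id;
		}
	}
	return -1;
}

/* Release an ID and lower the hint so the ID is found again first. */
static void sketch_free_id(int id)
{
	assert(id >= 0 && id < SKETCH_MAX_IDS && sketch_used[id]);
	sketch_used[id] = false;
	if (sketch_id_start > id)
		sketch_id_start = id;
}
```

The effect is that freed IDs are reused lowest-first, without rescanning from zero on every allocation.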
2008-03-27 13:06:23 +01:00
/*
 * Allocate a new peer group ID
 *
 * mnt_group_ida is protected by namespace_sem
 */
static int mnt_alloc_group_id(struct vfsmount *mnt)
{
2009-06-24 03:12:00 -04:00
	int res;
2008-03-27 13:06:23 +01:00
	if (!ida_pre_get(&mnt_group_ida, GFP_KERNEL))
		return -ENOMEM;
2009-06-24 03:12:00 -04:00
	res = ida_get_new_above(&mnt_group_ida,
				mnt_group_start,
				&mnt->mnt_group_id);
	if (!res)
		mnt_group_start = mnt->mnt_group_id + 1;
	return res;
2008-03-27 13:06:23 +01:00
}
/*
 * Release a peer group ID
 */
void mnt_release_group_id(struct vfsmount *mnt)
{
2009-06-24 03:12:00 -04:00
	int id = mnt->mnt_group_id;
	ida_remove(&mnt_group_ida, id);
	if (mnt_group_start > id)
		mnt_group_start = id;
2008-03-27 13:06:23 +01:00
	mnt->mnt_group_id = 0;
}
fs: scale mntget/mntput

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and those lookups often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only take the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However, code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
/*
 * vfsmount lock must be held for read
 */
static inline void mnt_add_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
	this_cpu_add(mnt->mnt_pcp->mnt_count, n);
#else
	preempt_disable();
	mnt->mnt_count += n;
	preempt_enable();
#endif
}
static inline void mnt_set_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
	this_cpu_write(mnt->mnt_pcp->mnt_count, n);
#else
	mnt->mnt_count = n;
#endif
}
/*
 * vfsmount lock must be held for read
 */
static inline void mnt_inc_count(struct vfsmount *mnt)
{
	mnt_add_count(mnt, 1);
}
/*
 * vfsmount lock must be held for read
 */
static inline void mnt_dec_count(struct vfsmount *mnt)
{
	mnt_add_count(mnt, -1);
}
/*
 * vfsmount lock must be held for write
 */
unsigned int mnt_get_count(struct vfsmount *mnt)
{
#ifdef CONFIG_SMP
2011-01-14 22:30:21 -05:00
	unsigned int count = 0;
2011-01-07 17:50:11 +11:00
	int cpu;
	for_each_possible_cpu(cpu) {
		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
	}
	return count;
#else
	return mnt->mnt_count;
#endif
}
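
The split between mnt_add_count() and mnt_get_count() is the "local difference of increments and decrements" scheme from the commit message: each CPU only ever touches its own slot, so a single slot may go negative, and only the sum over all slots is meaningful. A minimal userspace sketch follows, assuming a plain array indexed by a caller-supplied CPU number instead of real per-CPU storage, and with no locking. `SKETCH_NR_CPUS` and the `sketch_` names are hypothetical.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

struct sketch_mnt {
	int pcp_count[SKETCH_NR_CPUS];	/* stands in for mnt_pcp->mnt_count */
};

/* Fast path: adjust only the slot that belongs to the calling CPU. */
static void sketch_add_count(struct sketch_mnt *m, int cpu, int n)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	m->pcp_count[cpu] += n;
}

/* Slow path: the true reference count is the sum over every slot. */
static unsigned int sketch_get_count(const struct sketch_mnt *m)
{
	unsigned int count = 0;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		count += m->pcp_count[cpu];
	return count;
}
```

A reference taken on one CPU and dropped on another leaves +1 in the first slot and -1 in the second, which cancel in the sum. This is why the kernel code requires the write lock before summing: no slot may change while the total is being computed.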
2011-03-17 22:08:28 -04:00
static struct vfsmount *alloc_vfsmnt(const char *name)
2005-04-16 15:20:36 -07:00
{
2007-02-10 01:45:03 -08:00
	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
2005-04-16 15:20:36 -07:00
	if (mnt) {
2008-03-26 22:11:34 +01:00
		int err;
		err = mnt_alloc_id(mnt);
2008-07-21 18:06:36 +08:00
		if (err)
			goto out_free_cache;
		if (name) {
			mnt->mnt_devname = kstrdup(name, GFP_KERNEL);
			if (!mnt->mnt_devname)
				goto out_free_id;
2008-03-26 22:11:34 +01:00
		}
2011-01-07 17:50:11 +11:00
#ifdef CONFIG_SMP
		mnt->mnt_pcp = alloc_percpu(struct mnt_pcp);
		if (!mnt->mnt_pcp)
			goto out_free_devname;
2011-01-14 22:30:21 -05:00
		this_cpu_add(mnt->mnt_pcp->mnt_count, 1);
2011-01-07 17:50:11 +11:00
#else
		mnt->mnt_count = 1;
		mnt->mnt_writers = 0;
#endif
2005-04-16 15:20:36 -07:00
		INIT_LIST_HEAD(&mnt->mnt_hash);
		INIT_LIST_HEAD(&mnt->mnt_child);
		INIT_LIST_HEAD(&mnt->mnt_mounts);
		INIT_LIST_HEAD(&mnt->mnt_list);
2005-07-07 17:57:30 -07:00
		INIT_LIST_HEAD(&mnt->mnt_expire);
2005-11-07 17:19:33 -05:00
		INIT_LIST_HEAD(&mnt->mnt_share);
2005-11-07 17:20:48 -05:00
		INIT_LIST_HEAD(&mnt->mnt_slave_list);
		INIT_LIST_HEAD(&mnt->mnt_slave);
2009-12-17 21:24:27 -05:00
#ifdef CONFIG_FSNOTIFY
		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
2009-04-26 20:25:54 +10:00
#endif
2005-04-16 15:20:36 -07:00
	}
	return mnt;
2008-07-21 18:06:36 +08:00
2009-04-26 20:25:54 +10:00
#ifdef CONFIG_SMP
out_free_devname:
	kfree(mnt->mnt_devname);
#endif
2008-07-21 18:06:36 +08:00
out_free_id:
	mnt_free_id(mnt);
out_free_cache:
	kmem_cache_free(mnt_cache, mnt);
	return NULL;
2005-04-16 15:20:36 -07:00
}
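
alloc_vfsmnt() uses the common kernel idiom of stacked goto labels for error unwinding: each failure jumps to the label that releases everything acquired so far, and the labels fall through in reverse order of acquisition. A minimal userspace sketch of the same shape follows, assuming malloc/free in place of the slab and per-CPU allocators. The `fail_pcp` parameter is a hypothetical switch that forces the last allocation to fail so the unwind path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sketch_mnt {
	int id;
	char *devname;
	int *pcp;		/* stands in for the per-CPU counter block */
};

static struct sketch_mnt *sketch_alloc(const char *name, int fail_pcp)
{
	struct sketch_mnt *mnt = calloc(1, sizeof(*mnt));

	if (!mnt)
		return NULL;
	mnt->id = 1;				/* pretend an ID was reserved */
	if (name) {
		mnt->devname = malloc(strlen(name) + 1);
		if (!mnt->devname)
			goto out_free_id;
		strcpy(mnt->devname, name);
	}
	mnt->pcp = fail_pcp ? NULL : calloc(1, sizeof(*mnt->pcp));
	if (!mnt->pcp)
		goto out_free_devname;
	*mnt->pcp = 1;				/* initial reference */
	return mnt;

out_free_devname:
	free(mnt->devname);			/* free(NULL) is a no-op */
out_free_id:
	mnt->id = 0;				/* pretend the ID was released */
	free(mnt);
	return NULL;
}

static void sketch_free(struct sketch_mnt *mnt)
{
	free(mnt->pcp);
	free(mnt->devname);
	free(mnt);
}
```

Keeping the cleanup in one place at the end of the function means a new acquisition step only needs one new label, rather than a copy of the cleanup at every failure site.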
2008-02-15 14:37:59 -08:00
/*
 * Most r/o checks on a fs are for operations that take
 * discrete amounts of time, like a write() or unlink().
 * We must keep track of when those operations start
 * (for permission checks) and when they end, so that
 * we can determine when writes are able to occur to
 * a filesystem.
 */
/*
 * __mnt_is_readonly: check whether a mount is read-only
 * @mnt: the mount to check for its write status
 *
 * This shouldn't be used directly outside of the VFS.
 * It does not guarantee that the filesystem will stay
 * r/w, just that it is right *now*.  This can not and
 * should not be used in place of IS_RDONLY(inode).
 * mnt_want/drop_write() will _keep_ the filesystem
 * r/w.
 */
int __mnt_is_readonly(struct vfsmount *mnt)
{
2008-02-15 14:38:00 -08:00
	if (mnt->mnt_flags & MNT_READONLY)
		return 1;
	if (mnt->mnt_sb->s_flags & MS_RDONLY)
		return 1;
	return 0;
2008-02-15 14:37:59 -08:00
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);
2011-01-07 17:50:10 +11:00
static inline void mnt_inc_writers(struct vfsmount *mnt)
2009-04-26 20:25:54 +10:00
{
#ifdef CONFIG_SMP
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and those lookups often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only take the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
	this_cpu_inc(mnt->mnt_pcp->mnt_writers);
2009-04-26 20:25:54 +10:00
#else
	mnt->mnt_writers++;
#endif
}
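The third scheme in the "fs: scale mntget/mntput" message (per-CPU differences plus one shared "long" refcount) can be sketched in userspace roughly as follows. This is only an illustrative model, not the kernel implementation: `NR_SLOTS`, `struct demo_ref` and every `demo_*` function are hypothetical names, the array slots stand in for CPUs, and there is no locking or preemption handling at all.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* Hypothetical model of an object with short and long references. */
struct demo_ref {
	int local[NR_SLOTS];	/* per-slot difference of gets and puts */
	int longrefs;		/* single shared count of long-lived refs */
};

/* Sum the per-slot differences; this is the expensive global step. */
static int demo_sum(const struct demo_ref *r)
{
	int sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += r->local[i];
	return sum;
}

/* Short reference: only touches the caller's own slot. */
static void demo_get(struct demo_ref *r, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	r->local[cpu]++;
}

/* Returns true when no reference of either kind remains. */
static bool demo_put(struct demo_ref *r, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	r->local[cpu]--;
	if (r->longrefs > 0)
		return false;	/* fast path: skip the global sum */
	return demo_sum(r) == 0;
}

/* Long reference: counted in the single shared integer. */
static void demo_get_long(struct demo_ref *r)
{
	r->longrefs++;
}

/* Returns true when no reference of either kind remains. */
static bool demo_put_long(struct demo_ref *r)
{
	assert(r->longrefs > 0);
	r->longrefs--;
	return r->longrefs == 0 && demo_sum(r) == 0;
}
```

In this model a short reference may be dropped on a different slot than the one that took it; individual slots can then go negative, but the sum over all slots stays correct, and that sum is only computed once the long count has reached zero.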
2008-02-15 14:37:59 -08:00
2011-01-07 17:50:10 +11:00
static inline void mnt_dec_writers(struct vfsmount *mnt)
2008-02-15 14:37:59 -08:00
{
2009-04-26 20:25:54 +10:00
#ifdef CONFIG_SMP
2011-01-07 17:50:11 +11:00
	this_cpu_dec(mnt->mnt_pcp->mnt_writers);
2009-04-26 20:25:54 +10:00
#else
	mnt->mnt_writers--;
#endif
2008-02-15 14:37:59 -08:00
}
2011-01-07 17:50:10 +11:00
static unsigned int mnt_get_writers(struct vfsmount *mnt)
2008-02-15 14:37:59 -08:00
{
2009-04-26 20:25:54 +10:00
#ifdef CONFIG_SMP
	unsigned int count = 0;
2008-02-15 14:37:59 -08:00
	int cpu;
	for_each_possible_cpu(cpu) {
2011-01-07 17:50:11 +11:00
		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
2008-02-15 14:37:59 -08:00
}
2009-04-26 20:25:54 +10:00
	return count;
#else
	return mnt->mnt_writers;
#endif
2008-02-15 14:37:59 -08:00
}
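The three helpers above can be modelled in userspace to show why a writer count taken on one CPU and dropped on another still sums to the right total. This is a minimal sketch under stated assumptions: `NR_SLOTS`, `struct demo_mount` and the `demo_*` functions are hypothetical stand-ins for the per-CPU data and the `mnt_*_writers()` helpers, and the model is single-threaded with no barriers or preemption control.

```c
#include <assert.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the possible CPUs */

/* Hypothetical model of a mount with one writer count per CPU slot. */
struct demo_mount {
	int writers[NR_SLOTS];
};

/* Take a write on the given slot; only that slot is touched. */
static void demo_inc_writers(struct demo_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	m->writers[cpu]++;
}

/* Drop a write on the given slot, which may differ from the taker's. */
static void demo_dec_writers(struct demo_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	m->writers[cpu]--;
}

/*
 * Sum every slot.  A slot can be negative after a cross-slot drop;
 * unsigned wraparound makes the total come out right regardless.
 */
static unsigned int demo_get_writers(const struct demo_mount *m)
{
	unsigned int count = 0;

	for (int cpu = 0; cpu < NR_SLOTS; cpu++)
		count += m->writers[cpu];
	return count;
}
```

For example, a write taken on slot 0 and released on slot 2 leaves the slots at 1 and -1, and `demo_get_writers()` still reports zero outstanding writers.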
2008-02-15 14:37:30 -08:00
/*
 * Most r/o checks on a fs are for operations that take
 * discrete amounts of time, like a write() or unlink().
 * We must keep track of when those operations start
 * (for permission checks) and when they end, so that
 * we can determine when writes are able to occur to
 * a filesystem.
 */
/**
 * mnt_want_write - get write access to a mount
 * @mnt: the mount on which to take a write
 *
 * This tells the low-level filesystem that a write is
 * about to be performed to it, and makes sure that
 * writes are allowed before returning success.  When
 * the write operation is finished, mnt_drop_write()
 * must be called.  This is effectively a refcount.
 */
int mnt_want_write(struct vfsmount *mnt)
{
2008-02-15 14:37:59 -08:00
	int ret = 0;
2009-04-26 20:25:54 +10:00
	preempt_disable();
2011-01-07 17:50:10 +11:00
	mnt_inc_writers(mnt);
2009-04-26 20:25:54 +10:00
	/*
2011-01-07 17:50:10 +11:00
	 * The store to mnt_inc_writers must be visible before we pass
2009-04-26 20:25:54 +10:00
	 * MNT_WRITE_HOLD loop below, so that the slowpath can see our
	 * incremented count after it has set MNT_WRITE_HOLD.
	 */
	smp_mb();
	while (mnt->mnt_flags & MNT_WRITE_HOLD)
		cpu_relax();
	/*
	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
	 * be set to match its requirements. So we must not load that until
	 * MNT_WRITE_HOLD is cleared.
	 */
	smp_rmb();
2008-02-15 14:37:59 -08:00
	if (__mnt_is_readonly(mnt)) {
2011-01-07 17:50:10 +11:00
		mnt_dec_writers(mnt);
2008-02-15 14:37:59 -08:00
		ret = -EROFS;
		goto out;
	}
out:
2009-04-26 20:25:54 +10:00
	preempt_enable();
2008-02-15 14:37:59 -08:00
	return ret;
2008-02-15 14:37:30 -08:00
}
EXPORT_SYMBOL_GPL(mnt_want_write);
2009-04-26 20:25:55 +10:00
/**
 * mnt_clone_write - get write access to a mount
 * @mnt: the mount on which to take a write
 *
 * This is effectively like mnt_want_write, except
 * it must only be used to take an extra write reference
 * on a mountpoint that we already know has a write reference
 * on it. This allows some optimisation.
 *
 * Once finished, mnt_drop_write must be called as usual to
 * drop the reference.
 */
int mnt_clone_write(struct vfsmount *mnt)
{
	/* superblock may be r/o */
	if (__mnt_is_readonly(mnt))
		return -EROFS;
	preempt_disable();
2011-01-07 17:50:10 +11:00
	mnt_inc_writers(mnt);
2009-04-26 20:25:55 +10:00
	preempt_enable();
	return 0;
}
EXPORT_SYMBOL_GPL(mnt_clone_write);
/**
 * mnt_want_write_file - get write access to a file's mount
 * @file: the file on whose mount to take a write
 *
 * This is like mnt_want_write, but it takes a file and can
 * do some optimisations if the file is already open for write.
 */
int mnt_want_write_file(struct file *file)
{
2009-08-06 15:07:39 -07:00
	struct inode *inode = file->f_dentry->d_inode;
	if (!(file->f_mode & FMODE_WRITE) || special_file(inode->i_mode))
2009-04-26 20:25:55 +10:00
		return mnt_want_write(file->f_path.mnt);
	else
		return mnt_clone_write(file->f_path.mnt);
}
EXPORT_SYMBOL_GPL(mnt_want_write_file);
2008-02-15 14:37:30 -08:00
/**
 * mnt_drop_write - give up write access to a mount
 * @mnt: the mount on which to give up write access
 *
 * Tells the low-level filesystem that we are done
 * performing writes to it. Must be matched with
 * mnt_want_write() call above.
 */
void mnt_drop_write(struct vfsmount *mnt)
{
2009-04-26 20:25:54 +10:00
	preempt_disable();
2011-01-07 17:50:10 +11:00
	mnt_dec_writers(mnt);
2009-04-26 20:25:54 +10:00
	preempt_enable();
2008-02-15 14:37:30 -08:00
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
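The want/drop pairing documented above can be illustrated with a small userspace model. This is only a sketch: struct toy_mount and the toy_* helpers are hypothetical stand-ins invented for illustration, not kernel types or functions, and the per-CPU counters, preemption control and MNT_WRITE_HOLD handling of the real code are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct vfsmount; not the kernel type. */
struct toy_mount {
	bool readonly;	/* models MNT_READONLY */
	int writers;	/* models the summed writer count */
};

/* Models mnt_want_write(): refuse on a read-only mount, else count a writer. */
static int toy_want_write(struct toy_mount *m)
{
	if (m->readonly)
		return -EROFS;
	m->writers++;
	return 0;
}

/* Models mnt_drop_write(): must pair with a successful toy_want_write(). */
static void toy_drop_write(struct toy_mount *m)
{
	m->writers--;
}

/* A caller brackets its modification with want/drop. */
static int toy_modify(struct toy_mount *m)
{
	int err = toy_want_write(m);

	if (err)
		return err;
	/* ... the actual write to the filesystem would happen here ... */
	toy_drop_write(m);
	return 0;
}
```

In this model a failed toy_want_write() leaves the count untouched, so the caller must not call toy_drop_write() on the error path.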
2008-02-15 14:38:00 -08:00
static int mnt_make_readonly(struct vfsmount *mnt)
2008-02-15 14:37:30 -08:00
{
2008-02-15 14:37:59 -08:00
	int ret = 0;
2010-08-18 04:37:39 +10:00
	br_write_lock(vfsmount_lock);
2009-04-26 20:25:54 +10:00
	mnt->mnt_flags |= MNT_WRITE_HOLD;
2008-02-15 14:37:59 -08:00
	/*
2009-04-26 20:25:54 +10:00
	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
	 * should be visible before we do.
2008-02-15 14:37:59 -08:00
	 */
2009-04-26 20:25:54 +10:00
	smp_mb();
2008-02-15 14:37:59 -08:00
	/*
2009-04-26 20:25:54 +10:00
	 * With writers on hold, if this value is zero, then there are
	 * definitely no active writers (although held writers may subsequently
	 * increment the count, they'll have to wait, and decrement it after
	 * seeing MNT_READONLY).
	 *
	 * It is OK to have counter incremented on one CPU and decremented on
	 * another: the sum will add up correctly. The danger would be when we
	 * sum up each counter, if we read a counter before it is incremented,
	 * but then read another CPU's count which it has been subsequently
	 * decremented from -- we would see more decrements than we should.
	 * MNT_WRITE_HOLD protects against this scenario, because
	 * mnt_want_write first increments count, then smp_mb, then spins on
	 * MNT_WRITE_HOLD, so it can't be decremented by another CPU while
	 * we're counting up here.
2008-02-15 14:37:59 -08:00
	 */
2011-01-07 17:50:10 +11:00
	if (mnt_get_writers(mnt) > 0)
2009-04-26 20:25:54 +10:00
		ret = -EBUSY;
	else
2008-02-15 14:38:00 -08:00
		mnt->mnt_flags |= MNT_READONLY;
2009-04-26 20:25:54 +10:00
	/*
	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
	 * that become unheld will see MNT_READONLY.
	 */
	smp_wmb();
	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
2010-08-18 04:37:39 +10:00
	br_write_unlock(vfsmount_lock);
2008-02-15 14:37:59 -08:00
	return ret;
2008-02-15 14:37:30 -08:00
}
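The counting argument in the comments of mnt_make_readonly() can be sketched in userspace: a writer may be counted on one CPU and released on another, and only the sum across all slots matters. The names below (toy_pcp_mount, TOY_NR_CPUS and the toy_* helpers) are hypothetical and chosen for illustration; the sketch is single-threaded and deliberately omits vfsmount_lock, the memory barriers and MNT_WRITE_HOLD.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* assumed CPU count for this model */

/* Hypothetical model of a mount with one writer counter per CPU. */
struct toy_pcp_mount {
	int writers[TOY_NR_CPUS];
	bool readonly;
};

/* Count a writer on the slot of the CPU it happens to run on. */
static void toy_inc_writers(struct toy_pcp_mount *m, int cpu)
{
	m->writers[cpu]++;
}

/* Release a writer, possibly on a different CPU than the increment. */
static void toy_dec_writers(struct toy_pcp_mount *m, int cpu)
{
	m->writers[cpu]--;
}

/* Individual slots may go negative; only the total is meaningful. */
static int toy_get_writers(const struct toy_pcp_mount *m)
{
	int sum = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += m->writers[cpu];
	return sum;
}

/* Models the decision in mnt_make_readonly(): busy if any writer remains. */
static int toy_make_readonly(struct toy_pcp_mount *m)
{
	if (toy_get_writers(m) > 0)
		return -EBUSY;
	m->readonly = true;
	return 0;
}
```

Taking a write on slot 0 and dropping it on slot 3 leaves the slots at 1 and -1, yet the sum is zero, so the remount is allowed.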
2008-02-15 14:38:00 -08:00
static void __mnt_unmake_readonly(struct vfsmount *mnt)
{
2010-08-18 04:37:39 +10:00
	br_write_lock(vfsmount_lock);
2008-02-15 14:38:00 -08:00
	mnt->mnt_flags &= ~MNT_READONLY;
2010-08-18 04:37:39 +10:00
	br_write_unlock(vfsmount_lock);
2008-02-15 14:38:00 -08:00
}
2011-03-17 22:08:28 -04:00
static void free_vfsmnt(struct vfsmount *mnt)
2005-04-16 15:20:36 -07:00
{
	kfree(mnt->mnt_devname);
2008-03-26 22:11:34 +01:00
	mnt_free_id(mnt);
2009-04-26 20:25:54 +10:00
#ifdef CONFIG_SMP
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
	free_percpu(mnt->mnt_pcp);
2009-04-26 20:25:54 +10:00
#endif
2005-04-16 15:20:36 -07:00
	kmem_cache_free(mnt_cache, mnt);
}
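The third scheme in the "fs: scale mntget/mntput" message above (per-CPU differences plus a single "long" refcount) can be modelled roughly as follows. This is a simplified, single-threaded sketch under stated assumptions: struct toy_ref, TOY_REF_CPUS and the toy_* functions are hypothetical names, and the real code's locking and exact accounting of long references are not reproduced.

```c
#include <stdbool.h>

#define TOY_REF_CPUS 4	/* assumed CPU count for this model */

/* Hypothetical model: short references are per-CPU deltas, long ones a single counter. */
struct toy_ref {
	int pcp_count[TOY_REF_CPUS];	/* increments minus decrements seen on each CPU */
	int longrefs;			/* slow, long-lived references */
};

/* Short reference: only touches the local CPU's slot. */
static void toy_get(struct toy_ref *r, int cpu)
{
	r->pcp_count[cpu]++;
}

static void toy_get_long(struct toy_ref *r)
{
	r->longrefs++;
}

static int toy_sum(const struct toy_ref *r)
{
	int sum = 0;

	for (int cpu = 0; cpu < TOY_REF_CPUS; cpu++)
		sum += r->pcp_count[cpu];
	return sum;
}

/* Returns true when no reference of either kind remains. */
static bool toy_put(struct toy_ref *r, int cpu)
{
	r->pcp_count[cpu]--;
	if (r->longrefs > 0)
		return false;	/* fast path: the global sum is skipped */
	return toy_sum(r) == 0;
}

/* Dropping the last long reference forces a check of the global sum. */
static bool toy_put_long(struct toy_ref *r)
{
	r->longrefs--;
	return r->longrefs == 0 && toy_sum(r) == 0;
}
```

While a long reference is held, toy_put() never walks the other CPUs' slots, which is the property the commit message relies on for attached mounts.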
/*
2005-11-07 17:20:17 -05:00
 * find the first or last mount at @dentry on vfsmount @mnt depending on
 * @dir. If @dir is set return the first mount else return the last mount.
2010-08-18 04:37:39 +10:00
 * vfsmount_lock must be held for read or write.
2005-04-16 15:20:36 -07:00
 */
2005-11-07 17:20:17 -05:00
struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
			      int dir)
2005-04-16 15:20:36 -07:00
{
2005-11-07 17:16:09 -05:00
	struct list_head *head = mount_hashtable + hash(mnt, dentry);
	struct list_head *tmp = head;
2005-04-16 15:20:36 -07:00
	struct vfsmount *p, *found = NULL;
	for (;;) {
2005-11-07 17:20:17 -05:00
		tmp = dir ? tmp->next : tmp->prev;
2005-04-16 15:20:36 -07:00
		p = NULL;
		if (tmp == head)
			break;
		p = list_entry(tmp, struct vfsmount, mnt_hash);
		if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry) {
2005-11-07 17:20:17 -05:00
			found = p;
2005-04-16 15:20:36 -07:00
			break;
		}
	}
	return found;
}
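The direction-dependent scan in __lookup_mnt() can be tried out in userspace with a hand-rolled circular list. This is a sketch only: struct toy_node and the toy_* helpers are hypothetical, integers stand in for the mnt_parent and mnt_mountpoint pointers, and no hashing or locking is modelled.

```c
#include <stddef.h>

/* Hypothetical node on a circular doubly linked list with a sentinel head. */
struct toy_node {
	struct toy_node *next, *prev;
	int parent;	/* stands in for mnt_parent */
	int mountpoint;	/* stands in for mnt_mountpoint */
};

static void toy_list_init(struct toy_node *head)
{
	head->next = head->prev = head;
}

/* Append n just before the sentinel, i.e. at the tail. */
static void toy_list_add_tail(struct toy_node *n, struct toy_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* dir != 0 walks forward (first match); dir == 0 walks backward (last match). */
static struct toy_node *toy_lookup(struct toy_node *head, int parent,
				   int mountpoint, int dir)
{
	struct toy_node *tmp = head;

	for (;;) {
		tmp = dir ? tmp->next : tmp->prev;
		if (tmp == head)
			return NULL;	/* wrapped around: no match */
		if (tmp->parent == parent && tmp->mountpoint == mountpoint)
			return tmp;
	}
}
```

With two entries sharing the same parent and mountpoint, a forward walk returns the one added first and a backward walk the one added last, matching the "first or last mount" wording of the comment.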
2005-11-07 17:20:17 -05:00
/*
 * lookup_mnt increments the ref count before returning
 * the vfsmount struct.
 */
2009-04-18 14:06:57 -04:00
struct vfsmount *lookup_mnt(struct path *path)
2005-11-07 17:20:17 -05:00
{
	struct vfsmount *child_mnt;
2010-08-18 04:37:39 +10:00
br_read_lock ( vfsmount_lock ) ;
2009-04-18 14:06:57 -04:00
if ( ( child_mnt = __lookup_mnt ( path -> mnt , path -> dentry , 1 ) ) )
2005-11-07 17:20:17 -05:00
mntget ( child_mnt ) ;
2010-08-18 04:37:39 +10:00
br_read_unlock ( vfsmount_lock ) ;
2005-11-07 17:20:17 -05:00
return child_mnt ;
}
2005-04-16 15:20:36 -07:00
static inline int check_mnt ( struct vfsmount * mnt )
{
2006-12-08 02:37:56 -08:00
return mnt -> mnt_ns == current -> nsproxy -> mnt_ns ;
2005-04-16 15:20:36 -07:00
}
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
*/
2006-12-08 02:37:56 -08:00
static void touch_mnt_namespace ( struct mnt_namespace * ns )
2005-11-07 17:15:49 -05:00
{
if ( ns ) {
ns -> event = ++ event ;
wake_up_interruptible ( & ns -> poll ) ;
}
}
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
*/
2006-12-08 02:37:56 -08:00
static void __touch_mnt_namespace ( struct mnt_namespace * ns )
2005-11-07 17:15:49 -05:00
{
if ( ns && ns -> event != event ) {
ns -> event = event ;
wake_up_interruptible ( & ns -> poll ) ;
}
}
2011-01-07 17:49:54 +11:00
/*
* Clear dentry's mounted state if it has no remaining mounts .
* vfsmount_lock must be held for write .
*/
static void dentry_reset_mounted ( struct vfsmount * mnt , struct dentry * dentry )
{
unsigned u ;
for ( u = 0 ; u < HASH_SIZE ; u ++ ) {
struct vfsmount * p ;
list_for_each_entry ( p , & mount_hashtable [ u ] , mnt_hash ) {
if ( p -> mnt_mountpoint == dentry )
return ;
}
}
spin_lock ( & dentry -> d_lock ) ;
dentry -> d_flags &= ~ DCACHE_MOUNTED ;
spin_unlock ( & dentry -> d_lock ) ;
}
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
*/
2008-03-21 20:48:19 -04:00
static void detach_mnt ( struct vfsmount * mnt , struct path * old_path )
2005-04-16 15:20:36 -07:00
{
2008-03-21 20:48:19 -04:00
old_path -> dentry = mnt -> mnt_mountpoint ;
old_path -> mnt = mnt -> mnt_parent ;
2005-04-16 15:20:36 -07:00
mnt -> mnt_parent = mnt ;
mnt -> mnt_mountpoint = mnt -> mnt_root ;
list_del_init ( & mnt -> mnt_child ) ;
list_del_init ( & mnt -> mnt_hash ) ;
2011-01-07 17:49:54 +11:00
dentry_reset_mounted ( old_path -> mnt , old_path -> dentry ) ;
2005-04-16 15:20:36 -07:00
}
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
*/
2005-11-07 17:19:50 -05:00
void mnt_set_mountpoint ( struct vfsmount * mnt , struct dentry * dentry ,
struct vfsmount * child_mnt )
{
child_mnt -> mnt_parent = mntget ( mnt ) ;
child_mnt -> mnt_mountpoint = dget ( dentry ) ;
2011-01-07 17:49:54 +11:00
spin_lock ( & dentry -> d_lock ) ;
dentry -> d_flags |= DCACHE_MOUNTED ;
spin_unlock ( & dentry -> d_lock ) ;
2005-11-07 17:19:50 -05:00
}
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
*/
2008-03-21 20:48:19 -04:00
static void attach_mnt ( struct vfsmount * mnt , struct path * path )
2005-04-16 15:20:36 -07:00
{
2008-03-21 20:48:19 -04:00
mnt_set_mountpoint ( path -> mnt , path -> dentry , mnt ) ;
2005-11-07 17:19:50 -05:00
list_add_tail ( & mnt -> mnt_hash , mount_hashtable +
2008-03-21 20:48:19 -04:00
hash ( path -> mnt , path -> dentry ) ) ;
list_add_tail ( & mnt -> mnt_child , & path -> mnt -> mnt_mounts ) ;
2005-11-07 17:19:50 -05:00
}
2011-01-16 16:32:11 -05:00
static inline void __mnt_make_longterm ( struct vfsmount * mnt )
{
# ifdef CONFIG_SMP
atomic_inc ( & mnt -> mnt_longterm ) ;
# endif
}
/* needs vfsmount lock for write */
static inline void __mnt_make_shortterm ( struct vfsmount * mnt )
{
# ifdef CONFIG_SMP
atomic_dec ( & mnt -> mnt_longterm ) ;
# endif
}
2005-11-07 17:19:50 -05:00
/*
2010-08-18 04:37:39 +10:00
* vfsmount lock must be held for write
2005-11-07 17:19:50 -05:00
*/
static void commit_tree ( struct vfsmount * mnt )
{
struct vfsmount * parent = mnt -> mnt_parent ;
struct vfsmount * m ;
LIST_HEAD ( head ) ;
2006-12-08 02:37:56 -08:00
struct mnt_namespace * n = parent -> mnt_ns ;
2005-11-07 17:19:50 -05:00
BUG_ON ( parent == mnt ) ;
list_add_tail ( & head , & mnt -> mnt_list ) ;
2011-01-14 22:30:21 -05:00
list_for_each_entry ( m , & head , mnt_list ) {
2006-12-08 02:37:56 -08:00
m -> mnt_ns = n ;
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( m ) ;
2011-01-14 22:30:21 -05:00
}
2005-11-07 17:19:50 -05:00
list_splice ( & head , n -> list . prev ) ;
list_add_tail ( & mnt -> mnt_hash , mount_hashtable +
hash ( parent , mnt -> mnt_mountpoint ) ) ;
list_add_tail ( & mnt -> mnt_child , & parent -> mnt_mounts ) ;
2006-12-08 02:37:56 -08:00
touch_mnt_namespace ( n ) ;
2005-04-16 15:20:36 -07:00
}
static struct vfsmount * next_mnt ( struct vfsmount * p , struct vfsmount * root )
{
struct list_head * next = p -> mnt_mounts . next ;
if ( next == & p -> mnt_mounts ) {
while ( 1 ) {
if ( p == root )
return NULL ;
next = p -> mnt_child . next ;
if ( next != & p -> mnt_parent -> mnt_mounts )
break ;
p = p -> mnt_parent ;
}
}
return list_entry ( next , struct vfsmount , mnt_child ) ;
}
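The traversal order produced by `next_mnt` is a pre-order walk of the mount tree: descend to the first child if there is one, otherwise climb until a next sibling exists, and stop on reaching the subtree root. A minimal userspace sketch of the same idea follows. It assumes explicit `first_child`, `next_sibling`, and `parent` pointers in place of the kernel's `mnt_mounts` and `mnt_child` lists; `struct node` and `next_node` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; the kernel links children through list_heads. */
struct node {
    struct node *parent;
    struct node *first_child;
    struct node *next_sibling;
};

/*
 * Return the node that follows p in a pre-order walk of the subtree
 * rooted at root, or NULL when p was the last node of that subtree.
 */
static struct node *next_node(struct node *p, struct node *root)
{
    assert(p != NULL && root != NULL);

    /* Descend first: a child always precedes any sibling. */
    if (p->first_child)
        return p->first_child;

    /* No children: climb until some ancestor has a sibling left. */
    while (p != root) {
        if (p->next_sibling)
            return p->next_sibling;
        p = p->parent;
    }
    return NULL;
}
```

A caller would typically loop with `for (n = root; n; n = next_node(n, root))`, which visits each node of the subtree exactly once without recursion or an explicit stack.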
2005-11-07 17:21:20 -05:00
static struct vfsmount * skip_mnt_tree ( struct vfsmount * p )
{
struct list_head * prev = p -> mnt_mounts . prev ;
while ( prev != & p -> mnt_mounts ) {
p = list_entry ( prev , struct vfsmount , mnt_child ) ;
prev = p -> mnt_mounts . prev ;
}
return p ;
}
2011-03-17 22:08:28 -04:00
struct vfsmount *
vfs_kern_mount ( struct file_system_type * type , int flags , const char * name , void * data )
{
struct vfsmount * mnt ;
struct dentry * root ;
if ( ! type )
return ERR_PTR ( - ENODEV ) ;
mnt = alloc_vfsmnt ( name ) ;
if ( ! mnt )
return ERR_PTR ( - ENOMEM ) ;
if ( flags & MS_KERNMOUNT )
mnt -> mnt_flags = MNT_INTERNAL ;
root = mount_fs ( type , flags , name , data ) ;
if ( IS_ERR ( root ) ) {
free_vfsmnt ( mnt ) ;
return ERR_CAST ( root ) ;
}
mnt -> mnt_root = root ;
mnt -> mnt_sb = root -> d_sb ;
mnt -> mnt_mountpoint = mnt -> mnt_root ;
mnt -> mnt_parent = mnt ;
return mnt ;
}
EXPORT_SYMBOL_GPL ( vfs_kern_mount ) ;
2005-11-07 17:17:22 -05:00
static struct vfsmount * clone_mnt ( struct vfsmount * old , struct dentry * root ,
int flag )
2005-04-16 15:20:36 -07:00
{
struct super_block * sb = old -> mnt_sb ;
struct vfsmount * mnt = alloc_vfsmnt ( old -> mnt_devname ) ;
if ( mnt ) {
2008-03-27 13:06:23 +01:00
if ( flag & ( CL_SLAVE | CL_PRIVATE ) )
mnt -> mnt_group_id = 0 ; /* not a peer of original */
else
mnt -> mnt_group_id = old -> mnt_group_id ;
if ( ( flag & CL_MAKE_SHARED ) && ! mnt -> mnt_group_id ) {
int err = mnt_alloc_group_id ( mnt ) ;
if ( err )
goto out_free ;
}
2010-10-05 12:31:09 +02:00
mnt -> mnt_flags = old -> mnt_flags & ~ MNT_WRITE_HOLD ;
2005-04-16 15:20:36 -07:00
atomic_inc ( & sb -> s_active ) ;
mnt -> mnt_sb = sb ;
mnt -> mnt_root = dget ( root ) ;
mnt -> mnt_mountpoint = mnt -> mnt_root ;
mnt -> mnt_parent = mnt ;
2005-11-07 17:19:50 -05:00
2005-11-07 17:21:01 -05:00
if ( flag & CL_SLAVE ) {
list_add ( & mnt -> mnt_slave , & old -> mnt_slave_list ) ;
mnt -> mnt_master = old ;
CLEAR_MNT_SHARED ( mnt ) ;
2007-06-07 12:20:32 -04:00
} else if ( ! ( flag & CL_PRIVATE ) ) {
2010-01-16 13:28:47 -05:00
if ( ( flag & CL_MAKE_SHARED ) || IS_MNT_SHARED ( old ) )
2005-11-07 17:21:01 -05:00
list_add ( & mnt -> mnt_share , & old -> mnt_share ) ;
if ( IS_MNT_SLAVE ( old ) )
list_add ( & mnt -> mnt_slave , & old -> mnt_slave ) ;
mnt -> mnt_master = old -> mnt_master ;
}
2005-11-07 17:19:50 -05:00
if ( flag & CL_MAKE_SHARED )
set_mnt_shared ( mnt ) ;
2005-04-16 15:20:36 -07:00
/* stick the duplicate mount on the same expiry list
* as the original if that was on one */
2005-11-07 17:17:22 -05:00
if ( flag & CL_EXPIRE ) {
if ( ! list_empty ( & old -> mnt_expire ) )
list_add ( & mnt -> mnt_expire , & old -> mnt_expire ) ;
}
2005-04-16 15:20:36 -07:00
}
return mnt ;
2008-03-27 13:06:23 +01:00
out_free :
free_vfsmnt ( mnt ) ;
return NULL ;
2005-04-16 15:20:36 -07:00
}
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and these lookups often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only take the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
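The third scheme in the commit message above (per-CPU differences plus a single "long" refcount) can be sketched in userspace as follows. This is a simplified, single-threaded model under stated assumptions, not the kernel implementation: `NR_SLOTS`, `struct mount_ref`, `ref_get`, `ref_put`, `ref_sum`, and the two `ref_make_*` helpers are hypothetical names, an array index stands in for the current CPU, and no locking or memory barriers are shown. Long references are counted only in `longterm` here, which may differ from how the kernel accounts for them.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS 4  /* hypothetical CPU count for this sketch */

struct mount_ref {
    int delta[NR_SLOTS]; /* per-CPU increments minus decrements */
    int longterm;        /* number of slow, long-lasting references */
};

/* Take a short reference: only the local slot is touched. */
static void ref_get(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    m->delta[cpu]++;
}

/* Global sum of the per-CPU differences; single slots may be negative. */
static int ref_sum(const struct mount_ref *m)
{
    int sum = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        sum += m->delta[i];
    return sum;
}

/* A long reference pins the object and keeps puts on the fast path. */
static void ref_make_longterm(struct mount_ref *m)
{
    m->longterm++;
}

static void ref_make_shortterm(struct mount_ref *m)
{
    assert(m->longterm > 0);
    m->longterm--;
}

/*
 * Drop a short reference, possibly on a different CPU than the get.
 * Returns true when no reference of either kind remains.
 */
static bool ref_put(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    m->delta[cpu]--;

    /* Fast path: while a long reference exists, skip the global sum. */
    if (m->longterm > 0)
        return false;

    /* Slow path: only now is summing every slot worth the cost. */
    return ref_sum(m) == 0;
}
```

The point of the design is that the expensive sum over all slots is attempted only when `longterm` is zero, which for an attached mount is never the case during ordinary path walking.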
static inline void mntfree ( struct vfsmount * mnt )
2005-04-16 15:20:36 -07:00
{
struct super_block * sb = mnt -> mnt_sb ;
2011-01-07 17:50:11 +11:00
2008-02-15 14:37:59 -08:00
/*
* This probably indicates that somebody messed
* up a mnt_want/drop_write ( ) pair . If this
* happens , the filesystem was probably unable
* to make r/w -> r/o transitions .
*/
2009-04-26 20:25:54 +10:00
/*
2011-01-07 17:50:11 +11:00
* The locking used to deal with mnt_count decrement provides barriers ,
* so mnt_get_writers ( ) below is safe .
2009-04-26 20:25:54 +10:00
*/
2011-01-07 17:50:10 +11:00
WARN_ON ( mnt_get_writers ( mnt ) ) ;
2009-12-17 21:24:27 -05:00
fsnotify_vfsmount_delete ( mnt ) ;
2005-04-16 15:20:36 -07:00
dput ( mnt -> mnt_root ) ;
free_vfsmnt ( mnt ) ;
deactivate_super ( sb ) ;
}
2011-01-14 22:30:21 -05:00
static void mntput_no_expire ( struct vfsmount * mnt )
2011-01-07 17:50:11 +11:00
{
put_again :
2011-01-14 22:30:21 -05:00
# ifdef CONFIG_SMP
br_read_lock ( vfsmount_lock ) ;
if ( likely ( atomic_read ( & mnt -> mnt_longterm ) ) ) {
mnt_dec_count ( mnt ) ;
2011-01-07 17:50:11 +11:00
br_read_unlock ( vfsmount_lock ) ;
2011-01-14 22:30:21 -05:00
return ;
2011-01-07 17:50:11 +11:00
}
2011-01-14 22:30:21 -05:00
br_read_unlock ( vfsmount_lock ) ;
2011-01-07 17:50:11 +11:00
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2011-01-14 22:30:21 -05:00
mnt_dec_count ( mnt ) ;
2011-01-07 17:50:11 +11:00
if ( mnt_get_count ( mnt ) ) {
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
return ;
}
2011-01-07 17:50:11 +11:00
# else
mnt_dec_count ( mnt ) ;
if ( likely ( mnt_get_count ( mnt ) ) )
2010-08-18 04:37:39 +10:00
return ;
2011-01-07 17:50:11 +11:00
br_write_lock ( vfsmount_lock ) ;
2011-01-14 22:30:21 -05:00
# endif
2011-01-07 17:50:11 +11:00
if ( unlikely ( mnt - > mnt_pinned ) ) {
mnt_add_count ( mnt , mnt - > mnt_pinned + 1 ) ;
mnt - > mnt_pinned = 0 ;
br_write_unlock ( vfsmount_lock ) ;
acct_auto_close_mnt ( mnt ) ;
goto put_again ;
2005-11-07 17:13:39 -05:00
}
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2011-01-07 17:50:11 +11:00
mntfree ( mnt ) ;
}
void mntput ( struct vfsmount * mnt )
{
if ( mnt ) {
/* avoid cacheline pingpong, hope gcc doesn't get "smart" */
if ( unlikely ( mnt - > mnt_expiry_mark ) )
mnt - > mnt_expiry_mark = 0 ;
2011-01-14 22:30:21 -05:00
mntput_no_expire ( mnt ) ;
2011-01-07 17:50:11 +11:00
}
}
EXPORT_SYMBOL ( mntput ) ;
struct vfsmount * mntget ( struct vfsmount * mnt )
{
if ( mnt )
mnt_inc_count ( mnt ) ;
return mnt ;
}
EXPORT_SYMBOL ( mntget ) ;
2005-11-07 17:13:39 -05:00
void mnt_pin ( struct vfsmount * mnt )
{
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:13:39 -05:00
mnt - > mnt_pinned + + ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:13:39 -05:00
}
EXPORT_SYMBOL ( mnt_pin ) ;
void mnt_unpin ( struct vfsmount * mnt )
{
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:13:39 -05:00
if ( mnt - > mnt_pinned ) {
2011-01-07 17:50:11 +11:00
mnt_inc_count ( mnt ) ;
2005-11-07 17:13:39 -05:00
mnt - > mnt_pinned - - ;
}
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:13:39 -05:00
}
EXPORT_SYMBOL ( mnt_unpin ) ;
2005-04-16 15:20:36 -07:00
2008-02-08 04:21:35 -08:00
static inline void mangle ( struct seq_file * m , const char * s )
{
seq_escape ( m , s , " \t \n \\ " ) ;
}
/*
* Simple . show_options callback for filesystems which don ' t want to
* implement more complex mount option showing .
*
* See also save_mount_options ( ) .
*/
int generic_show_options ( struct seq_file * m , struct vfsmount * mnt )
{
2009-05-08 16:05:57 -04:00
const char * options ;
rcu_read_lock ( ) ;
options = rcu_dereference ( mnt - > mnt_sb - > s_options ) ;
2008-02-08 04:21:35 -08:00
if ( options ! = NULL & & options [ 0 ] ) {
seq_putc ( m , ' , ' ) ;
mangle ( m , options ) ;
}
2009-05-08 16:05:57 -04:00
rcu_read_unlock ( ) ;
2008-02-08 04:21:35 -08:00
return 0 ;
}
EXPORT_SYMBOL ( generic_show_options ) ;
/*
* If filesystem uses generic_show_options ( ) , this function should be
* called from the fill_super ( ) callback .
*
* The . remount_fs callback usually needs to be handled in a special
* way , to make sure , that previous options are not overwritten if the
* remount fails .
*
* Also note , that if the filesystem ' s . remount_fs function doesn ' t
* reset all options to their default value , but changes only newly
* given options , then the displayed options will not reflect reality
* any more .
*/
void save_mount_options ( struct super_block * sb , char * options )
{
2009-05-08 16:05:57 -04:00
BUG_ON ( sb - > s_options ) ;
rcu_assign_pointer ( sb - > s_options , kstrdup ( options , GFP_KERNEL ) ) ;
2008-02-08 04:21:35 -08:00
}
EXPORT_SYMBOL ( save_mount_options ) ;
2009-05-08 16:05:57 -04:00
void replace_mount_options ( struct super_block * sb , char * options )
{
char * old = sb - > s_options ;
rcu_assign_pointer ( sb - > s_options , options ) ;
if ( old ) {
synchronize_rcu ( ) ;
kfree ( old ) ;
}
}
EXPORT_SYMBOL ( replace_mount_options ) ;
2008-03-27 13:06:24 +01:00
# ifdef CONFIG_PROC_FS
2005-04-16 15:20:36 -07:00
/* iterator */
static void * m_start ( struct seq_file * m , loff_t * pos )
{
2008-03-27 13:06:24 +01:00
struct proc_mounts * p = m - > private ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:17:51 -05:00
down_read ( & namespace_sem ) ;
2008-03-27 13:06:24 +01:00
return seq_list_start ( & p - > ns - > list , * pos ) ;
2005-04-16 15:20:36 -07:00
}
static void * m_next ( struct seq_file * m , void * v , loff_t * pos )
{
2008-03-27 13:06:24 +01:00
struct proc_mounts * p = m - > private ;
2007-07-15 23:39:55 -07:00
2008-03-27 13:06:24 +01:00
return seq_list_next ( v , & p - > ns - > list , pos ) ;
2005-04-16 15:20:36 -07:00
}
static void m_stop ( struct seq_file * m , void * v )
{
2005-11-07 17:17:51 -05:00
up_read ( & namespace_sem ) ;
2005-04-16 15:20:36 -07:00
}
2010-02-05 00:40:25 -05:00
int mnt_had_events ( struct proc_mounts * p )
{
struct mnt_namespace * ns = p - > ns ;
int res = 0 ;
2010-08-18 04:37:39 +10:00
br_read_lock ( vfsmount_lock ) ;
2011-07-12 20:48:39 +02:00
if ( p - > m . poll_event ! = ns - > event ) {
p - > m . poll_event = ns - > event ;
2010-02-05 00:40:25 -05:00
res = 1 ;
}
2010-08-18 04:37:39 +10:00
br_read_unlock ( vfsmount_lock ) ;
2010-02-05 00:40:25 -05:00
return res ;
}
2008-03-27 13:06:25 +01:00
struct proc_fs_info {
int flag ;
const char * str ;
} ;
2008-07-04 09:47:13 +10:00
static int show_sb_opts ( struct seq_file * m , struct super_block * sb )
2005-04-16 15:20:36 -07:00
{
2008-03-27 13:06:25 +01:00
static const struct proc_fs_info fs_info [ ] = {
2005-04-16 15:20:36 -07:00
{ MS_SYNCHRONOUS , " ,sync " } ,
{ MS_DIRSYNC , " ,dirsync " } ,
{ MS_MANDLOCK , " ,mand " } ,
{ 0 , NULL }
} ;
2008-03-27 13:06:25 +01:00
const struct proc_fs_info * fs_infop ;
for ( fs_infop = fs_info ; fs_infop - > flag ; fs_infop + + ) {
if ( sb - > s_flags & fs_infop - > flag )
seq_puts ( m , fs_infop - > str ) ;
}
2008-07-04 09:47:13 +10:00
return security_sb_show_options ( m , sb ) ;
2008-03-27 13:06:25 +01:00
}
static void show_mnt_opts ( struct seq_file * m , struct vfsmount * mnt )
{
static const struct proc_fs_info mnt_info [ ] = {
2005-04-16 15:20:36 -07:00
{ MNT_NOSUID , " ,nosuid " } ,
{ MNT_NODEV , " ,nodev " } ,
{ MNT_NOEXEC , " ,noexec " } ,
2006-01-09 20:52:17 -08:00
{ MNT_NOATIME , " ,noatime " } ,
{ MNT_NODIRATIME , " ,nodiratime " } ,
2006-12-13 00:34:34 -08:00
{ MNT_RELATIME , " ,relatime " } ,
2005-04-16 15:20:36 -07:00
{ 0 , NULL }
} ;
2008-03-27 13:06:25 +01:00
const struct proc_fs_info * fs_infop ;
for ( fs_infop = mnt_info ; fs_infop - > flag ; fs_infop + + ) {
if ( mnt - > mnt_flags & fs_infop - > flag )
seq_puts ( m , fs_infop - > str ) ;
}
}
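show_sb_opts() and show_mnt_opts() above share one pattern: a static table of { flag, string } pairs ending in a { 0, NULL } sentinel, walked until the flag field is zero. A minimal user-space sketch of that pattern follows. The DEMO_* flag values, struct demo_flag_name and demo_format_flags() are hypothetical names invented for illustration, and snprintf() into a caller-supplied buffer stands in for seq_puts().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; the kernel's MNT_* values differ. */
#define DEMO_NOSUID 0x01
#define DEMO_NODEV  0x02
#define DEMO_NOEXEC 0x04

struct demo_flag_name {
	int flag;
	const char *str;
};

/*
 * Append ",name" to buf for every bit set in flags.
 * buf must already hold a NUL-terminated string shorter than len.
 */
static void demo_format_flags(char *buf, size_t len, int flags)
{
	static const struct demo_flag_name table[] = {
		{ DEMO_NOSUID, ",nosuid" },
		{ DEMO_NODEV,  ",nodev" },
		{ DEMO_NOEXEC, ",noexec" },
		{ 0, NULL }	/* sentinel: a zero flag ends the walk */
	};
	const struct demo_flag_name *p;
	size_t used = strlen(buf);

	assert(used < len);
	for (p = table; p->flag; p++) {
		if (flags & p->flag) {
			/* snprintf truncates safely if the buffer fills up */
			snprintf(buf + used, len - used, "%s", p->str);
			used += strlen(buf + used);
		}
	}
}
```

Because the loop stops at the sentinel, adding an option only means adding a row to the table; the walking code does not change.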
static void show_type ( struct seq_file * m , struct super_block * sb )
{
mangle ( m , sb - > s_type - > name ) ;
if ( sb - > s_subtype & & sb - > s_subtype [ 0 ] ) {
seq_putc ( m , ' . ' ) ;
mangle ( m , sb - > s_subtype ) ;
}
}
static int show_vfsmnt ( struct seq_file * m , void * v )
{
struct vfsmount * mnt = list_entry ( v , struct vfsmount , mnt_list ) ;
int err = 0 ;
2008-02-14 19:38:43 -08:00
struct path mnt_path = { . dentry = mnt - > mnt_root , . mnt = mnt } ;
2005-04-16 15:20:36 -07:00
2011-03-16 06:59:40 -04:00
if ( mnt - > mnt_sb - > s_op - > show_devname ) {
err = mnt - > mnt_sb - > s_op - > show_devname ( m , mnt ) ;
if ( err )
goto out ;
} else {
mangle ( m , mnt - > mnt_devname ? mnt - > mnt_devname : " none " ) ;
}
2005-04-16 15:20:36 -07:00
seq_putc ( m , ' ' ) ;
2008-02-14 19:38:43 -08:00
seq_path ( m , & mnt_path , " \t \n \\ " ) ;
2005-04-16 15:20:36 -07:00
seq_putc ( m , ' ' ) ;
2008-03-27 13:06:25 +01:00
show_type ( m , mnt - > mnt_sb ) ;
2008-02-15 14:38:00 -08:00
seq_puts ( m , __mnt_is_readonly ( mnt ) ? " ro " : " rw " ) ;
2008-07-04 09:47:13 +10:00
err = show_sb_opts ( m , mnt - > mnt_sb ) ;
if ( err )
goto out ;
2008-03-27 13:06:25 +01:00
show_mnt_opts ( m , mnt ) ;
2005-04-16 15:20:36 -07:00
if ( mnt - > mnt_sb - > s_op - > show_options )
err = mnt - > mnt_sb - > s_op - > show_options ( m , mnt ) ;
seq_puts ( m , " 0 0 \n " ) ;
2008-07-04 09:47:13 +10:00
out :
2005-04-16 15:20:36 -07:00
return err ;
}
2008-03-27 13:06:24 +01:00
const struct seq_operations mounts_op = {
2005-04-16 15:20:36 -07:00
. start = m_start ,
. next = m_next ,
. stop = m_stop ,
. show = show_vfsmnt
} ;
2008-03-27 13:06:25 +01:00
static int show_mountinfo ( struct seq_file * m , void * v )
{
struct proc_mounts * p = m - > private ;
struct vfsmount * mnt = list_entry ( v , struct vfsmount , mnt_list ) ;
struct super_block * sb = mnt - > mnt_sb ;
struct path mnt_path = { . dentry = mnt - > mnt_root , . mnt = mnt } ;
struct path root = p - > root ;
int err = 0 ;
seq_printf ( m , " %i %i %u:%u " , mnt - > mnt_id , mnt - > mnt_parent - > mnt_id ,
MAJOR ( sb - > s_dev ) , MINOR ( sb - > s_dev ) ) ;
2011-03-16 06:59:40 -04:00
if ( sb - > s_op - > show_path )
err = sb - > s_op - > show_path ( m , mnt ) ;
else
seq_dentry ( m , mnt - > mnt_root , " \t \n \\ " ) ;
if ( err )
goto out ;
2008-03-27 13:06:25 +01:00
seq_putc ( m , ' ' ) ;
seq_path_root ( m , & mnt_path , & root , " \t \n \\ " ) ;
if ( root . mnt ! = p - > root . mnt | | root . dentry ! = p - > root . dentry ) {
/*
* Mountpoint is outside root , discard that one . Ugly ,
* but less so than trying to do that in iterator in a
* race - free way ( due to renames ) .
*/
return SEQ_SKIP ;
}
seq_puts ( m , mnt - > mnt_flags & MNT_READONLY ? " ro " : " rw " ) ;
show_mnt_opts ( m , mnt ) ;
/* Tagged fields ("foo:X" or "bar") */
if ( IS_MNT_SHARED ( mnt ) )
seq_printf ( m , " shared:%i " , mnt - > mnt_group_id ) ;
2008-03-27 13:06:26 +01:00
if ( IS_MNT_SLAVE ( mnt ) ) {
int master = mnt - > mnt_master - > mnt_group_id ;
int dom = get_dominating_id ( mnt , & p - > root ) ;
seq_printf ( m , " master:%i " , master ) ;
if ( dom & & dom ! = master )
seq_printf ( m , " propagate_from:%i " , dom ) ;
}
2008-03-27 13:06:25 +01:00
if ( IS_MNT_UNBINDABLE ( mnt ) )
seq_puts ( m , " unbindable " ) ;
/* Filesystem specific data */
seq_puts ( m , " - " ) ;
show_type ( m , sb ) ;
seq_putc ( m , ' ' ) ;
2011-03-16 06:59:40 -04:00
if ( sb - > s_op - > show_devname )
err = sb - > s_op - > show_devname ( m , mnt ) ;
else
mangle ( m , mnt - > mnt_devname ? mnt - > mnt_devname : " none " ) ;
if ( err )
goto out ;
2008-03-27 13:06:25 +01:00
seq_puts ( m , sb - > s_flags & MS_RDONLY ? " ro " : " rw " ) ;
2008-07-04 09:47:13 +10:00
err = show_sb_opts ( m , sb ) ;
if ( err )
goto out ;
2008-03-27 13:06:25 +01:00
if ( sb - > s_op - > show_options )
err = sb - > s_op - > show_options ( m , mnt ) ;
seq_putc ( m , ' \n ' ) ;
2008-07-04 09:47:13 +10:00
out :
2008-03-27 13:06:25 +01:00
return err ;
}
const struct seq_operations mountinfo_op = {
. start = m_start ,
. next = m_next ,
. stop = m_stop ,
. show = show_mountinfo ,
} ;
2006-03-20 13:44:12 -05:00
static int show_vfsstat ( struct seq_file * m , void * v )
{
2007-07-15 23:39:55 -07:00
struct vfsmount * mnt = list_entry ( v , struct vfsmount , mnt_list ) ;
2008-02-14 19:38:43 -08:00
struct path mnt_path = { . dentry = mnt - > mnt_root , . mnt = mnt } ;
2006-03-20 13:44:12 -05:00
int err = 0 ;
/* device */
2011-03-16 06:59:40 -04:00
if ( mnt - > mnt_sb - > s_op - > show_devname ) {
err = mnt - > mnt_sb - > s_op - > show_devname ( m , mnt ) ;
} else {
if ( mnt - > mnt_devname ) {
seq_puts ( m , " device " ) ;
mangle ( m , mnt - > mnt_devname ) ;
} else
seq_puts ( m , " no device " ) ;
}
2006-03-20 13:44:12 -05:00
/* mount point */
seq_puts ( m , " mounted on " ) ;
2008-02-14 19:38:43 -08:00
seq_path ( m , & mnt_path , " \t \n \\ " ) ;
2006-03-20 13:44:12 -05:00
seq_putc ( m , ' ' ) ;
/* file system type */
seq_puts ( m , " with fstype " ) ;
2008-03-27 13:06:25 +01:00
show_type ( m , mnt - > mnt_sb ) ;
2006-03-20 13:44:12 -05:00
/* optional statistics */
if ( mnt - > mnt_sb - > s_op - > show_stats ) {
seq_putc ( m , ' ' ) ;
2011-03-16 06:59:40 -04:00
if ( ! err )
err = mnt - > mnt_sb - > s_op - > show_stats ( m , mnt ) ;
2006-03-20 13:44:12 -05:00
}
seq_putc ( m , ' \n ' ) ;
return err ;
}
2008-03-27 13:06:24 +01:00
const struct seq_operations mountstats_op = {
2006-03-20 13:44:12 -05:00
. start = m_start ,
. next = m_next ,
. stop = m_stop ,
. show = show_vfsstat ,
} ;
2008-03-27 13:06:24 +01:00
# endif /* CONFIG_PROC_FS */
2006-03-20 13:44:12 -05:00
2005-04-16 15:20:36 -07:00
/**
* may_umount_tree - check if a mount tree is busy
* @ mnt : root of mount tree
*
* This is called to check if a tree of mounts has any
* open files , pwds , chroots or sub mounts that are
* busy .
*/
int may_umount_tree ( struct vfsmount * mnt )
{
2005-11-07 17:17:22 -05:00
int actual_refs = 0 ;
int minimum_refs = 0 ;
struct vfsmount * p ;
2005-04-16 15:20:36 -07:00
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
/* write lock needed for mnt_get_count */
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:17:22 -05:00
for ( p = mnt ; p ; p = next_mnt ( p , mnt ) ) {
2011-01-07 17:50:11 +11:00
actual_refs + = mnt_get_count ( p ) ;
2005-04-16 15:20:36 -07:00
minimum_refs + = 2 ;
}
2011-01-07 17:50:11 +11:00
br_write_unlock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
if ( actual_refs > minimum_refs )
2006-03-27 01:14:51 -08:00
return 0 ;
2005-04-16 15:20:36 -07:00
2006-03-27 01:14:51 -08:00
return 1 ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( may_umount_tree ) ;
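The "fs: scale mntget/mntput" message quoted inside may_umount_tree() describes per-CPU deltas plus a separate "long" reference count, and may_umount_tree() itself compares the summed count against two references per mount. The following single-threaded user-space sketch models both ideas under simplifying assumptions. Every demo_* name and DEMO_NR_CPUS are hypothetical, there is no locking (the kernel code above takes vfsmount_lock for write before summing), and the kernel's real data layout may differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4	/* hypothetical fixed CPU count for the sketch */

struct demo_mount {
	int longrefs;			/* slow, long-lived references */
	int delta[DEMO_NR_CPUS];	/* per-CPU increments minus decrements */
};

/* Slow path: only the sum over all slots is a meaningful count. */
static int demo_count(const struct demo_mount *m)
{
	int cpu, sum = 0;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		sum += m->delta[cpu];
	return sum;
}

/* Fast path: touch only the calling CPU's slot. */
static void demo_get(struct demo_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	m->delta[cpu]++;
}

/*
 * Drop a reference, possibly on a different CPU than the matching get,
 * so a single slot may go negative. Returns 1 if this was the last one.
 */
static int demo_put(struct demo_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	m->delta[cpu]--;
	if (m->longrefs > 0)
		return 0;	/* a long reference pins the mount: skip the sum */
	return demo_count(m) == 0;
}

static void demo_get_long(struct demo_mount *m, int cpu)
{
	demo_get(m, cpu);
	m->longrefs++;
}

static int demo_put_long(struct demo_mount *m, int cpu)
{
	m->longrefs--;
	return demo_put(m, cpu);
}

/* Same comparison as may_umount_tree(): two baseline references per mount. */
static int demo_tree_idle(const struct demo_mount *mounts, size_t n)
{
	int actual = 0, minimum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		actual += demo_count(&mounts[i]);
		minimum += 2;
	}
	return actual <= minimum;
}
```

The point of the split is that demo_put() pays for the cross-CPU sum only once no long reference remains, which is the rare case for an attached mount.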
/**
* may_umount - check if a mount point is busy
* @ mnt : root of mount
*
* This is called to check if a mount point has any
* open files , pwds , chroots or sub mounts . If the
* mount has sub mounts this will return busy
* regardless of whether the sub mounts are busy .
*
* Doesn ' t take quota and stuff into account . IOW , in some cases it will
* give false negatives . The main reason why it ' s here is that we need
* a non - destructive way to look for easily umountable filesystems .
*/
int may_umount ( struct vfsmount * mnt )
{
2006-03-27 01:14:51 -08:00
int ret = 1 ;
2010-01-16 12:56:08 -05:00
down_read ( & namespace_sem ) ;
2011-01-07 17:50:11 +11:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:20:17 -05:00
if ( propagate_mount_busy ( mnt , 2 ) )
2006-03-27 01:14:51 -08:00
ret = 0 ;
2011-01-07 17:50:11 +11:00
br_write_unlock ( vfsmount_lock ) ;
2010-01-16 12:56:08 -05:00
up_read ( & namespace_sem ) ;
2005-11-07 17:20:17 -05:00
return ret ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( may_umount ) ;
2005-11-07 17:19:50 -05:00
void release_mounts ( struct list_head * head )
2005-11-07 17:17:04 -05:00
{
struct vfsmount * mnt ;
2006-01-08 01:03:19 -08:00
while ( ! list_empty ( head ) ) {
Introduce a handy list_first_entry macro
There are many places in the kernel where the construction like
foo = list_entry(head->next, struct foo_struct, list);
are used.
The code might look more descriptive and neat if using the macro
list_first_entry(head, type, member) \
list_entry((head)->next, type, member)
Here is the macro itself and the examples of its usage in the generic code.
If it will turn out to be useful, I can prepare the set of patches to
inject in into arch-specific code, drivers, networking, etc.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 00:30:19 -07:00
mnt = list_first_entry ( head , struct vfsmount , mnt_hash ) ;
2005-11-07 17:17:04 -05:00
list_del_init ( & mnt - > mnt_hash ) ;
if ( mnt - > mnt_parent ! = mnt ) {
struct dentry * dentry ;
struct vfsmount * m ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:17:04 -05:00
dentry = mnt - > mnt_mountpoint ;
m = mnt - > mnt_parent ;
mnt - > mnt_mountpoint = mnt - > mnt_root ;
mnt - > mnt_parent = mnt ;
2008-03-21 23:59:49 -04:00
m - > mnt_ghosts - - ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:17:04 -05:00
dput ( dentry ) ;
mntput ( m ) ;
}
2011-01-14 22:30:21 -05:00
mntput ( mnt ) ;
2005-11-07 17:17:04 -05:00
}
}
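release_mounts() above drains a list by repeatedly taking the first entry with list_first_entry() and unlinking it. The sketch below reproduces that drain loop in user space, assuming a minimal re-implementation of the intrusive list. All demo_* names are hypothetical stand-ins, and the macros are simplified equivalents rather than the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Recover the containing structure from a pointer to its embedded link. */
#define demo_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* First element of a non-empty list: the entry that head->next points into. */
#define demo_list_first_entry(head, type, member) \
	demo_list_entry((head)->next, type, member)

static void demo_list_init(struct demo_list_head *h)
{
	h->next = h->prev = h;
}

static int demo_list_empty(const struct demo_list_head *h)
{
	return h->next == h;
}

static void demo_list_add_tail(struct demo_list_head *n,
			       struct demo_list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink an entry and leave it pointing at itself, like list_del_init(). */
static void demo_list_del_init(struct demo_list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	demo_list_init(e);
}

struct demo_item {
	int id;
	struct demo_list_head link;
};

/* Drain loop in the style of release_mounts(); returns the sum of ids seen. */
static int demo_drain(struct demo_list_head *head)
{
	int sum = 0;

	assert(head != NULL);
	while (!demo_list_empty(head)) {
		struct demo_item *it =
			demo_list_first_entry(head, struct demo_item, link);

		demo_list_del_init(&it->link);
		sum += it->id;
	}
	return sum;
}
```

Taking the first entry on each pass, rather than iterating with a cursor, keeps the loop valid even though every pass removes the element it just visited.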
2010-08-18 04:37:39 +10:00
/*
* vfsmount lock must be held for write
* namespace_sem must be held for write
*/
2005-11-07 17:20:17 -05:00
void umount_tree ( struct vfsmount * mnt , int propagate , struct list_head * kill )
2005-04-16 15:20:36 -07:00
{
2011-01-15 20:08:44 -05:00
LIST_HEAD ( tmp_list ) ;
2005-04-16 15:20:36 -07:00
struct vfsmount * p ;
2006-06-26 00:24:40 -07:00
for ( p = mnt ; p ; p = next_mnt ( p , mnt ) )
2011-01-15 20:08:44 -05:00
list_move ( & p - > mnt_hash , & tmp_list ) ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:20:17 -05:00
if ( propagate )
2011-01-15 20:08:44 -05:00
propagate_umount ( & tmp_list ) ;
2005-11-07 17:20:17 -05:00
2011-01-15 20:08:44 -05:00
list_for_each_entry ( p , & tmp_list , mnt_hash ) {
2005-11-07 17:17:04 -05:00
list_del_init ( & p - > mnt_expire ) ;
list_del_init ( & p - > mnt_list ) ;
2006-12-08 02:37:56 -08:00
__touch_mnt_namespace ( p - > mnt_ns ) ;
p - > mnt_ns = NULL ;
2011-01-16 16:32:11 -05:00
__mnt_make_shortterm ( p ) ;
2005-11-07 17:17:04 -05:00
list_del_init ( & p - > mnt_child ) ;
2008-03-21 23:59:49 -04:00
if ( p - > mnt_parent ! = p ) {
p - > mnt_parent - > mnt_ghosts + + ;
2011-01-07 17:49:54 +11:00
dentry_reset_mounted ( p - > mnt_parent , p - > mnt_mountpoint ) ;
2008-03-21 23:59:49 -04:00
}
2005-11-07 17:20:17 -05:00
change_mnt_propagation ( p , MS_PRIVATE ) ;
2005-04-16 15:20:36 -07:00
}
2011-01-15 20:08:44 -05:00
list_splice ( & tmp_list , kill ) ;
2005-04-16 15:20:36 -07:00
}
2008-03-22 00:46:23 -04:00
static void shrink_submounts ( struct vfsmount * mnt , struct list_head * umounts ) ;
2005-04-16 15:20:36 -07:00
static int do_umount ( struct vfsmount * mnt , int flags )
{
2005-11-07 17:16:09 -05:00
struct super_block * sb = mnt - > mnt_sb ;
2005-04-16 15:20:36 -07:00
int retval ;
2005-11-07 17:17:04 -05:00
LIST_HEAD ( umount_list ) ;
2005-04-16 15:20:36 -07:00
retval = security_sb_umount ( mnt , flags ) ;
if ( retval )
return retval ;
/*
* Allow userspace to request a mountpoint be expired rather than
* unmounting unconditionally . Unmount only happens if :
* ( 1 ) the mark is already set ( the mark is cleared by mntput ( ) )
* ( 2 ) the usage count = = 1 [ parent vfsmount ] + 1 [ sys_umount ]
*/
if ( flags & MNT_EXPIRE ) {
2008-02-14 19:34:38 -08:00
if ( mnt = = current - > fs - > root . mnt | |
2005-04-16 15:20:36 -07:00
flags & ( MNT_FORCE | MNT_DETACH ) )
return - EINVAL ;
2011-01-07 17:50:11 +11:00
/*
* probably don ' t strictly need the lock here if we examined
* all race cases , but it ' s a slowpath .
*/
br_write_lock ( vfsmount_lock ) ;
if ( mnt_get_count ( mnt ) ! = 2 ) {
2011-02-23 16:59:49 +09:00
br_write_unlock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
return - EBUSY ;
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
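The third scheme described above (per-CPU differences plus a single "long" refcount) can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions, not the kernel implementation: `struct sketch_mount`, `sketch_get()`, `sketch_put()`, `sketch_sum()` and `NR_SKETCH_CPUS` are hypothetical names, the "CPUs" are plain array slots, and no locking or preemption control is modelled.

```c
#include <assert.h>

#define NR_SKETCH_CPUS 4	/* hypothetical CPU count for the simulation */

/* Hypothetical stand-in for the reference state of one mount. */
struct sketch_mount {
	int longrefs;			/* slow, long-lasting references */
	int delta[NR_SKETCH_CPUS];	/* per-CPU increments minus decrements */
};

/* Take a short reference on "cpu": a purely local increment. */
static void sketch_get(struct sketch_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SKETCH_CPUS);
	m->delta[cpu]++;
}

/* Sum all per-CPU differences; a single slot may legitimately be negative. */
static int sketch_sum(const struct sketch_mount *m)
{
	int cpu, sum = 0;

	for (cpu = 0; cpu < NR_SKETCH_CPUS; cpu++)
		sum += m->delta[cpu];
	return sum;
}

/*
 * Drop a short reference on "cpu". Returns 1 if this was the last
 * reference, 0 otherwise. The global sum is only computed when no
 * long references remain.
 */
static int sketch_put(struct sketch_mount *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SKETCH_CPUS);
	m->delta[cpu]--;
	if (m->longrefs > 0)
		return 0;		/* fast path: skip the global sum */
	return sketch_sum(m) == 0;
}
```

A reference taken on one slot and dropped on another leaves +1 and -1 in the two slots, which is why only the sum over all slots is meaningful. While `longrefs` is non-zero, `sketch_put()` never needs that sum, which is the property the commit message relies on for the common case.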
2011-01-07 17:50:11 +11:00
}
br_write_unlock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
if ( ! xchg ( & mnt - > mnt_expiry_mark , 1 ) )
return - EAGAIN ;
}
/*
* If we may have to abort operations to get out of this
* mount , and they will themselves hold resources we must
* allow the fs to do things . In the Unix tradition of
* ' Gee thats tricky lets do it in userspace ' the umount_begin
* might fail to complete on the first run through as other tasks
* must return , and the like . Thats for the mount program to worry
* about for the moment .
*/
2008-04-24 07:21:56 -04:00
if ( flags & MNT_FORCE & & sb - > s_op - > umount_begin ) {
sb - > s_op - > umount_begin ( sb ) ;
}
2005-04-16 15:20:36 -07:00
/*
* No sense to grab the lock for this test , but test itself looks
* somewhat bogus . Suggestions for better replacement ?
* Ho - hum . . . In principle , we might treat that as umount + switch
* to rootfs . GC would eventually take care of the old vfsmount .
* Actually it makes sense , especially if rootfs would contain a
* / reboot - static binary that would close all descriptors and
* call reboot ( 9 ) . Then init ( 8 ) could umount root and exec / reboot .
*/
2008-02-14 19:34:38 -08:00
if ( mnt = = current - > fs - > root . mnt & & ! ( flags & MNT_DETACH ) ) {
2005-04-16 15:20:36 -07:00
/*
* Special case for " unmounting " root . . .
* we just try to remount it readonly .
*/
down_write ( & sb - > s_umount ) ;
2009-05-08 13:36:58 -04:00
if ( ! ( sb - > s_flags & MS_RDONLY ) )
2005-04-16 15:20:36 -07:00
retval = do_remount_sb ( sb , MS_RDONLY , NULL , 0 ) ;
up_write ( & sb - > s_umount ) ;
return retval ;
}
2005-11-07 17:17:51 -05:00
down_write ( & namespace_sem ) ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:15:49 -05:00
event + + ;
2005-04-16 15:20:36 -07:00
2008-03-22 00:46:23 -04:00
if ( ! ( flags & MNT_DETACH ) )
shrink_submounts ( mnt , & umount_list ) ;
2005-04-16 15:20:36 -07:00
retval = - EBUSY ;
2005-11-07 17:20:17 -05:00
if ( flags & MNT_DETACH | | ! propagate_mount_busy ( mnt , 2 ) ) {
2005-04-16 15:20:36 -07:00
if ( ! list_empty ( & mnt - > mnt_list ) )
2005-11-07 17:20:17 -05:00
umount_tree ( mnt , 1 , & umount_list ) ;
2005-04-16 15:20:36 -07:00
retval = 0 ;
}
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:17:51 -05:00
up_write ( & namespace_sem ) ;
2005-11-07 17:17:04 -05:00
release_mounts ( & umount_list ) ;
2005-04-16 15:20:36 -07:00
return retval ;
}
/*
* Now umount can handle mount points as well as block devices .
* This is important for filesystems which use unnamed block devices .
*
* We now support a flag for forced unmount like the other ' big iron '
* unixes . Our API is identical to OSF / 1 to avoid making a mess of AMD
*/
2009-01-14 14:14:12 +01:00
SYSCALL_DEFINE2 ( umount , char __user * , name , int , flags )
2005-04-16 15:20:36 -07:00
{
2008-07-22 09:59:21 -04:00
struct path path ;
2005-04-16 15:20:36 -07:00
int retval ;
2010-02-10 12:15:53 +01:00
int lookup_flags = 0 ;
2005-04-16 15:20:36 -07:00
2010-02-10 12:15:53 +01:00
if ( flags & ~ ( MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW ) )
return - EINVAL ;
if ( ! ( flags & UMOUNT_NOFOLLOW ) )
lookup_flags | = LOOKUP_FOLLOW ;
retval = user_path_at ( AT_FDCWD , name , lookup_flags , & path ) ;
2005-04-16 15:20:36 -07:00
if ( retval )
goto out ;
retval = - EINVAL ;
2008-07-22 09:59:21 -04:00
if ( path . dentry ! = path . mnt - > mnt_root )
2005-04-16 15:20:36 -07:00
goto dput_and_out ;
2008-07-22 09:59:21 -04:00
if ( ! check_mnt ( path . mnt ) )
2005-04-16 15:20:36 -07:00
goto dput_and_out ;
retval = - EPERM ;
if ( ! capable ( CAP_SYS_ADMIN ) )
goto dput_and_out ;
2008-07-22 09:59:21 -04:00
retval = do_umount ( path . mnt , flags ) ;
2005-04-16 15:20:36 -07:00
dput_and_out :
2008-02-14 19:34:31 -08:00
/* we mustn't call path_put() as that would clear mnt_expiry_mark */
2008-07-22 09:59:21 -04:00
dput ( path . dentry ) ;
mntput_no_expire ( path . mnt ) ;
2005-04-16 15:20:36 -07:00
out :
return retval ;
}
# ifdef __ARCH_WANT_SYS_OLDUMOUNT
/*
2005-11-07 17:16:09 -05:00
* The 2.0 compatible umount . No flags .
2005-04-16 15:20:36 -07:00
*/
2009-01-14 14:14:12 +01:00
SYSCALL_DEFINE1 ( oldumount , char __user * , name )
2005-04-16 15:20:36 -07:00
{
2005-11-07 17:16:09 -05:00
return sys_umount ( name , 0 ) ;
2005-04-16 15:20:36 -07:00
}
# endif
2008-08-02 00:51:11 -04:00
static int mount_is_safe ( struct path * path )
2005-04-16 15:20:36 -07:00
{
if ( capable ( CAP_SYS_ADMIN ) )
return 0 ;
return - EPERM ;
# ifdef notyet
2008-08-02 00:51:11 -04:00
if ( S_ISLNK ( path - > dentry - > d_inode - > i_mode ) )
2005-04-16 15:20:36 -07:00
return - EPERM ;
2008-08-02 00:51:11 -04:00
if ( path - > dentry - > d_inode - > i_mode & S_ISVTX ) {
2008-11-14 10:39:05 +11:00
if ( current_uid ( ) ! = path - > dentry - > d_inode - > i_uid )
2005-04-16 15:20:36 -07:00
return - EPERM ;
}
2008-08-02 00:51:11 -04:00
if ( inode_permission ( path - > dentry - > d_inode , MAY_WRITE ) )
2005-04-16 15:20:36 -07:00
return - EPERM ;
return 0 ;
# endif
}
2005-11-07 17:19:50 -05:00
struct vfsmount * copy_tree ( struct vfsmount * mnt , struct dentry * dentry ,
2005-11-07 17:17:22 -05:00
int flag )
2005-04-16 15:20:36 -07:00
{
struct vfsmount * res , * p , * q , * r , * s ;
2008-03-21 20:48:19 -04:00
struct path path ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:21:20 -05:00
if ( ! ( flag & CL_COPY_ALL ) & & IS_MNT_UNBINDABLE ( mnt ) )
return NULL ;
2005-11-07 17:17:22 -05:00
res = q = clone_mnt ( mnt , dentry , flag ) ;
2005-04-16 15:20:36 -07:00
if ( ! q )
goto Enomem ;
q - > mnt_mountpoint = mnt - > mnt_mountpoint ;
p = mnt ;
2005-09-10 00:27:07 -07:00
list_for_each_entry ( r , & mnt - > mnt_mounts , mnt_child ) {
2008-04-29 00:59:40 -07:00
if ( ! is_subdir ( r - > mnt_mountpoint , dentry ) )
2005-04-16 15:20:36 -07:00
continue ;
for ( s = r ; s ; s = next_mnt ( s , r ) ) {
2005-11-07 17:21:20 -05:00
if ( ! ( flag & CL_COPY_ALL ) & & IS_MNT_UNBINDABLE ( s ) ) {
s = skip_mnt_tree ( s ) ;
continue ;
}
2005-04-16 15:20:36 -07:00
while ( p ! = s - > mnt_parent ) {
p = p - > mnt_parent ;
q = q - > mnt_parent ;
}
p = s ;
2008-03-21 20:48:19 -04:00
path . mnt = q ;
path . dentry = p - > mnt_mountpoint ;
2005-11-07 17:17:22 -05:00
q = clone_mnt ( p , p - > mnt_root , flag ) ;
2005-04-16 15:20:36 -07:00
if ( ! q )
goto Enomem ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
list_add_tail ( & q - > mnt_list , & res - > mnt_list ) ;
2008-03-21 20:48:19 -04:00
attach_mnt ( q , & path ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
}
}
return res ;
2005-11-07 17:16:09 -05:00
Enomem :
2005-04-16 15:20:36 -07:00
if ( res ) {
2005-11-07 17:17:04 -05:00
LIST_HEAD ( umount_list ) ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:20:17 -05:00
umount_tree ( res , 0 , & umount_list ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:17:04 -05:00
release_mounts ( & umount_list ) ;
2005-04-16 15:20:36 -07:00
}
return NULL ;
}
2009-04-18 03:28:19 -04:00
struct vfsmount * collect_mounts ( struct path * path )
2007-06-07 12:20:32 -04:00
{
struct vfsmount * tree ;
2008-03-22 16:19:49 -04:00
down_write ( & namespace_sem ) ;
2009-04-18 03:28:19 -04:00
tree = copy_tree ( path - > mnt , path - > dentry , CL_COPY_ALL | CL_PRIVATE ) ;
2008-03-22 16:19:49 -04:00
up_write ( & namespace_sem ) ;
2007-06-07 12:20:32 -04:00
return tree ;
}
void drop_collected_mounts ( struct vfsmount * mnt )
{
LIST_HEAD ( umount_list ) ;
2008-03-22 16:19:49 -04:00
down_write ( & namespace_sem ) ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2007-06-07 12:20:32 -04:00
umount_tree ( mnt , 0 , & umount_list ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2008-03-22 16:19:49 -04:00
up_write ( & namespace_sem ) ;
2007-06-07 12:20:32 -04:00
release_mounts ( & umount_list ) ;
}
2010-01-30 22:51:25 -05:00
int iterate_mounts ( int ( * f ) ( struct vfsmount * , void * ) , void * arg ,
struct vfsmount * root )
{
struct vfsmount * mnt ;
int res = f ( root , arg ) ;
if ( res )
return res ;
list_for_each_entry ( mnt , & root - > mnt_list , mnt_list ) {
res = f ( mnt , arg ) ;
if ( res )
return res ;
}
return 0 ;
}
2008-03-27 13:06:23 +01:00
static void cleanup_group_ids ( struct vfsmount * mnt , struct vfsmount * end )
{
struct vfsmount * p ;
for ( p = mnt ; p ! = end ; p = next_mnt ( p , mnt ) ) {
if ( p - > mnt_group_id & & ! IS_MNT_SHARED ( p ) )
mnt_release_group_id ( p ) ;
}
}
static int invent_group_ids ( struct vfsmount * mnt , bool recurse )
{
struct vfsmount * p ;
for ( p = mnt ; p ; p = recurse ? next_mnt ( p , mnt ) : NULL ) {
if ( ! p - > mnt_group_id & & ! IS_MNT_SHARED ( p ) ) {
int err = mnt_alloc_group_id ( p ) ;
if ( err ) {
cleanup_group_ids ( mnt , p ) ;
return err ;
}
}
}
return 0 ;
}
2005-11-07 17:19:50 -05:00
/*
* @ source_mnt : mount tree to be attached
2005-11-07 17:20:03 -05:00
* @ path : place where the mount tree @ source_mnt is attached
* @ parent_path : if non - null , detach the source_mnt from its parent and
* store the parent mount and mountpoint dentry .
* ( done when source_mnt is moved )
2005-11-07 17:19:50 -05:00
*
* NOTE : the table below explains the semantics when a source mount
* of a given type is attached to a destination mount of a given type .
2005-11-07 17:21:20 -05:00
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* | BIND MOUNT OPERATION |
* | * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* | source - - > | shared | private | slave | unbindable |
* | dest | | | | |
* | | | | | | |
* | v | | | | |
* | * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* | shared | shared ( + + ) | shared ( + ) | shared ( + + + ) | invalid |
* | | | | | |
* | non - shared | shared ( + ) | private | slave ( * ) | invalid |
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
2005-11-07 17:19:50 -05:00
* A bind operation clones the source mount and mounts the clone on the
* destination mount .
*
* ( + + ) the cloned mount is propagated to all the mounts in the propagation
* tree of the destination mount and the cloned mount is added to
* the peer group of the source mount .
* ( + ) the cloned mount is created under the destination mount and is marked
* as shared . The cloned mount is added to the peer group of the source
* mount .
2005-11-07 17:21:01 -05:00
* ( + + + ) the mount is propagated to all the mounts in the propagation tree
* of the destination mount and the cloned mount is made slave
* of the same master as that of the source mount . The cloned mount
* is marked as ' shared and slave ' .
* ( * ) the cloned mount is made a slave of the same master as that of the
* source mount .
*
2005-11-07 17:21:20 -05:00
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* | MOVE MOUNT OPERATION |
* | * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* | source - - > | shared | private | slave | unbindable |
* | dest | | | | |
* | | | | | | |
* | v | | | | |
* | * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* | shared | shared ( + ) | shared ( + ) | shared ( + + + ) | invalid |
* | | | | | |
* | non - shared | shared ( + * ) | private | slave ( * ) | unbindable |
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
2005-11-07 17:21:01 -05:00
*
* ( + ) the mount is moved to the destination . And is then propagated to
* all the mounts in the propagation tree of the destination mount .
2005-11-07 17:20:03 -05:00
* ( + * ) the mount is moved to the destination .
2005-11-07 17:21:01 -05:00
* ( + + + ) the mount is moved to the destination and is then propagated to
* all the mounts belonging to the destination mount ' s propagation tree .
* the mount is marked as ' shared and slave ' .
* ( * ) the mount continues to be a slave at the new location .
2005-11-07 17:19:50 -05:00
*
* if the source mount is a tree , the operations explained above are
* applied to each mount in the tree .
* Must be called without spinlocks held , since this function can sleep
* in allocations .
*/
static int attach_recursive_mnt ( struct vfsmount * source_mnt ,
2008-03-21 20:48:19 -04:00
struct path * path , struct path * parent_path )
2005-11-07 17:19:50 -05:00
{
LIST_HEAD ( tree_list ) ;
2008-03-21 20:48:19 -04:00
struct vfsmount * dest_mnt = path - > mnt ;
struct dentry * dest_dentry = path - > dentry ;
2005-11-07 17:19:50 -05:00
struct vfsmount * child , * p ;
2008-03-27 13:06:23 +01:00
int err ;
2005-11-07 17:19:50 -05:00
2008-03-27 13:06:23 +01:00
if ( IS_MNT_SHARED ( dest_mnt ) ) {
err = invent_group_ids ( source_mnt , true ) ;
if ( err )
goto out ;
}
err = propagate_mnt ( dest_mnt , dest_dentry , source_mnt , & tree_list ) ;
if ( err )
goto out_cleanup_ids ;
2005-11-07 17:19:50 -05:00
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2010-01-16 12:57:40 -05:00
2005-11-07 17:19:50 -05:00
if ( IS_MNT_SHARED ( dest_mnt ) ) {
for ( p = source_mnt ; p ; p = next_mnt ( p , source_mnt ) )
set_mnt_shared ( p ) ;
}
2008-03-21 20:48:19 -04:00
if ( parent_path ) {
detach_mnt ( source_mnt , parent_path ) ;
attach_mnt ( source_mnt , path ) ;
2009-04-07 12:15:39 -04:00
touch_mnt_namespace ( parent_path - > mnt - > mnt_ns ) ;
2005-11-07 17:20:03 -05:00
} else {
mnt_set_mountpoint ( dest_mnt , dest_dentry , source_mnt ) ;
commit_tree ( source_mnt ) ;
}
2005-11-07 17:19:50 -05:00
list_for_each_entry_safe ( child , p , & tree_list , mnt_hash ) {
list_del_init ( & child - > mnt_hash ) ;
commit_tree ( child ) ;
}
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:19:50 -05:00
return 0 ;
2008-03-27 13:06:23 +01:00
out_cleanup_ids :
if ( IS_MNT_SHARED ( dest_mnt ) )
cleanup_group_ids ( source_mnt , NULL ) ;
out :
return err ;
2005-11-07 17:19:50 -05:00
}
2011-03-18 08:55:38 -04:00
static int lock_mount ( struct path * path )
{
struct vfsmount * mnt ;
retry :
mutex_lock ( & path - > dentry - > d_inode - > i_mutex ) ;
if ( unlikely ( cant_mount ( path - > dentry ) ) ) {
mutex_unlock ( & path - > dentry - > d_inode - > i_mutex ) ;
return - ENOENT ;
}
down_write ( & namespace_sem ) ;
mnt = lookup_mnt ( path ) ;
if ( likely ( ! mnt ) )
return 0 ;
up_write ( & namespace_sem ) ;
mutex_unlock ( & path - > dentry - > d_inode - > i_mutex ) ;
path_put ( path ) ;
path - > mnt = mnt ;
path - > dentry = dget ( mnt - > mnt_root ) ;
goto retry ;
}
static void unlock_mount ( struct path * path )
{
up_write ( & namespace_sem ) ;
mutex_unlock ( & path - > dentry - > d_inode - > i_mutex ) ;
}
2008-03-22 18:00:39 -04:00
static int graft_tree ( struct vfsmount * mnt , struct path * path )
2005-04-16 15:20:36 -07:00
{
if ( mnt - > mnt_sb - > s_flags & MS_NOUSER )
return - EINVAL ;
2008-03-22 18:00:39 -04:00
if ( S_ISDIR ( path - > dentry - > d_inode - > i_mode ) ! =
2005-04-16 15:20:36 -07:00
S_ISDIR ( mnt - > mnt_root - > d_inode - > i_mode ) )
return - ENOTDIR ;
2011-03-18 08:55:38 -04:00
if ( d_unlinked ( path - > dentry ) )
return - ENOENT ;
2005-04-16 15:20:36 -07:00
2011-03-18 08:55:38 -04:00
return attach_recursive_mnt ( mnt , path , NULL ) ;
2005-04-16 15:20:36 -07:00
}
2010-08-26 11:07:22 -07:00
/*
* Sanity check the flags to change_mnt_propagation .
*/
static int flags_to_propagation_type ( int flags )
{
fs/namespace.c: bound mount propagation fix
This issue was discovered by busybox users. It definitely affects busybox;
I don't know how it affects others. Apparently, mount is called both with
and without MS_SILENT, and this changes mount() behaviour, although
MS_SILENT is only supposed to affect kernel logging verbosity.
The following script was run in an empty test directory:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1
mount -vv --make-rshared mount.shared1
mount -vv --bind mount.shared2 mount.shared2
mount -vv --make-rshared mount.shared2
mount -vv --bind mount.shared2 mount.shared1
mount -vv --bind mount.dir mount.shared2
ls -R mount.dir mount.shared1 mount.shared2
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
mount -vv was used to show the mount() call arguments and result.
Output shows that flag argument has 0x00008000 = MS_SILENT bit:
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
mount.shared2:
a
b
After adding --loud option to remove MS_SILENT bit from just one mount cmd:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1 2>&1
mount -vv --make-rshared mount.shared1 2>&1
mount -vv --bind mount.shared2 mount.shared2 2>&1
mount -vv --loud --make-rshared mount.shared2 2>&1 # <-HERE
mount -vv --bind mount.shared2 mount.shared1 2>&1
mount -vv --bind mount.dir mount.shared2 2>&1
ls -R mount.dir mount.shared1 mount.shared2 2>&1
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
The result is different now - look closely at mount.shared1 directory listing.
Now it does show files 'a' and 'b':
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x00104000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
a
b
mount.shared2:
a
b
Analysis shows that the MS_SILENT flag, which is on by default in every
busybox mount operation, reaches flags_to_propagation_type() and makes it
return an error during the is_power_of_2() check, because the function
expects exactly one bit to be set. As a result, busybox mount cannot be
used with any of the --make-[r]shared, --make-[r]private etc. options.
Moreover, the recently added flags_to_propagation_type() function rejects
operations such as --make-[r]private, --make-[r]shared etc. whenever
MS_SILENT is on. The idea of clearing the MS_SILENT flag came from
Denys Vlasenko.
Signed-off-by: Roman Borisov <ext-roman.borisov@nokia.com>
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
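The effect described in this commit message can be reproduced outside the kernel with a small sketch. It assumes the flag values from the kernel headers (they match the 0x0010c000 and 0x00104000 arguments in the logs above) and uses hypothetical `SK_`/`sk_` names throughout. `sk_propagation_type()` takes the set of ignored bits as a parameter so that the behaviour with and without MS_SILENT masking can be compared; the real function has no such parameter.

```c
#include <assert.h>

/* Local copies of the mount flag values; the SK_ names are hypothetical. */
#define SK_MS_REC		0x004000
#define SK_MS_SILENT		0x008000
#define SK_MS_UNBINDABLE	0x020000
#define SK_MS_PRIVATE		0x040000
#define SK_MS_SLAVE		0x080000
#define SK_MS_SHARED		0x100000

#define SK_PROPAGATION_MASK \
	(SK_MS_SHARED | SK_MS_PRIVATE | SK_MS_SLAVE | SK_MS_UNBINDABLE)

/* Userspace stand-in for the kernel's is_power_of_2(). */
static int sk_is_power_of_2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return the single propagation flag contained in "flags", or 0 if the
 * request is invalid. "ignored" is the set of bits masked off first and
 * must not overlap the propagation flags.
 */
static int sk_propagation_type(int flags, int ignored)
{
	int type;

	assert((ignored & SK_PROPAGATION_MASK) == 0);
	type = flags & ~ignored;

	/* Fail if any non-propagation flags are left over */
	if (type & ~SK_PROPAGATION_MASK)
		return 0;
	/* Exactly one propagation flag must be set */
	if (!sk_is_power_of_2((unsigned int)type))
		return 0;
	return type;
}
```

With only `SK_MS_REC` ignored, the busybox argument 0x0010c000 is rejected because the MS_SILENT bit survives the masking; with `SK_MS_REC | SK_MS_SILENT` ignored, the same argument yields `SK_MS_SHARED`.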
2011-05-25 16:26:48 -07:00
int type = flags & ~ ( MS_REC | MS_SILENT ) ;
2010-08-26 11:07:22 -07:00
/* Fail if any non-propagation flags are set */
if ( type & ~ ( MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE ) )
return 0 ;
/* Only one propagation flag should be set */
if ( ! is_power_of_2 ( type ) )
return 0 ;
return type ;
}
2005-11-07 17:19:07 -05:00
/*
* recursively change the type of the mountpoint .
*/
2008-08-02 00:55:27 -04:00
static int do_change_type ( struct path * path , int flag )
2005-11-07 17:19:07 -05:00
{
2008-08-02 00:51:11 -04:00
struct vfsmount * m , * mnt = path - > mnt ;
2005-11-07 17:19:07 -05:00
int recurse = flag & MS_REC ;
2010-08-26 11:07:22 -07:00
int type ;
2008-03-27 13:06:23 +01:00
int err = 0 ;
2005-11-07 17:19:07 -05:00
2007-05-08 00:30:40 -07:00
if ( ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
2008-08-02 00:51:11 -04:00
if ( path - > dentry ! = path - > mnt - > mnt_root )
2005-11-07 17:19:07 -05:00
return - EINVAL ;
2010-08-26 11:07:22 -07:00
type = flags_to_propagation_type ( flag ) ;
if ( ! type )
return - EINVAL ;
2005-11-07 17:19:07 -05:00
down_write ( & namespace_sem ) ;
2008-03-27 13:06:23 +01:00
if ( type = = MS_SHARED ) {
err = invent_group_ids ( mnt , recurse ) ;
if ( err )
goto out_unlock ;
}
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.
The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability will not be
much improved in common cases yet, due to other locks (e.g. dcache_lock) getting
in the way. However, path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
patch and 7.1s afterwards, about 5% slower.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:19:07 -05:00
for ( m = mnt ; m ; m = ( recurse ? next_mnt ( m , mnt ) : NULL ) )
change_mnt_propagation ( m , type ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2008-03-27 13:06:23 +01:00
out_unlock :
2005-11-07 17:19:07 -05:00
up_write ( & namespace_sem ) ;
2008-03-27 13:06:23 +01:00
return err ;
2005-11-07 17:19:07 -05:00
}
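The br_write_lock()/br_write_unlock() pairs in this file follow the "big reader" idea: a reader takes only a lock local to its CPU, while a writer has to take every per-CPU lock. The following is a rough userspace sketch of that idea using C11 atomic flags. The names brlock_sketch, NR_SLOTS and the brl_* helpers are hypothetical, and the kernel implementation differs in detail (real per-CPU data, preemption control, lockdep annotations).

```c
#include <stdatomic.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

struct brlock_sketch {
	atomic_flag slot[NR_SLOTS];	/* one spin flag per "CPU" */
};

static void brl_init(struct brlock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_flag_clear(&l->slot[i]);
}

/* Reader fast path: touch only the caller's own slot. */
static void brl_read_lock(struct brlock_sketch *l, int cpu)
{
	while (atomic_flag_test_and_set_explicit(&l->slot[cpu],
						 memory_order_acquire))
		;	/* spin until the slot is free */
}

static void brl_read_unlock(struct brlock_sketch *l, int cpu)
{
	atomic_flag_clear_explicit(&l->slot[cpu], memory_order_release);
}

/* Writer slow path: take every slot, always in ascending order. */
static void brl_write_lock(struct brlock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		while (atomic_flag_test_and_set_explicit(&l->slot[i],
							 memory_order_acquire))
			;
}

static void brl_write_unlock(struct brlock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_flag_clear_explicit(&l->slot[i], memory_order_release);
}
```

The asymmetry is the point of the design: readers never contend on shared state across slots, and the writer pays a cost proportional to the number of slots, which is consistent with the slower umount microbenchmark reported in the commit message.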
2005-04-16 15:20:36 -07:00
/*
* do loopback mount .
*/
2008-08-02 00:55:27 -04:00
static int do_loopback ( struct path * path , char * old_name ,
2008-02-08 04:22:12 -08:00
int recurse )
2005-04-16 15:20:36 -07:00
{
2011-03-18 08:55:38 -04:00
LIST_HEAD ( umount_list ) ;
2008-08-02 00:51:11 -04:00
struct path old_path ;
2005-04-16 15:20:36 -07:00
struct vfsmount * mnt = NULL ;
2008-08-02 00:51:11 -04:00
int err = mount_is_safe ( path ) ;
2005-04-16 15:20:36 -07:00
if ( err )
return err ;
if ( ! old_name | | ! * old_name )
return - EINVAL ;
2008-08-02 00:51:11 -04:00
err = kern_path ( old_name , LOOKUP_FOLLOW , & old_path ) ;
2005-04-16 15:20:36 -07:00
if ( err )
return err ;
2011-03-18 08:55:38 -04:00
err = lock_mount ( path ) ;
if ( err )
goto out ;
2005-04-16 15:20:36 -07:00
err = - EINVAL ;
2008-08-02 00:51:11 -04:00
if ( IS_MNT_UNBINDABLE ( old_path . mnt ) )
2011-03-18 08:55:38 -04:00
goto out2 ;
2005-11-07 17:21:20 -05:00
2008-08-02 00:51:11 -04:00
if ( ! check_mnt ( path - > mnt ) | | ! check_mnt ( old_path . mnt ) )
2011-03-18 08:55:38 -04:00
goto out2 ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:15:04 -05:00
err = - ENOMEM ;
if ( recurse )
2008-08-02 00:51:11 -04:00
mnt = copy_tree ( old_path . mnt , old_path . dentry , 0 ) ;
2005-11-07 17:15:04 -05:00
else
2008-08-02 00:51:11 -04:00
mnt = clone_mnt ( old_path . mnt , old_path . dentry , 0 ) ;
2005-11-07 17:15:04 -05:00
if ( ! mnt )
2011-03-18 08:55:38 -04:00
goto out2 ;
2005-11-07 17:15:04 -05:00
2008-08-02 00:51:11 -04:00
err = graft_tree ( mnt , path ) ;
2005-11-07 17:15:04 -05:00
if ( err ) {
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-11-07 17:20:17 -05:00
umount_tree ( mnt , 0 , & umount_list ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:16:29 -05:00
}
2011-03-18 08:55:38 -04:00
out2 :
unlock_mount ( path ) ;
release_mounts ( & umount_list ) ;
2005-11-07 17:15:04 -05:00
out :
2008-08-02 00:51:11 -04:00
path_put ( & old_path ) ;
2005-04-16 15:20:36 -07:00
return err ;
}
2008-02-15 14:38:00 -08:00
static int change_mount_flags ( struct vfsmount * mnt , int ms_flags )
{
int error = 0 ;
int readonly_request = 0 ;
if ( ms_flags & MS_RDONLY )
readonly_request = 1 ;
if ( readonly_request = = __mnt_is_readonly ( mnt ) )
return 0 ;
if ( readonly_request )
error = mnt_make_readonly ( mnt ) ;
else
__mnt_unmake_readonly ( mnt ) ;
return error ;
}
2005-04-16 15:20:36 -07:00
/*
* change filesystem flags . dir should be a physical root of filesystem .
* If you ' ve mounted a non - root directory somewhere and want to do remount
* on it - tough luck .
*/
2008-08-02 00:55:27 -04:00
static int do_remount ( struct path * path , int flags , int mnt_flags ,
2005-04-16 15:20:36 -07:00
void * data )
{
int err ;
2008-08-02 00:51:11 -04:00
struct super_block * sb = path - > mnt - > mnt_sb ;
2005-04-16 15:20:36 -07:00
if ( ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
2008-08-02 00:51:11 -04:00
if ( ! check_mnt ( path - > mnt ) )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2008-08-02 00:51:11 -04:00
if ( path - > dentry ! = path - > mnt - > mnt_root )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2011-03-03 16:09:14 -05:00
err = security_sb_remount ( sb , data ) ;
if ( err )
return err ;
2005-04-16 15:20:36 -07:00
down_write ( & sb - > s_umount ) ;
2008-02-15 14:38:00 -08:00
if ( flags & MS_BIND )
2008-08-02 00:51:11 -04:00
err = change_mount_flags ( path - > mnt , flags ) ;
2009-05-08 13:36:58 -04:00
else
2008-02-15 14:38:00 -08:00
err = do_remount_sb ( sb , flags , data , 0 ) ;
2010-01-16 13:01:26 -05:00
if ( ! err ) {
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2010-01-26 14:20:47 -05:00
mnt_flags | = path - > mnt - > mnt_flags & MNT_PROPAGATION_MASK ;
2008-08-02 00:51:11 -04:00
path - > mnt - > mnt_flags = mnt_flags ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2010-01-16 13:01:26 -05:00
}
2005-04-16 15:20:36 -07:00
up_write ( & sb - > s_umount ) ;
2008-09-26 19:01:20 -07:00
if ( ! err ) {
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2008-09-26 19:01:20 -07:00
touch_mnt_namespace ( path - > mnt - > mnt_ns ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2008-09-26 19:01:20 -07:00
}
2005-04-16 15:20:36 -07:00
return err ;
}
2005-11-07 17:21:20 -05:00
static inline int tree_contains_unbindable ( struct vfsmount * mnt )
{
struct vfsmount * p ;
for ( p = mnt ; p ; p = next_mnt ( p , mnt ) ) {
if ( IS_MNT_UNBINDABLE ( p ) )
return 1 ;
}
return 0 ;
}
2008-08-02 00:55:27 -04:00
static int do_move_mount ( struct path * path , char * old_name )
2005-04-16 15:20:36 -07:00
{
2008-08-02 00:51:11 -04:00
struct path old_path , parent_path ;
2005-04-16 15:20:36 -07:00
struct vfsmount * p ;
int err = 0 ;
if ( ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
if ( ! old_name | | ! * old_name )
return - EINVAL ;
2008-08-02 00:51:11 -04:00
err = kern_path ( old_name , LOOKUP_FOLLOW , & old_path ) ;
2005-04-16 15:20:36 -07:00
if ( err )
return err ;
2011-03-18 08:55:38 -04:00
err = lock_mount ( path ) ;
Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 on success, letting the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required, as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock, as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14 18:45:26 +00:00
if ( err < 0 )
goto out ;
2005-04-16 15:20:36 -07:00
err = - EINVAL ;
2008-08-02 00:51:11 -04:00
if ( ! check_mnt ( path - > mnt ) | | ! check_mnt ( old_path . mnt ) )
2005-04-16 15:20:36 -07:00
goto out1 ;
2009-05-04 03:32:03 +04:00
if ( d_unlinked ( path - > dentry ) )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
err = - EINVAL ;
2008-08-02 00:51:11 -04:00
if ( old_path . dentry ! = old_path . mnt - > mnt_root )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
2008-08-02 00:51:11 -04:00
if ( old_path . mnt = = old_path . mnt - > mnt_parent )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
2008-08-02 00:51:11 -04:00
if ( S_ISDIR ( path - > dentry - > d_inode - > i_mode ) ! =
S_ISDIR ( old_path . dentry - > d_inode - > i_mode ) )
2005-11-07 17:20:03 -05:00
goto out1 ;
/*
* Don ' t move a mount residing in a shared parent .
*/
2008-08-02 00:51:11 -04:00
if ( old_path . mnt - > mnt_parent & &
IS_MNT_SHARED ( old_path . mnt - > mnt_parent ) )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-11-07 17:21:20 -05:00
/*
* Don ' t move a mount tree containing unbindable mounts to a destination
* mount which is shared .
*/
2008-08-02 00:51:11 -04:00
if ( IS_MNT_SHARED ( path - > mnt ) & &
tree_contains_unbindable ( old_path . mnt ) )
2005-11-07 17:21:20 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
err = - ELOOP ;
2008-08-02 00:51:11 -04:00
for ( p = path - > mnt ; p - > mnt_parent ! = p ; p = p - > mnt_parent )
if ( p = = old_path . mnt )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
2008-08-02 00:51:11 -04:00
err = attach_recursive_mnt ( old_path . mnt , path , & parent_path ) ;
2008-02-14 19:34:32 -08:00
if ( err )
2005-11-07 17:20:03 -05:00
goto out1 ;
2005-04-16 15:20:36 -07:00
/* if the mount is moved, it should no longer be expired
* automatically */
2008-08-02 00:51:11 -04:00
list_del_init ( & old_path . mnt - > mnt_expire ) ;
2005-04-16 15:20:36 -07:00
out1 :
2011-03-18 08:55:38 -04:00
unlock_mount ( path ) ;
2005-04-16 15:20:36 -07:00
out :
if ( ! err )
2008-03-21 20:48:19 -04:00
path_put ( & parent_path ) ;
2008-08-02 00:51:11 -04:00
path_put ( & old_path ) ;
2005-04-16 15:20:36 -07:00
return err ;
}
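The -ELOOP check in do_move_mount() walks from the destination mount up towards the namespace root and refuses the move if the mount being moved shows up on the way. The following is a simplified sketch of that walk. struct mnt_node and would_create_loop() are hypothetical names, and the only property borrowed from the code above is that a root mount is its own parent.

```c
#include <stdbool.h>

/* Hypothetical minimal mount node; the root points to itself. */
struct mnt_node {
	struct mnt_node *parent;
};

/*
 * True if 'candidate' is 'dest' or one of its ancestors below the root.
 * Like the loop above, the walk stops before examining the root itself.
 */
static bool would_create_loop(const struct mnt_node *dest,
			      const struct mnt_node *candidate)
{
	const struct mnt_node *p;

	for (p = dest; p->parent != p; p = p->parent)
		if (p == candidate)
			return true;
	return false;
}
```

Leaving the root out of the comparison loses nothing here, because do_move_mount() has already rejected a source that is its own parent before this loop runs.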
2011-03-17 22:08:28 -04:00
static struct vfsmount * fs_set_subtype ( struct vfsmount * mnt , const char * fstype )
{
int err ;
const char * subtype = strchr ( fstype , ' . ' ) ;
if ( subtype ) {
subtype + + ;
err = - EINVAL ;
if ( ! subtype [ 0 ] )
goto err ;
} else
subtype = " " ;
mnt - > mnt_sb - > s_subtype = kstrdup ( subtype , GFP_KERNEL ) ;
err = - ENOMEM ;
if ( ! mnt - > mnt_sb - > s_subtype )
goto err ;
return mnt ;
err :
mntput ( mnt ) ;
return ERR_PTR ( err ) ;
}
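The subtype rule in fs_set_subtype() is easy to check in isolation: everything after the first '.' in the filesystem type is the subtype, a missing '.' means an empty subtype, and a trailing '.' is an error. The following is a small sketch under those assumptions. subtype_of() is a hypothetical helper that returns NULL where the kernel code returns -EINVAL, and it skips the kstrdup() allocation.

```c
#include <string.h>

/*
 * Returns a pointer to the subtype inside 'fstype', "" when there is no
 * '.', or NULL when the '.' is the last character.
 */
static const char *subtype_of(const char *fstype)
{
	const char *dot = strchr(fstype, '.');

	if (!dot)
		return "";	/* no subtype given */
	if (dot[1] == '\0')
		return NULL;	/* "type." is rejected by the code above */
	return dot + 1;		/* text after the first '.' */
}
```

For a type string such as "fuse.sshfs" this yields "sshfs", which is the string the kernel code would duplicate into s_subtype.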
struct vfsmount *
do_kern_mount ( const char * fstype , int flags , const char * name , void * data )
{
struct file_system_type * type = get_fs_type ( fstype ) ;
struct vfsmount * mnt ;
if ( ! type )
return ERR_PTR ( - ENODEV ) ;
mnt = vfs_kern_mount ( type , flags , name , data ) ;
if ( ! IS_ERR ( mnt ) & & ( type - > fs_flags & FS_HAS_SUBTYPE ) & &
! mnt - > mnt_sb - > s_subtype )
mnt = fs_set_subtype ( mnt , fstype ) ;
put_filesystem ( type ) ;
return mnt ;
}
EXPORT_SYMBOL_GPL ( do_kern_mount ) ;
/*
* add a mount into a namespace ' s mount tree
*/
static int do_add_mount ( struct vfsmount * newmnt , struct path * path , int mnt_flags )
{
int err ;
mnt_flags & = ~ ( MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL ) ;
2011-03-18 08:55:38 -04:00
err = lock_mount ( path ) ;
if ( err )
return err ;
2011-03-17 22:08:28 -04:00
err = - EINVAL ;
if ( ! ( mnt_flags & MNT_SHRINKABLE ) & & ! check_mnt ( path - > mnt ) )
goto unlock ;
/* Refuse the same filesystem on the same mount point */
err = - EBUSY ;
if ( path - > mnt - > mnt_sb = = newmnt - > mnt_sb & &
path - > mnt - > mnt_root = = path - > dentry )
goto unlock ;
err = - EINVAL ;
if ( S_ISLNK ( newmnt - > mnt_root - > d_inode - > i_mode ) )
goto unlock ;
newmnt - > mnt_flags = mnt_flags ;
err = graft_tree ( newmnt , path ) ;
unlock :
2011-03-18 08:55:38 -04:00
unlock_mount ( path ) ;
2011-03-17 22:08:28 -04:00
return err ;
}
2011-01-17 01:47:59 -05:00
2005-04-16 15:20:36 -07:00
/*
* create a new mount for userspace and request it to be added into the
* namespace ' s tree
*/
2008-08-02 00:55:27 -04:00
static int do_new_mount ( struct path * path , char * type , int flags ,
2005-04-16 15:20:36 -07:00
int mnt_flags , char * name , void * data )
{
struct vfsmount * mnt ;
2011-01-17 01:41:58 -05:00
int err ;
2005-04-16 15:20:36 -07:00
fs: fix overflow in sys_mount() for in-kernel calls
sys_mount() reads/copies a whole page for its "type" parameter. When
do_mount_root() passes a kernel address that points to an object which is
smaller than a whole page, copy_mount_options() will happily go past this
memory object, possibly dereferencing "wild" pointers that could be in any
state (hence the kmemcheck warning, which shows that parts of the next
page are not even allocated).
(The likelihood of something going wrong here is pretty low -- first of
all this only applies to kernel calls to sys_mount(), which are mostly
found in the boot code. Secondly, I guess if the page was not mapped,
exact_copy_from_user() _would_ in fact handle it correctly because of its
access_ok(), etc. checks.)
But it is much nicer to avoid the dubious reads altogether, by stopping as
soon as we find a NUL byte. Is there a good reason why we can't do
something like this, using the already existing strndup_from_user()?
[akpm@linux-foundation.org: make copy_mount_string() static]
[AV: fix compat mount breakage, which involves undoing akpm's change above]
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: al <al@dizzy.pdmi.ras.ru>
2009-09-18 13:05:45 -07:00
if ( ! type )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
/* we need capabilities... */
if ( ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
mnt = do_kern_mount ( type , flags , name , data ) ;
if ( IS_ERR ( mnt ) )
return PTR_ERR ( mnt ) ;
2011-01-17 01:41:58 -05:00
err = do_add_mount ( mnt , path , mnt_flags ) ;
if ( err )
mntput ( mnt ) ;
return err ;
2005-04-16 15:20:36 -07:00
}
2011-01-17 01:35:23 -05:00
int finish_automount ( struct vfsmount * m , struct path * path )
{
int err ;
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
*/
BUG_ON ( mnt_get_count ( m ) < 2 ) ;
if ( m - > mnt_sb = = path - > mnt - > mnt_sb & &
m - > mnt_root = = path - > dentry ) {
2011-01-17 01:47:59 -05:00
err = - ELOOP ;
goto fail ;
2011-01-17 01:35:23 -05:00
}
err = do_add_mount ( m , path , path - > mnt - > mnt_flags | MNT_SHRINKABLE ) ;
2011-01-17 01:47:59 -05:00
if ( ! err )
return 0 ;
fail :
/* remove m from any expiration list it may be on */
if ( ! list_empty ( & m - > mnt_expire ) ) {
down_write ( & namespace_sem ) ;
br_write_lock ( vfsmount_lock ) ;
list_del_init ( & m - > mnt_expire ) ;
br_write_unlock ( vfsmount_lock ) ;
up_write ( & namespace_sem ) ;
2011-01-17 01:35:23 -05:00
}
2011-01-17 01:47:59 -05:00
mntput ( m ) ;
mntput ( m ) ;
2011-01-17 01:35:23 -05:00
return err ;
}
2011-01-14 19:10:03 +00:00
/**
* mnt_set_expiry - Put a mount on an expiration list
* @ mnt : The mount to list .
* @ expiry_list : The list to add the mount to .
*/
void mnt_set_expiry ( struct vfsmount * mnt , struct list_head * expiry_list )
{
down_write ( & namespace_sem ) ;
br_write_lock ( vfsmount_lock ) ;
list_add_tail ( & mnt - > mnt_expire , expiry_list ) ;
br_write_unlock ( vfsmount_lock ) ;
up_write ( & namespace_sem ) ;
}
EXPORT_SYMBOL ( mnt_set_expiry ) ;
2005-04-16 15:20:36 -07:00
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren ' t in use and haven ' t been touched since last we came
* here
*/
void mark_mounts_for_expiry ( struct list_head * mounts )
{
struct vfsmount * mnt , * next ;
LIST_HEAD ( graveyard ) ;
2008-03-22 00:21:53 -04:00
LIST_HEAD ( umounts ) ;
2005-04-16 15:20:36 -07:00
if ( list_empty ( mounts ) )
return ;
2008-03-22 00:21:53 -04:00
down_write ( & namespace_sem ) ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
/* extract from the expiration list every vfsmount that matches the
* following criteria :
* - only referenced by its parent vfsmount
* - still marked for expiry ( marked on the last call here ; marks are
* cleared by mntput ( ) )
*/
2005-07-07 17:57:30 -07:00
list_for_each_entry_safe ( mnt , next , mounts , mnt_expire ) {
2005-04-16 15:20:36 -07:00
if ( ! xchg ( & mnt - > mnt_expiry_mark , 1 ) | |
2008-03-22 00:21:53 -04:00
propagate_mount_busy ( mnt , 1 ) )
2005-04-16 15:20:36 -07:00
continue ;
2005-07-07 17:57:30 -07:00
list_move ( & mnt - > mnt_expire , & graveyard ) ;
2005-04-16 15:20:36 -07:00
}
2008-03-22 00:21:53 -04:00
while ( ! list_empty ( & graveyard ) ) {
mnt = list_first_entry ( & graveyard , struct vfsmount , mnt_expire ) ;
touch_mnt_namespace ( mnt - > mnt_ns ) ;
umount_tree ( mnt , 1 , & umounts ) ;
}
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2008-03-22 00:21:53 -04:00
up_write ( & namespace_sem ) ;
release_mounts ( & umounts ) ;
2006-06-09 09:34:17 -04:00
}
EXPORT_SYMBOL_GPL ( mark_mounts_for_expiry ) ;
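mark_mounts_for_expiry() is a two-pass scheme: a mount is only discarded if it was already marked by the previous sweep, has not been used since, and is not busy. The following is a minimal sketch of that decision for a single entry. struct expirable, sweep_should_expire() and touch() are hypothetical names, atomic_exchange() stands in for the kernel's xchg(), and the busy flag stands in for propagate_mount_busy().

```c
#include <stdatomic.h>
#include <stdbool.h>

struct expirable {
	atomic_int expiry_mark;	/* set by a sweep, cleared on use */
	bool busy;		/* stand-in for propagate_mount_busy() */
};

/* One sweep over one entry: true if it should be unmounted now. */
static bool sweep_should_expire(struct expirable *e)
{
	/* Not marked before: mark it now and give it one more period. */
	if (!atomic_exchange(&e->expiry_mark, 1))
		return false;
	/* Marked on the previous sweep and untouched since. */
	return !e->busy;
}

/* Any use of the mount clears the mark, as mntput() does in the kernel. */
static void touch(struct expirable *e)
{
	atomic_store(&e->expiry_mark, 0);
}
```

A caller would run the sweep periodically. An entry survives as long as something touches it between two consecutive sweeps.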
/*
* Ripoff of ' select_parent ( ) '
*
* search the list of submounts for a given mountpoint , and move any
* shrinkable submounts to the ' graveyard ' list .
*/
static int select_submounts ( struct vfsmount * parent , struct list_head * graveyard )
{
struct vfsmount * this_parent = parent ;
struct list_head * next ;
int found = 0 ;
repeat :
next = this_parent - > mnt_mounts . next ;
resume :
while ( next ! = & this_parent - > mnt_mounts ) {
struct list_head * tmp = next ;
struct vfsmount * mnt = list_entry ( tmp , struct vfsmount , mnt_child ) ;
next = tmp - > next ;
if ( ! ( mnt - > mnt_flags & MNT_SHRINKABLE ) )
2005-04-16 15:20:36 -07:00
continue ;
2006-06-09 09:34:17 -04:00
/*
* Descend a level if the mnt_mounts list is non - empty .
*/
if ( ! list_empty ( & mnt - > mnt_mounts ) ) {
this_parent = mnt ;
goto repeat ;
}
2005-04-16 15:20:36 -07:00
2006-06-09 09:34:17 -04:00
if ( ! propagate_mount_busy ( mnt , 1 ) ) {
list_move_tail ( & mnt - > mnt_expire , graveyard ) ;
found + + ;
}
2005-04-16 15:20:36 -07:00
}
2006-06-09 09:34:17 -04:00
/*
* All done at this level . . . ascend and resume the search
*/
if ( this_parent ! = parent ) {
next = this_parent - > mnt_child . next ;
this_parent = this_parent - > mnt_parent ;
goto resume ;
}
return found ;
}
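/*
 * Illustration only, not part of the kernel source. The repeat/resume
 * pattern used by select_submounts() can be sketched in userspace as a
 * non-recursive tree walk. The node layout, add_child() and select_leaves()
 * are hypothetical stand-ins: `shrinkable` plays the role of MNT_SHRINKABLE
 * and `selected` the role of being moved to the graveyard. As in the
 * original, a node that has children is descended into rather than selected,
 * and the subtree of a non-shrinkable node is skipped entirely.
 */

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
	int shrinkable;	/* stand-in for MNT_SHRINKABLE */
	int selected;	/* stand-in for "moved to the graveyard" */
};

/* Link child at the head of parent's child list. */
static void add_child(struct node *parent, struct node *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/* Mark every shrinkable leaf reachable through shrinkable ancestors. */
static int select_leaves(struct node *root)
{
	struct node *this_parent = root;
	struct node *next;
	int found = 0;

	assert(root != NULL);
repeat:
	next = this_parent->first_child;
resume:
	while (next) {
		struct node *n = next;

		next = n->next_sibling;
		if (!n->shrinkable)
			continue;
		/* Descend a level if this node has children. */
		if (n->first_child) {
			this_parent = n;
			goto repeat;
		}
		if (!n->selected) {
			n->selected = 1;
			found++;
		}
	}
	/* All done at this level: ascend and resume after this_parent. */
	if (this_parent != root) {
		next = this_parent->next_sibling;
		this_parent = this_parent->parent;
		goto resume;
	}
	return found;
}
```

/*
 * The parent pointer is what makes the ascent possible without a stack;
 * the caller can repeat the walk until it returns 0, much as
 * shrink_submounts() loops on select_submounts().
 */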
/*
* process a list of expirable mountpoints with the intent of discarding any
* submounts of a specific parent mountpoint
2010-08-18 04:37:39 +10:00
*
* vfsmount_lock must be held for write
2006-06-09 09:34:17 -04:00
*/
2008-03-22 00:46:23 -04:00
static void shrink_submounts ( struct vfsmount * mnt , struct list_head * umounts )
2006-06-09 09:34:17 -04:00
{
LIST_HEAD ( graveyard ) ;
2008-03-22 00:46:23 -04:00
struct vfsmount * m ;
2006-06-09 09:34:17 -04:00
/* extract submounts of 'mountpoint' from the expiration list */
2008-03-22 00:46:23 -04:00
while ( select_submounts ( mnt , & graveyard ) ) {
2008-03-22 00:21:53 -04:00
while ( ! list_empty ( & graveyard ) ) {
2008-03-22 00:46:23 -04:00
m = list_first_entry ( & graveyard , struct vfsmount ,
2008-03-22 00:21:53 -04:00
mnt_expire ) ;
2008-11-12 13:26:54 -08:00
touch_mnt_namespace ( m - > mnt_ns ) ;
umount_tree ( m , 1 , umounts ) ;
2008-03-22 00:21:53 -04:00
}
}
2005-04-16 15:20:36 -07:00
}
/*
* Some copy_from_user ( ) implementations do not return the exact number of
* bytes remaining to copy on a fault . But copy_mount_options ( ) requires that .
* Note that this function differs from copy_from_user ( ) in that it will oops
* on bad values of ` to ' , rather than returning a short copy .
*/
2005-11-07 17:16:09 -05:00
static long exact_copy_from_user ( void * to , const void __user * from ,
unsigned long n )
2005-04-16 15:20:36 -07:00
{
char * t = to ;
const char __user * f = from ;
char c ;
if ( ! access_ok ( VERIFY_READ , from , n ) )
return n ;
while ( n ) {
if ( __get_user ( c , f ) ) {
memset ( t , 0 , n ) ;
break ;
}
* t + + = c ;
f + + ;
n - - ;
}
return n ;
}
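/*
 * Illustration only, not part of the kernel source. A userspace sketch of
 * the contract of exact_copy_from_user(): copy byte by byte, and on the
 * first fault zero the rest of the destination and return the exact number
 * of bytes NOT copied. struct fake_user_buf and fake_get_byte() are
 * hypothetical stand-ins for a user mapping and for __get_user(); a read at
 * or beyond `valid` simulates a fault.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user mapping: only the first `valid` bytes are readable. */
struct fake_user_buf {
	const char *base;
	size_t valid;
};

/* Hypothetical __get_user() stand-in: 0 on success, -1 on a simulated fault. */
static int fake_get_byte(const struct fake_user_buf *src, size_t off, char *out)
{
	if (off >= src->valid)
		return -1;
	*out = src->base[off];
	return 0;
}

/* Returns the number of bytes that could not be copied. */
static size_t exact_copy(char *to, const struct fake_user_buf *from, size_t n)
{
	size_t off = 0;
	char c;

	assert(to != NULL && from != NULL);
	while (n) {
		if (fake_get_byte(from, off, &c)) {
			/* Fault: zero the remaining n bytes and stop. */
			memset(to + off, 0, n);
			break;
		}
		to[off++] = c;
		n--;
	}
	return n;
}
```

/*
 * A caller can recover the number of bytes actually copied as
 * `size - exact_copy(...)`, which is the computation copy_mount_options()
 * relies on.
 */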
2005-11-07 17:16:09 -05:00
int copy_mount_options ( const void __user * data , unsigned long * where )
2005-04-16 15:20:36 -07:00
{
int i ;
unsigned long page ;
unsigned long size ;
2005-11-07 17:16:09 -05:00
2005-04-16 15:20:36 -07:00
* where = 0 ;
if ( ! data )
return 0 ;
if ( ! ( page = __get_free_page ( GFP_KERNEL ) ) )
return - ENOMEM ;
/* We only care that *some* data at the address the user
* gave us is valid . Just in case , we ' ll zero
* the remainder of the page .
*/
/* copy_from_user cannot cross TASK_SIZE ! */
size = TASK_SIZE - ( unsigned long ) data ;
if ( size > PAGE_SIZE )
size = PAGE_SIZE ;
i = size - exact_copy_from_user ( ( void * ) page , data , size ) ;
if ( ! i ) {
2005-11-07 17:16:09 -05:00
free_page ( page ) ;
2005-04-16 15:20:36 -07:00
return - EFAULT ;
}
if ( i ! = PAGE_SIZE )
memset ( ( char * ) page + i , 0 , PAGE_SIZE - i ) ;
* where = page ;
return 0 ;
}
fs: fix overflow in sys_mount() for in-kernel calls
sys_mount() reads/copies a whole page for its "type" parameter. When
do_mount_root() passes a kernel address that points to an object which is
smaller than a whole page, copy_mount_options() will happily go past this
memory object, possibly dereferencing "wild" pointers that could be in any
state (hence the kmemcheck warning, which shows that parts of the next
page are not even allocated).
(The likelihood of something going wrong here is pretty low -- first of
all this only applies to kernel calls to sys_mount(), which are mostly
found in the boot code. Secondly, I guess if the page was not mapped,
exact_copy_from_user() _would_ in fact handle it correctly because of its
access_ok(), etc. checks.)
But it is much nicer to avoid the dubious reads altogether, by stopping as
soon as we find a NUL byte. Is there a good reason why we can't do
something like this, using the already existing strndup_from_user()?
[akpm@linux-foundation.org: make copy_mount_string() static]
[AV: fix compat mount breakage, which involves undoing akpm's change above]
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: al <al@dizzy.pdmi.ras.ru>
2009-09-18 13:05:45 -07:00
int copy_mount_string ( const void __user * data , char * * where )
{
char * tmp ;
if ( ! data ) {
* where = NULL ;
return 0 ;
}
tmp = strndup_user ( data , PAGE_SIZE ) ;
if ( IS_ERR ( tmp ) )
return PTR_ERR ( tmp ) ;
* where = tmp ;
return 0 ;
}
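/*
 * Illustration only, not part of the kernel source. A userspace sketch of
 * the idea behind copy_mount_string(): stop at the first NUL instead of
 * reading a whole page, and reject strings that have no NUL within the
 * limit. dup_bounded() is a hypothetical helper built on malloc(); it is
 * not how strndup_user() is implemented, and the choice of -EINVAL for an
 * over-long string is an assumption of this sketch.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate src into *where, reading at most max bytes of it.
 * A NULL src is not an error and yields *where == NULL.
 */
static int dup_bounded(const char *src, size_t max, char **where)
{
	size_t len = 0;
	char *tmp;

	assert(where != NULL);
	if (!src) {
		*where = NULL;
		return 0;
	}
	/* Find the terminating NUL without looking past max bytes. */
	while (len < max && src[len] != '\0')
		len++;
	if (len == max)
		return -EINVAL;	/* no NUL within the limit */

	tmp = malloc(len + 1);
	if (!tmp)
		return -ENOMEM;
	memcpy(tmp, src, len + 1);	/* includes the NUL */
	*where = tmp;
	return 0;
}
```

/*
 * The caller owns *where on success and releases it with free(), mirroring
 * the kfree() calls in the mount syscall's cleanup path.
 */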
2005-04-16 15:20:36 -07:00
/*
* Flags is a 32 - bit value that allows up to 31 non - fs dependent flags to
* be given to the mount ( ) call ( ie : read - only , no - dev , no - suid etc ) .
*
* data is a ( void * ) that can point to any structure up to
* PAGE_SIZE - 1 bytes , which can contain arbitrary fs - dependent
* information ( or be NULL ) .
*
* Pre - 0.97 versions of mount ( ) didn ' t have a flags word .
* When the flags word was introduced its top half was required
* to have the magic value 0xC0ED , and this remained so until 2.4 .0 - test9 .
* Therefore , if this magic number is present , it carries no information
* and must be discarded .
*/
2005-11-07 17:16:09 -05:00
long do_mount ( char * dev_name , char * dir_name , char * type_page ,
2005-04-16 15:20:36 -07:00
unsigned long flags , void * data_page )
{
2008-08-02 00:51:11 -04:00
struct path path ;
2005-04-16 15:20:36 -07:00
int retval = 0 ;
int mnt_flags = 0 ;
/* Discard magic */
if ( ( flags & MS_MGC_MSK ) = = MS_MGC_VAL )
flags & = ~ MS_MGC_MSK ;
/* Basic sanity checks */
if ( ! dir_name | | ! * dir_name | | ! memchr ( dir_name , 0 , PAGE_SIZE ) )
return - EINVAL ;
if ( data_page )
( ( char * ) data_page ) [ PAGE_SIZE - 1 ] = 0 ;
2009-10-04 21:49:49 +09:00
/* ... and get the mountpoint */
retval = kern_path ( dir_name , LOOKUP_FOLLOW , & path ) ;
if ( retval )
return retval ;
retval = security_sb_mount ( dev_name , & path ,
type_page , flags , data_page ) ;
if ( retval )
goto dput_out ;
2009-04-19 18:40:43 +02:00
/* Default to relatime unless overridden */
if ( ! ( flags & MS_NOATIME ) )
mnt_flags | = MNT_RELATIME ;
2009-03-26 17:53:14 +00:00
2005-04-16 15:20:36 -07:00
/* Separate the per-mountpoint flags */
if ( flags & MS_NOSUID )
mnt_flags | = MNT_NOSUID ;
if ( flags & MS_NODEV )
mnt_flags | = MNT_NODEV ;
if ( flags & MS_NOEXEC )
mnt_flags | = MNT_NOEXEC ;
2006-01-09 20:52:17 -08:00
if ( flags & MS_NOATIME )
mnt_flags | = MNT_NOATIME ;
if ( flags & MS_NODIRATIME )
mnt_flags | = MNT_NODIRATIME ;
2009-03-26 17:49:56 +00:00
if ( flags & MS_STRICTATIME )
mnt_flags & = ~ ( MNT_RELATIME | MNT_NOATIME ) ;
2008-02-15 14:38:00 -08:00
if ( flags & MS_RDONLY )
mnt_flags | = MNT_READONLY ;
2006-01-09 20:52:17 -08:00
2010-08-09 12:05:43 -04:00
flags & = ~ ( MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
2009-03-26 17:49:56 +00:00
MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_KERNMOUNT |
MS_STRICTATIME ) ;
2005-04-16 15:20:36 -07:00
if ( flags & MS_REMOUNT )
2008-08-02 00:51:11 -04:00
retval = do_remount ( & path , flags & ~ MS_REMOUNT , mnt_flags ,
2005-04-16 15:20:36 -07:00
data_page ) ;
else if ( flags & MS_BIND )
2008-08-02 00:51:11 -04:00
retval = do_loopback ( & path , dev_name , flags & MS_REC ) ;
2005-11-07 17:21:20 -05:00
else if ( flags & ( MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE ) )
2008-08-02 00:51:11 -04:00
retval = do_change_type ( & path , flags ) ;
2005-04-16 15:20:36 -07:00
else if ( flags & MS_MOVE )
2008-08-02 00:51:11 -04:00
retval = do_move_mount ( & path , dev_name ) ;
2005-04-16 15:20:36 -07:00
else
2008-08-02 00:51:11 -04:00
retval = do_new_mount ( & path , type_page , flags , mnt_flags ,
2005-04-16 15:20:36 -07:00
dev_name , data_page ) ;
dput_out :
2008-08-02 00:51:11 -04:00
path_put ( & path ) ;
2005-04-16 15:20:36 -07:00
return retval ;
}
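/*
 * Illustration only, not part of the kernel source. The flag handling at
 * the top of do_mount() can be sketched as a pure function: discard the
 * legacy magic, derive the per-mountpoint flags, then strip the bits that
 * have been consumed. The EX_* names are hypothetical, and their numeric
 * values are assumptions chosen to resemble the kernel headers; only their
 * distinctness matters here. MS_ACTIVE, MS_BORN and MS_KERNMOUNT are left
 * out of the sketch.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define EX_MS_RDONLY		1UL
#define EX_MS_NOSUID		2UL
#define EX_MS_NODEV		4UL
#define EX_MS_NOEXEC		8UL
#define EX_MS_NOATIME		1024UL
#define EX_MS_NODIRATIME	2048UL
#define EX_MS_RELATIME		(1UL << 21)
#define EX_MS_STRICTATIME	(1UL << 24)
#define EX_MS_MGC_VAL		0xC0ED0000UL
#define EX_MS_MGC_MSK		0xFFFF0000UL

#define EX_MNT_NOSUID		0x01
#define EX_MNT_NODEV		0x02
#define EX_MNT_NOEXEC		0x04
#define EX_MNT_NOATIME		0x08
#define EX_MNT_NODIRATIME	0x10
#define EX_MNT_RELATIME		0x20
#define EX_MNT_READONLY		0x40

/* Returns the per-mountpoint flags and updates *flags in place. */
static int split_mount_flags(unsigned long *flags)
{
	int mnt_flags = 0;

	assert(flags != NULL);

	/* The magic carries no information: discard it. */
	if ((*flags & EX_MS_MGC_MSK) == EX_MS_MGC_VAL)
		*flags &= ~EX_MS_MGC_MSK;

	/* Default to relatime unless overridden. */
	if (!(*flags & EX_MS_NOATIME))
		mnt_flags |= EX_MNT_RELATIME;

	if (*flags & EX_MS_NOSUID)
		mnt_flags |= EX_MNT_NOSUID;
	if (*flags & EX_MS_NODEV)
		mnt_flags |= EX_MNT_NODEV;
	if (*flags & EX_MS_NOEXEC)
		mnt_flags |= EX_MNT_NOEXEC;
	if (*flags & EX_MS_NOATIME)
		mnt_flags |= EX_MNT_NOATIME;
	if (*flags & EX_MS_NODIRATIME)
		mnt_flags |= EX_MNT_NODIRATIME;
	if (*flags & EX_MS_STRICTATIME)
		mnt_flags &= ~(EX_MNT_RELATIME | EX_MNT_NOATIME);
	if (*flags & EX_MS_RDONLY)
		mnt_flags |= EX_MNT_READONLY;

	/* Strip the bits that are now represented in mnt_flags. */
	*flags &= ~(EX_MS_NOSUID | EX_MS_NOEXEC | EX_MS_NODEV |
		    EX_MS_NOATIME | EX_MS_NODIRATIME | EX_MS_RELATIME |
		    EX_MS_STRICTATIME);
	return mnt_flags;
}
```

/*
 * Note that the read-only bit is mirrored into the per-mount flags but is
 * not stripped from *flags, matching the mask used in do_mount() above.
 */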
2009-06-22 15:09:13 -04:00
static struct mnt_namespace * alloc_mnt_ns ( void )
{
struct mnt_namespace * new_ns ;
new_ns = kmalloc ( sizeof ( struct mnt_namespace ) , GFP_KERNEL ) ;
if ( ! new_ns )
return ERR_PTR ( - ENOMEM ) ;
atomic_set ( & new_ns - > count , 1 ) ;
new_ns - > root = NULL ;
INIT_LIST_HEAD ( & new_ns - > list ) ;
init_waitqueue_head ( & new_ns - > poll ) ;
new_ns - > event = 0 ;
return new_ns ;
}
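/*
 * Illustration only, not part of the kernel source. alloc_mnt_ns() reports
 * failure by encoding an errno value in the returned pointer. A userspace
 * sketch of that convention follows; the ex_* helpers and struct ex_ns are
 * hypothetical, and the 4095 bound is an assumption of this sketch about
 * how many error codes are reserved at the top of the address space.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *ex_err_ptr(long err)
{
	assert(err < 0 && err >= -EX_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static long ex_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the top EX_MAX_ERRNO addresses, i.e. encodes an error. */
static int ex_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-EX_MAX_ERRNO);
}

struct ex_ns {
	int count;
};

/* Allocate a namespace-like object; simulate_failure forces the error path. */
static struct ex_ns *ex_alloc_ns(int simulate_failure)
{
	struct ex_ns *ns = simulate_failure ? NULL : malloc(sizeof(*ns));

	if (!ns)
		return ex_err_ptr(-ENOMEM);
	ns->count = 1;
	return ns;
}
```

/*
 * Callers test the result with ex_is_err() rather than comparing against
 * NULL, which is why dup_mnt_ns() and create_mnt_ns() check IS_ERR().
 */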
2011-01-14 22:30:21 -05:00
void mnt_make_longterm ( struct vfsmount * mnt )
{
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( mnt ) ;
2011-01-14 22:30:21 -05:00
}
void mnt_make_shortterm ( struct vfsmount * mnt )
{
2011-01-16 16:32:11 -05:00
# ifdef CONFIG_SMP
2011-01-14 22:30:21 -05:00
if ( atomic_add_unless ( & mnt - > mnt_longterm , - 1 , 1 ) )
return ;
br_write_lock ( vfsmount_lock ) ;
atomic_dec ( & mnt - > mnt_longterm ) ;
br_write_unlock ( vfsmount_lock ) ;
2011-01-16 16:32:11 -05:00
# endif
2011-01-14 22:30:21 -05:00
}
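/*
 * Illustration only, not part of the kernel source. mnt_make_shortterm()
 * decrements a counter locklessly unless the counter is 1; the final
 * 1 -> 0 transition is made only while the write lock is held. A userspace
 * sketch with C11 atomics follows. ex_add_unless(), ex_big_lock and
 * ex_slow_path_taken are hypothetical; a spinning atomic_flag stands in for
 * br_write_lock()/br_write_unlock(), which it does not otherwise resemble.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Add a to *v unless *v == u. Returns true if the addition was performed. */
static bool ex_add_unless(atomic_int *v, int a, int u)
{
	int cur;

	assert(v != NULL);
	cur = atomic_load(v);
	while (cur != u) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

static atomic_flag ex_big_lock = ATOMIC_FLAG_INIT;
static int ex_slow_path_taken;	/* instrumentation for the sketch */

static void ex_make_shortterm(atomic_int *longterm)
{
	/* Fast path: any decrement that does not start from 1. */
	if (ex_add_unless(longterm, -1, 1))
		return;

	/* Slow path: drop the last long-term reference under the lock. */
	while (atomic_flag_test_and_set(&ex_big_lock))
		;
	atomic_fetch_sub(longterm, 1);
	ex_slow_path_taken++;
	atomic_flag_clear(&ex_big_lock);
}
```

/*
 * With a counter starting at 3, the first two calls stay on the fast path
 * and only the third takes the lock.
 */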
2006-02-07 12:59:00 -08:00
/*
* Allocate a new namespace structure and populate it with contents
* copied from the passed in namespace .
*/
2007-05-08 00:25:21 -07:00
static struct mnt_namespace * dup_mnt_ns ( struct mnt_namespace * mnt_ns ,
2006-12-08 02:37:56 -08:00
struct fs_struct * fs )
2005-04-16 15:20:36 -07:00
{
2006-12-08 02:37:56 -08:00
struct mnt_namespace * new_ns ;
2008-05-10 20:44:54 -04:00
struct vfsmount * rootmnt = NULL , * pwdmnt = NULL ;
2005-04-16 15:20:36 -07:00
struct vfsmount * p , * q ;
2009-06-22 15:09:13 -04:00
new_ns = alloc_mnt_ns ( ) ;
if ( IS_ERR ( new_ns ) )
return new_ns ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:17:51 -05:00
down_write ( & namespace_sem ) ;
2005-04-16 15:20:36 -07:00
/* First pass: copy the tree topology */
2006-12-08 02:37:56 -08:00
new_ns - > root = copy_tree ( mnt_ns - > root , mnt_ns - > root - > mnt_root ,
2005-11-07 17:21:20 -05:00
CL_COPY_ALL | CL_EXPIRE ) ;
2005-04-16 15:20:36 -07:00
if ( ! new_ns - > root ) {
2005-11-07 17:17:51 -05:00
up_write ( & namespace_sem ) ;
2005-04-16 15:20:36 -07:00
kfree ( new_ns ) ;
2008-12-01 14:34:51 -08:00
return ERR_PTR ( - ENOMEM ) ;
2005-04-16 15:20:36 -07:00
}
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
list_add_tail ( & new_ns - > list , & new_ns - > root - > mnt_list ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-04-16 15:20:36 -07:00
/*
* Second pass : switch the tsk - > fs - > * elements and mark new vfsmounts
* as belonging to new namespace . We have already acquired a private
* fs_struct , so tsk - > fs - > lock is not needed .
*/
2006-12-08 02:37:56 -08:00
p = mnt_ns - > root ;
2005-04-16 15:20:36 -07:00
q = new_ns - > root ;
while ( p ) {
2006-12-08 02:37:56 -08:00
q - > mnt_ns = new_ns ;
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( q ) ;
2005-04-16 15:20:36 -07:00
if ( fs ) {
2008-02-14 19:34:38 -08:00
if ( p = = fs - > root . mnt ) {
2011-01-14 22:30:21 -05:00
fs - > root . mnt = mntget ( q ) ;
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( q ) ;
2011-01-14 22:30:21 -05:00
mnt_make_shortterm ( p ) ;
2005-04-16 15:20:36 -07:00
rootmnt = p ;
}
2008-02-14 19:34:38 -08:00
if ( p = = fs - > pwd . mnt ) {
2011-01-14 22:30:21 -05:00
fs - > pwd . mnt = mntget ( q ) ;
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( q ) ;
2011-01-14 22:30:21 -05:00
mnt_make_shortterm ( p ) ;
2005-04-16 15:20:36 -07:00
pwdmnt = p ;
}
}
2006-12-08 02:37:56 -08:00
p = next_mnt ( p , mnt_ns - > root ) ;
2005-04-16 15:20:36 -07:00
q = next_mnt ( q , new_ns - > root ) ;
}
2005-11-07 17:17:51 -05:00
up_write ( & namespace_sem ) ;
2005-04-16 15:20:36 -07:00
if ( rootmnt )
2011-01-14 22:30:21 -05:00
mntput ( rootmnt ) ;
2005-04-16 15:20:36 -07:00
if ( pwdmnt )
2011-01-14 22:30:21 -05:00
mntput ( pwdmnt ) ;
2005-04-16 15:20:36 -07:00
2006-02-07 12:59:00 -08:00
return new_ns ;
}
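/*
 * Illustration only, not part of the kernel source. The second pass of
 * dup_mnt_ns() walks the original tree and its copy in lockstep, so that a
 * pointer into the old tree can be replaced by the node at the same
 * position in the new one. A userspace sketch follows; struct ex_mnt,
 * ex_next() and ex_remap() are hypothetical, and the sketch assumes both
 * trees have exactly the same shape.
 */

```c
#include <assert.h>
#include <stddef.h>

struct ex_mnt {
	struct ex_mnt *parent;
	struct ex_mnt *first_child;
	struct ex_mnt *next_sibling;
	int ns;		/* stand-in for mnt_ns */
};

/* Link child at the head of parent's child list. */
static void ex_add_child(struct ex_mnt *parent, struct ex_mnt *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/* Pre-order successor of p within the tree rooted at root, or NULL. */
static struct ex_mnt *ex_next(struct ex_mnt *p, struct ex_mnt *root)
{
	if (p->first_child)
		return p->first_child;
	while (p != root) {
		if (p->next_sibling)
			return p->next_sibling;
		p = p->parent;
	}
	return NULL;
}

/*
 * Tag every node of the copy with new_ns and return the node of the copy
 * that corresponds to target in the original (NULL if target is not found).
 */
static struct ex_mnt *ex_remap(struct ex_mnt *old_root, struct ex_mnt *new_root,
			       struct ex_mnt *target, int new_ns)
{
	struct ex_mnt *p = old_root, *q = new_root, *hit = NULL;

	while (p) {
		assert(q != NULL);	/* the trees must have the same shape */
		q->ns = new_ns;
		if (p == target)
			hit = q;
		p = ex_next(p, old_root);
		q = ex_next(q, new_root);
	}
	return hit;
}
```

/*
 * The same walk serves both purposes at once: every copied node is tagged,
 * and the root/pwd-style pointers are remapped as they are encountered.
 */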
2007-07-15 23:41:15 -07:00
struct mnt_namespace * copy_mnt_ns ( unsigned long flags , struct mnt_namespace * ns ,
2007-05-08 00:25:21 -07:00
struct fs_struct * new_fs )
2006-02-07 12:59:00 -08:00
{
2006-12-08 02:37:56 -08:00
struct mnt_namespace * new_ns ;
2006-02-07 12:59:00 -08:00
2007-05-08 00:25:21 -07:00
BUG_ON ( ! ns ) ;
2006-12-08 02:37:56 -08:00
get_mnt_ns ( ns ) ;
2006-02-07 12:59:00 -08:00
if ( ! ( flags & CLONE_NEWNS ) )
2007-05-08 00:25:21 -07:00
return ns ;
2006-02-07 12:59:00 -08:00
2007-05-08 00:25:21 -07:00
new_ns = dup_mnt_ns ( ns , new_fs ) ;
2006-02-07 12:59:00 -08:00
2006-12-08 02:37:56 -08:00
put_mnt_ns ( ns ) ;
2007-05-08 00:25:21 -07:00
return new_ns ;
2005-04-16 15:20:36 -07:00
}
2009-06-22 15:09:13 -04:00
/**
* create_mnt_ns - creates a private namespace and adds a root filesystem
* @ mnt : pointer to the new root filesystem mountpoint
*/
2009-12-17 12:51:05 -08:00
struct mnt_namespace * create_mnt_ns ( struct vfsmount * mnt )
2009-06-22 15:09:13 -04:00
{
struct mnt_namespace * new_ns ;
new_ns = alloc_mnt_ns ( ) ;
if ( ! IS_ERR ( new_ns ) ) {
mnt - > mnt_ns = new_ns ;
2011-01-16 16:32:11 -05:00
__mnt_make_longterm ( mnt ) ;
2009-06-22 15:09:13 -04:00
new_ns - > root = mnt ;
list_add ( & new_ns - > list , & new_ns - > root - > mnt_list ) ;
}
return new_ns ;
}
2009-12-17 12:51:05 -08:00
EXPORT_SYMBOL ( create_mnt_ns ) ;
2009-06-22 15:09:13 -04:00
2009-01-14 14:14:12 +01:00
SYSCALL_DEFINE5 ( mount , char __user * , dev_name , char __user * , dir_name ,
char __user * , type , unsigned long , flags , void __user * , data )
2005-04-16 15:20:36 -07:00
{
2009-09-18 13:05:45 -07:00
int ret ;
char * kernel_type ;
char * kernel_dir ;
char * kernel_dev ;
2005-04-16 15:20:36 -07:00
unsigned long data_page ;
2009-09-18 13:05:45 -07:00
ret = copy_mount_string ( type , & kernel_type ) ;
if ( ret < 0 )
goto out_type ;
2005-04-16 15:20:36 -07:00
2009-09-18 13:05:45 -07:00
kernel_dir = getname ( dir_name ) ;
if ( IS_ERR ( kernel_dir ) ) {
ret = PTR_ERR ( kernel_dir ) ;
goto out_dir ;
}
2005-04-16 15:20:36 -07:00
2009-09-18 13:05:45 -07:00
ret = copy_mount_string ( dev_name , & kernel_dev ) ;
if ( ret < 0 )
goto out_dev ;
2005-04-16 15:20:36 -07:00
2009-09-18 13:05:45 -07:00
ret = copy_mount_options ( data , & data_page ) ;
if ( ret < 0 )
goto out_data ;
2005-04-16 15:20:36 -07:00
2009-09-18 13:05:45 -07:00
ret = do_mount ( kernel_dev , kernel_dir , kernel_type , flags ,
( void * ) data_page ) ;
2005-04-16 15:20:36 -07:00
2009-09-18 13:05:45 -07:00
free_page ( data_page ) ;
out_data :
kfree ( kernel_dev ) ;
out_dev :
putname ( kernel_dir ) ;
out_dir :
kfree ( kernel_type ) ;
out_type :
return ret ;
2005-04-16 15:20:36 -07:00
}
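/*
 * Illustration only, not part of the kernel source. The mount syscall
 * releases its resources through a ladder of labels, each named after the
 * step that failed and falling through to free everything acquired before
 * it. A userspace sketch of that ladder follows; ex_alloc(), ex_free(),
 * ex_live_allocs and the fail_at parameter are hypothetical and exist only
 * to make the unwinding observable.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int ex_live_allocs;	/* outstanding allocations, for checking */

/* Allocate a small buffer, or fail on request. */
static void *ex_alloc(int fail)
{
	void *p = fail ? NULL : malloc(16);

	if (p)
		ex_live_allocs++;
	return p;
}

static void ex_free(void *p)
{
	if (!p)
		return;
	assert(ex_live_allocs > 0);
	ex_live_allocs--;
	free(p);
}

/* Acquire three resources in order; fail_at (1..3) simulates a failure. */
static int ex_three_step(int fail_at)
{
	void *a, *b, *c;
	int ret = -ENOMEM;

	a = ex_alloc(fail_at == 1);
	if (!a)
		goto out_a;
	b = ex_alloc(fail_at == 2);
	if (!b)
		goto out_b;
	c = ex_alloc(fail_at == 3);
	if (!c)
		goto out_c;

	ret = 0;	/* the real work would happen here */

	ex_free(c);
out_c:
	ex_free(b);
out_b:
	ex_free(a);
out_a:
	return ret;
}
```

/*
 * The success path runs through the same labels as the failure paths, so
 * each resource has exactly one release site.
 */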
/*
* pivot_root Semantics :
* Moves the root file system of the current process to the directory put_old ,
* makes new_root the new root file system of the current process , and sets
* root / cwd of all processes which had them on the current root to new_root .
*
* Restrictions :
* The new_root and put_old must be directories , and must not be on the
* same file system as the current process root . The put_old must be
* underneath new_root , i . e . adding a non - zero number of / . . to the string
* pointed to by put_old must yield the same directory as new_root . No other
* file system may be mounted on put_old . After all , new_root is a mountpoint .
*
2006-01-08 01:03:18 -08:00
* Also , the current root cannot be on the ' rootfs ' ( initial ramfs ) filesystem .
* See Documentation / filesystems / ramfs - rootfs - initramfs . txt for alternatives
* in this situation .
*
2005-04-16 15:20:36 -07:00
* Notes :
* - we don ' t move root / cwd if they are not at the root ( reason : if something
* cared enough to change them , it ' s probably wrong to force them elsewhere )
* - it ' s okay to pick a root that isn ' t the root of a file system , e . g .
* / nfs / my_root where / nfs is the mount point . It must be a mountpoint ,
* though , so you may need to say mount - - bind / nfs / my_root / nfs / my_root
* first .
*/
2009-01-14 14:14:16 +01:00
SYSCALL_DEFINE2 ( pivot_root , const char __user * , new_root ,
const char __user * , put_old )
2005-04-16 15:20:36 -07:00
{
struct vfsmount * tmp ;
2008-07-22 09:59:21 -04:00
struct path new , old , parent_path , root_parent , root ;
2005-04-16 15:20:36 -07:00
int error ;
if ( ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
2008-07-22 09:59:21 -04:00
error = user_path_dir ( new_root , & new ) ;
2005-04-16 15:20:36 -07:00
if ( error )
goto out0 ;
2008-07-22 09:59:21 -04:00
error = user_path_dir ( put_old , & old ) ;
2005-04-16 15:20:36 -07:00
if ( error )
goto out1 ;
2008-07-22 09:59:21 -04:00
error = security_sb_pivotroot ( & old , & new ) ;
2011-03-18 08:55:38 -04:00
if ( error )
goto out2 ;
2005-04-16 15:20:36 -07:00
2010-08-10 11:41:36 +02:00
get_fs_root ( current - > fs , & root ) ;
2011-03-18 08:55:38 -04:00
error = lock_mount ( & old ) ;
if ( error )
goto out3 ;
2005-04-16 15:20:36 -07:00
error = - EINVAL ;
2008-07-22 09:59:21 -04:00
if ( IS_MNT_SHARED ( old . mnt ) | |
IS_MNT_SHARED ( new . mnt - > mnt_parent ) | |
2008-03-22 18:00:39 -04:00
IS_MNT_SHARED ( root . mnt - > mnt_parent ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2011-03-18 08:29:36 -04:00
if ( ! check_mnt ( root . mnt ) | | ! check_mnt ( new . mnt ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2005-04-16 15:20:36 -07:00
error = - ENOENT ;
2009-05-04 03:32:03 +04:00
if ( d_unlinked ( new . dentry ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2009-05-04 03:32:03 +04:00
if ( d_unlinked ( old . dentry ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2005-04-16 15:20:36 -07:00
error = - EBUSY ;
2008-07-22 09:59:21 -04:00
if ( new . mnt = = root . mnt | |
old . mnt = = root . mnt )
2011-03-18 08:55:38 -04:00
goto out4 ; /* loop, on the same file system */
2005-04-16 15:20:36 -07:00
error = - EINVAL ;
2008-03-22 18:00:39 -04:00
if ( root . mnt - > mnt_root ! = root . dentry )
2011-03-18 08:55:38 -04:00
goto out4 ; /* not a mountpoint */
2008-03-22 18:00:39 -04:00
if ( root . mnt - > mnt_parent = = root . mnt )
2011-03-18 08:55:38 -04:00
goto out4 ; /* not attached */
2008-07-22 09:59:21 -04:00
if ( new . mnt - > mnt_root ! = new . dentry )
2011-03-18 08:55:38 -04:00
goto out4 ; /* not a mountpoint */
2008-07-22 09:59:21 -04:00
if ( new . mnt - > mnt_parent = = new . mnt )
2011-03-18 08:55:38 -04:00
goto out4 ; /* not attached */
2008-02-14 19:34:32 -08:00
/* make sure we can reach put_old from new_root */
2008-07-22 09:59:21 -04:00
tmp = old . mnt ;
if ( tmp ! = new . mnt ) {
2005-04-16 15:20:36 -07:00
for ( ; ; ) {
if ( tmp - > mnt_parent = = tmp )
2011-03-18 08:55:38 -04:00
goto out4 ; /* already mounted on put_old */
2008-07-22 09:59:21 -04:00
if ( tmp - > mnt_parent = = new . mnt )
2005-04-16 15:20:36 -07:00
break ;
tmp = tmp - > mnt_parent ;
}
2008-07-22 09:59:21 -04:00
if ( ! is_subdir ( tmp - > mnt_mountpoint , new . dentry ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2008-07-22 09:59:21 -04:00
} else if ( ! is_subdir ( old . dentry , new . dentry ) )
2011-03-18 08:55:38 -04:00
goto out4 ;
2011-03-18 08:29:36 -04:00
br_write_lock ( vfsmount_lock ) ;
2008-07-22 09:59:21 -04:00
detach_mnt ( new . mnt , & parent_path ) ;
2008-03-22 18:00:39 -04:00
detach_mnt ( root . mnt , & root_parent ) ;
2008-02-14 19:34:32 -08:00
/* mount old root on put_old */
2008-07-22 09:59:21 -04:00
attach_mnt ( root . mnt , & old ) ;
2008-02-14 19:34:32 -08:00
/* mount new_root on / */
2008-07-22 09:59:21 -04:00
attach_mnt ( new . mnt , & root_parent ) ;
2006-12-08 02:37:56 -08:00
touch_mnt_namespace ( current - > nsproxy - > mnt_ns ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2008-07-22 09:59:21 -04:00
chroot_fs_refs ( & root , & new ) ;
2005-04-16 15:20:36 -07:00
error = 0 ;
2011-03-18 08:55:38 -04:00
out4 :
unlock_mount ( & old ) ;
if ( ! error ) {
path_put ( & root_parent ) ;
path_put ( & parent_path ) ;
}
out3 :
2008-03-22 18:00:39 -04:00
path_put ( & root ) ;
2011-03-18 08:55:38 -04:00
out2 :
2008-07-22 09:59:21 -04:00
path_put ( & old ) ;
2005-04-16 15:20:36 -07:00
out1 :
2008-07-22 09:59:21 -04:00
path_put ( & new ) ;
2005-04-16 15:20:36 -07:00
out0 :
return error ;
}
static void __init init_mount_tree ( void )
{
struct vfsmount * mnt ;
2006-12-08 02:37:56 -08:00
struct mnt_namespace * ns ;
2008-02-14 19:34:39 -08:00
struct path root ;
2005-04-16 15:20:36 -07:00
mnt = do_kern_mount ( " rootfs " , 0 , " rootfs " , NULL ) ;
if ( IS_ERR ( mnt ) )
panic ( " Can't create rootfs " ) ;
fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and these lookups often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only take the global sum when the local count reaches
0 (this is difficult for vfsmounts, because we can't hold preempt off for the
life of a reference, so a counter would need to be per-thread or tied strongly
to a particular CPU, which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
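The third scheme above can be illustrated with a small single-threaded sketch. This is not the kernel implementation: the names `mnt_ref`, `ref_get`, `ref_put`, `ref_get_long`, `ref_put_long` and the fixed `NR_CPUS` value are hypothetical, the per-CPU slots are modelled as a plain array, and all locking and atomicity that real code would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical fixed CPU count for this sketch */

struct mnt_ref {
	int longrefs;		/* slow, long-lasting references */
	int delta[NR_CPUS];	/* per-CPU increments minus decrements */
};

/* Sum the per-CPU differences; this is the expensive global operation. */
static int ref_sum(const struct mnt_ref *m)
{
	int cpu, sum = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += m->delta[cpu];
	return sum;
}

/* Take a short reference: only the calling CPU's slot is touched. */
static void ref_get(struct mnt_ref *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->delta[cpu]++;
}

/*
 * Drop a reference, possibly on a different CPU than the one that took it.
 * Returns true when no references remain. The global sum is only taken
 * when no long references are outstanding.
 */
static bool ref_put(struct mnt_ref *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->delta[cpu]--;
	if (m->longrefs > 0)
		return false;	/* fast path: a long reference pins the object */
	return ref_sum(m) == 0;
}

/* Take a long reference: counted per-CPU and in the long refcount. */
static void ref_get_long(struct mnt_ref *m, int cpu)
{
	ref_get(m, cpu);
	m->longrefs++;
}

/* Drop a long reference; may fall through to the global sum. */
static bool ref_put_long(struct mnt_ref *m, int cpu)
{
	m->longrefs--;
	return ref_put(m, cpu);
}
```

An individual per-CPU slot may go negative when a reference is taken on one CPU and dropped on another; only the sum over all slots is meaningful, which is why the zero check has to visit every CPU and is worth skipping while a long reference is held.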
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:11 +11:00
2009-06-23 17:29:49 -04:00
ns = create_mnt_ns ( mnt ) ;
if ( IS_ERR ( ns ) )
2005-04-16 15:20:36 -07:00
panic ( " Can't allocate initial namespace " ) ;
2006-12-08 02:37:56 -08:00
init_task . nsproxy - > mnt_ns = ns ;
get_mnt_ns ( ns ) ;
2008-02-14 19:34:39 -08:00
root . mnt = ns - > root ;
root . dentry = ns - > root - > mnt_root ;
set_fs_pwd ( current - > fs , & root ) ;
set_fs_root ( current - > fs , & root ) ;
2005-04-16 15:20:36 -07:00
}
2007-10-16 23:26:30 -07:00
void __init mnt_init ( void )
2005-04-16 15:20:36 -07:00
{
2008-02-06 01:37:57 -08:00
unsigned u ;
2006-09-29 01:58:57 -07:00
int err ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:17:51 -05:00
init_rwsem ( & namespace_sem ) ;
2005-04-16 15:20:36 -07:00
mnt_cache = kmem_cache_create ( " mnt_cache " , sizeof ( struct vfsmount ) ,
2007-07-20 10:11:58 +09:00
0 , SLAB_HWCACHE_ALIGN | SLAB_PANIC , NULL ) ;
2005-04-16 15:20:36 -07:00
2005-11-07 17:16:09 -05:00
mount_hashtable = ( struct list_head * ) __get_free_page ( GFP_ATOMIC ) ;
2005-04-16 15:20:36 -07:00
if ( ! mount_hashtable )
panic ( " Failed to allocate mount hash table \n " ) ;
2011-03-22 16:33:54 -07:00
printk ( KERN_INFO " Mount-cache hash table entries: %lu \n " , HASH_SIZE ) ;
2008-02-06 01:37:57 -08:00
for ( u = 0 ; u < HASH_SIZE ; u + + )
INIT_LIST_HEAD ( & mount_hashtable [ u ] ) ;
2005-04-16 15:20:36 -07:00
2010-08-18 04:37:39 +10:00
br_lock_init ( vfsmount_lock ) ;
2006-09-29 01:58:57 -07:00
err = sysfs_init ( ) ;
if ( err )
printk ( KERN_WARNING " %s: sysfs_init error: %d \n " ,
2008-04-30 00:55:09 -07:00
__func__ , err ) ;
2007-10-29 14:17:23 -06:00
fs_kobj = kobject_create_and_add ( " fs " , NULL ) ;
if ( ! fs_kobj )
2008-04-30 00:55:09 -07:00
printk ( KERN_WARNING " %s: kobj create error \n " , __func__ ) ;
2005-04-16 15:20:36 -07:00
init_rootfs ( ) ;
init_mount_tree ( ) ;
}
2009-06-22 15:09:13 -04:00
void put_mnt_ns ( struct mnt_namespace * ns )
2005-04-16 15:20:36 -07:00
{
2005-11-07 17:17:04 -05:00
LIST_HEAD ( umount_list ) ;
2009-06-22 15:09:13 -04:00
2010-02-05 02:21:06 -05:00
if ( ! atomic_dec_and_test ( & ns - > count ) )
2009-06-22 15:09:13 -04:00
return ;
2005-11-07 17:17:51 -05:00
down_write ( & namespace_sem ) ;
2010-08-18 04:37:39 +10:00
br_write_lock ( vfsmount_lock ) ;
2010-02-05 02:21:06 -05:00
umount_tree ( ns - > root , 0 , & umount_list ) ;
2010-08-18 04:37:39 +10:00
br_write_unlock ( vfsmount_lock ) ;
2005-11-07 17:17:51 -05:00
up_write ( & namespace_sem ) ;
2005-11-07 17:17:04 -05:00
release_mounts ( & umount_list ) ;
2006-12-08 02:37:56 -08:00
kfree ( ns ) ;
2005-04-16 15:20:36 -07:00
}
2009-06-22 15:09:13 -04:00
EXPORT_SYMBOL ( put_mnt_ns ) ;
2011-03-17 22:08:28 -04:00
struct vfsmount * kern_mount_data ( struct file_system_type * type , void * data )
{
2011-07-19 09:32:38 -07:00
	struct vfsmount *mnt;
	mnt = vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
	if (!IS_ERR(mnt)) {
		/*
		 * it is a longterm mount, don't release mnt until
		 * we unmount before file sys is unregistered
		 */
		mnt_make_longterm(mnt);
	}
	return mnt;
2011-03-17 22:08:28 -04:00
}
EXPORT_SYMBOL_GPL ( kern_mount_data ) ;
2011-07-19 09:32:38 -07:00
void kern_unmount(struct vfsmount *mnt)
{
	/* release long term mount so mount point can be released */
	if (!IS_ERR_OR_NULL(mnt)) {
		mnt_make_shortterm(mnt);
		mntput(mnt);
	}
}
EXPORT_SYMBOL(kern_unmount);