2005-04-17 02:20:36 +04:00
/*
 *  linux/fs/filesystems.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
 *  table of configured filesystems
 */
#include <linux/syscalls.h>
#include <linux/fs.h>
2008-10-04 14:08:37 +04:00
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
2005-04-17 02:20:36 +04:00
#include <linux/kmod.h>
#include <linux/init.h>
#include <linux/module.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
#include <linux/slab.h>
2005-04-17 02:20:36 +04:00
#include <asm/uaccess.h>
/*
 * Handling of filesystem drivers list.
 * Rules:
 *	Inclusion to/removals from/scanning of list are protected by spinlock.
 *	During the unload module must call unregister_filesystem().
 *	We can access the fields of list element if:
 *		1) spinlock is held or
 *		2) we hold the reference to the module.
 *	The latter can be guaranteed by call of try_module_get(); if it
 *	returned 0 we must skip the element, otherwise we got the reference.
 *	Once the reference is obtained we can drop the spinlock.
 */
static struct file_system_type *file_systems;
static DEFINE_RWLOCK(file_systems_lock);
/* WARNING: This can be used only if we _already_ own a reference */
void get_filesystem(struct file_system_type *fs)
{
	__module_get(fs->owner);
}
void put_filesystem(struct file_system_type *fs)
{
	module_put(fs->owner);
}
add filesystem subtype support
There's a slight problem with filesystem type representation in fuse
based filesystems.
From the kernel's view, there are just two filesystem types: fuse and
fuseblk. From the user's view there are lots of different filesystem
types. The user is not even much concerned if the filesystem is fuse based
or not. So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.
The current scheme is to encode the real filesystem type in the mount
source. So an sshfs mount looks like this:
sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...
This url-ish syntax works OK for sshfs and similar filesystems. However
for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
the kernel expects the mount source to be a real device name.
A possibly better scheme would be to encode the real type in the type
field as "type.subtype". So fuse mounts would look like this:
/dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...
This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
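The "type.subtype" convention described in this message can be illustrated with a small userspace sketch. The helper names `base_type_len` and `subtype_of` are hypothetical and do not exist in the kernel; the sketch only assumes that the first dot separates the registered type from the subtype, as the message proposes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: length of the base type in "type.subtype".
 * When there is no dot, the whole string is the type.
 */
static size_t base_type_len(const char *name)
{
	const char *dot;

	assert(name != NULL);
	dot = strchr(name, '.');
	return dot ? (size_t)(dot - name) : strlen(name);
}

/*
 * Hypothetical helper: pointer to the subtype after the first dot,
 * or NULL when the name carries no subtype.
 */
static const char *subtype_of(const char *name)
{
	const char *dot;

	assert(name != NULL);
	dot = strchr(name, '.');
	return dot ? dot + 1 : NULL;
}
```

With these helpers, "fuseblk.ntfs-3g" yields a base type of length 7 ("fuseblk") and the subtype "ntfs-3g", while a plain name such as "ext3" yields its full length and no subtype.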
2007-05-08 11:25:43 +04:00
static struct file_system_type **find_filesystem(const char *name, unsigned len)
2005-04-17 02:20:36 +04:00
{
	struct file_system_type **p;
	for (p = &file_systems; *p; p = &(*p)->next)
2007-05-08 11:25:43 +04:00
		if (strlen((*p)->name) == len &&
		    strncmp((*p)->name, name, len) == 0)
2005-04-17 02:20:36 +04:00
			break;
	return p;
}
/**
 *	register_filesystem - register a new filesystem
 *	@fs: the file system structure
 *
 *	Adds the file system passed to the list of file systems the kernel
 *	is aware of for mount and other syscalls. Returns 0 on success,
 *	or a negative errno code on an error.
 *
 *	The &struct file_system_type that is passed is linked into the kernel
 *	structures and must not be freed until the file system has been
 *	unregistered.
 */
int register_filesystem(struct file_system_type *fs)
{
	int res = 0;
	struct file_system_type **p;
2007-05-08 11:25:43 +04:00
	BUG_ON(strchr(fs->name, '.'));
2005-04-17 02:20:36 +04:00
	if (fs->next)
		return -EBUSY;
	write_lock(&file_systems_lock);
2007-05-08 11:25:43 +04:00
	p = find_filesystem(fs->name, strlen(fs->name));
2005-04-17 02:20:36 +04:00
	if (*p)
		res = -EBUSY;
	else
		*p = fs;
	write_unlock(&file_systems_lock);
	return res;
}
EXPORT_SYMBOL(register_filesystem);
/**
 *	unregister_filesystem - unregister a file system
 *	@fs: filesystem to unregister
 *
 *	Remove a file system that was previously successfully registered
 *	with the kernel. An error is returned if the file system is not found.
 *	Zero is returned on a success.
 *
 *	Once this function has returned the &struct file_system_type structure
 *	may be freed or reused.
 */
int unregister_filesystem(struct file_system_type *fs)
{
	struct file_system_type **tmp;
	write_lock(&file_systems_lock);
	tmp = &file_systems;
	while (*tmp) {
		if (fs == *tmp) {
			*tmp = fs->next;
			fs->next = NULL;
			write_unlock(&file_systems_lock);
2011-04-14 19:30:08 +04:00
			synchronize_rcu();
2005-04-17 02:20:36 +04:00
			return 0;
		}
		tmp = &(*tmp)->next;
	}
	write_unlock(&file_systems_lock);
fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
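The per-dentry seqlock idea in this message (load a tuple, then check whether any member changed) can be sketched in userspace with a plain sequence counter. This is a minimal sketch under stated assumptions, not the kernel's seqcount API: `struct entry` and the `entry_*` functions are hypothetical names, a single writer is assumed, and all accesses use sequentially consistent C11 atomics for simplicity rather than the weaker barriers a real implementation would choose.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the (parent, inode) tuple of a dentry. */
struct entry {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_int parent;
	atomic_int inode;
};

/* Writer side. Assumes writers are serialized by the caller. */
static void entry_update(struct entry *e, int parent, int inode)
{
	/* No other update may be in flight when we start. */
	assert((atomic_load(&e->seq) & 1u) == 0);
	atomic_fetch_add(&e->seq, 1);		/* becomes odd */
	atomic_store(&e->parent, parent);
	atomic_store(&e->inode, inode);
	atomic_fetch_add(&e->seq, 1);		/* becomes even again */
}

/* Reader side: wait for an even sequence value and return it. */
static unsigned entry_read_begin(struct entry *e)
{
	unsigned s;

	while ((s = atomic_load(&e->seq)) & 1u)
		;	/* an update is in progress, spin */
	return s;
}

/* True if the sequence moved since entry_read_begin(), so retry. */
static bool entry_read_retry(struct entry *e, unsigned start)
{
	return atomic_load(&e->seq) != start;
}

/* Load a consistent (parent, inode) pair without taking a lock. */
static void entry_snapshot(struct entry *e, int *parent, int *inode)
{
	unsigned s;

	do {
		s = entry_read_begin(e);
		*parent = atomic_load(&e->parent);
		*inode = atomic_load(&e->inode);
	} while (entry_read_retry(e, s));
}
```

A reader that sees the sequence change between begin and retry discards what it loaded and tries again, which roughly mirrors how the message describes rechecking the parent sequence after a child lookup.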
2011-01-07 09:49:52 +03:00
2005-04-17 02:20:36 +04:00
	return -EINVAL;
}
EXPORT_SYMBOL(unregister_filesystem);
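The register/unregister pair above relies on walking the list through a pointer to the `next` slot, so that insertion at the tail and removal from the middle need no special cases. A minimal userspace sketch of that idiom follows. The names `struct fs_type`, `fs_list`, `find_slot`, `register_fs` and `unregister_fs` are hypothetical, and the sketch deliberately omits the locking, module reference counting and `synchronize_rcu()` call that the kernel code needs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for struct file_system_type. */
struct fs_type {
	const char *name;
	struct fs_type *next;
};

static struct fs_type *fs_list;	/* head of the singly linked list */

/*
 * Return the slot that points at the entry named name[0..len), or the
 * terminating NULL slot at the end of the list when there is no match.
 */
static struct fs_type **find_slot(const char *name, size_t len)
{
	struct fs_type **p;

	for (p = &fs_list; *p; p = &(*p)->next)
		if (strlen((*p)->name) == len &&
		    strncmp((*p)->name, name, len) == 0)
			break;
	return p;
}

/* Append fs unless it is already linked or its name is taken. */
static int register_fs(struct fs_type *fs)
{
	struct fs_type **p;

	assert(fs != NULL && fs->name != NULL);
	if (fs->next)
		return -EBUSY;
	p = find_slot(fs->name, strlen(fs->name));
	if (*p)
		return -EBUSY;
	*p = fs;	/* p is the tail slot here, so this appends */
	return 0;
}

/* Unlink fs by rewriting whichever slot currently points at it. */
static int unregister_fs(struct fs_type *fs)
{
	struct fs_type **p;

	for (p = &fs_list; *p; p = &(*p)->next) {
		if (*p == fs) {
			*p = fs->next;
			fs->next = NULL;
			return 0;
		}
	}
	return -EINVAL;
}
```

Because `find_slot` returns the address of a slot rather than an entry, the same return value serves both as a "found" indicator (`*p != NULL`) and as the insertion point when nothing was found.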
static int fs_index(const char __user *__name)
{
	struct file_system_type *tmp;
2012-10-10 23:25:28 +04:00
	struct filename *name;
2005-04-17 02:20:36 +04:00
	int err, index;
	name = getname(__name);
	err = PTR_ERR(name);
	if (IS_ERR(name))
		return err;
	err = -EINVAL;
	read_lock(&file_systems_lock);
	for (tmp = file_systems, index = 0; tmp; tmp = tmp->next, index++) {
2012-10-10 23:25:28 +04:00
		if (strcmp(tmp->name, name->name) == 0) {
2005-04-17 02:20:36 +04:00
			err = index;
			break;
		}
	}
	read_unlock(&file_systems_lock);
	putname(name);
	return err;
}
static int fs_name(unsigned int index, char __user *buf)
{
	struct file_system_type *tmp;
	int len, res;
	read_lock(&file_systems_lock);
	for (tmp = file_systems; tmp; tmp = tmp->next, index--)
		if (index <= 0 && try_module_get(tmp->owner))
			break;
	read_unlock(&file_systems_lock);
	if (!tmp)
		return -EINVAL;
	/* OK, we got the reference, so we can safely block */
	len = strlen(tmp->name) + 1;
	res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
	put_filesystem(tmp);
	return res;
}
static int fs_maxindex(void)
{
	struct file_system_type *tmp;
	int index;
	read_lock(&file_systems_lock);
	for (tmp = file_systems, index = 0; tmp; tmp = tmp->next, index++)
		;
	read_unlock(&file_systems_lock);
	return index;
}
/*
 * Whee.. Weird sysv syscall.
 */
2009-01-14 16:14:29 +03:00
SYSCALL_DEFINE3(sysfs, int, option, unsigned long, arg1, unsigned long, arg2)
2005-04-17 02:20:36 +04:00
{
	int retval = -EINVAL;
	switch (option) {
		case 1:
			retval = fs_index((const char __user *) arg1);
			break;
		case 2:
			retval = fs_name(arg1, (char __user *) arg2);
			break;
		case 3:
			retval = fs_maxindex();
			break;
	}
	return retval;
}
2009-04-09 15:17:52 +04:00
int __init get_filesystem_list(char *buf)
2005-04-17 02:20:36 +04:00
{
	int len = 0;
	struct file_system_type *tmp;
	read_lock(&file_systems_lock);
	tmp = file_systems;
	while (tmp && len < PAGE_SIZE - 80) {
		len += sprintf(buf + len, "%s\t%s\n",
			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
			tmp->name);
		tmp = tmp->next;
	}
	read_unlock(&file_systems_lock);
	return len;
}
2008-10-04 14:08:37 +04:00
#ifdef CONFIG_PROC_FS
static int filesystems_proc_show(struct seq_file *m, void *v)
{
	struct file_system_type *tmp;
	read_lock(&file_systems_lock);
	tmp = file_systems;
	while (tmp) {
		seq_printf(m, "%s\t%s\n",
			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
			tmp->name);
		tmp = tmp->next;
	}
	read_unlock(&file_systems_lock);
	return 0;
}
static int filesystems_proc_open(struct inode *inode, struct file *file)
{
	return single_open(file, filesystems_proc_show, NULL);
}
static const struct file_operations filesystems_proc_fops = {
	.open		= filesystems_proc_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};
static int __init proc_filesystems_init(void)
{
	proc_create("filesystems", 0, NULL, &filesystems_proc_fops);
	return 0;
}
module_init(proc_filesystems_init);
#endif
2008-12-25 08:32:15 +03:00
static struct file_system_type *__get_fs_type(const char *name, int len)
2005-04-17 02:20:36 +04:00
{
	struct file_system_type *fs;
	read_lock(&file_systems_lock);
2007-05-08 11:25:43 +04:00
	fs = *(find_filesystem(name, len));
2005-04-17 02:20:36 +04:00
	if (fs && !try_module_get(fs->owner))
		fs = NULL;
	read_unlock(&file_systems_lock);
2008-12-25 08:32:15 +03:00
	return fs;
}
struct file_system_type *get_fs_type(const char *name)
{
	struct file_system_type *fs;
	const char *dot = strchr(name, '.');
	int len = dot ? dot - name : strlen(name);
	fs = __get_fs_type(name, len);
	if (!fs && (request_module("%.*s", len, name) == 0))
		fs = __get_fs_type(name, len);
2007-05-08 11:25:43 +04:00
	if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) {
		put_filesystem(fs);
		fs = NULL;
	}
2005-04-17 02:20:36 +04:00
	return fs;
}
EXPORT_SYMBOL(get_fs_type);