linux/fs
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
..
9p fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
adfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
affs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
afs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
autofs4 fs: dcache remove dcache_lock 2011-01-07 17:50:23 +11:00
befs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
bfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
btrfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
cachefiles llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
ceph fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
cifs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
coda fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
configfs fs: dcache rationalise dget variants 2011-01-07 17:50:24 +11:00
cramfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
debugfs convert get_sb_single() users 2010-10-29 04:16:28 -04:00
devpts convert get_sb_single() users 2010-10-29 04:16:28 -04:00
dlm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm 2010-10-22 17:33:16 -07:00
ecryptfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
efs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
exofs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
exportfs fs: dcache rationalise dget variants 2011-01-07 17:50:24 +11:00
ext2 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ext3 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ext4 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
fat fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
freevxfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
fscache Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
fuse fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
gfs2 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hfsplus fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hostfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hpfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hppfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
hugetlbfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
isofs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-10-27 20:13:18 -07:00
jbd2 jbd2: fix /proc/fs/jbd2/<dev> when using an external journal 2010-11-17 21:46:26 -05:00
jffs2 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
jfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
lockd BKL: remove extraneous #include <smp_lock.h> 2010-11-17 08:59:32 -08:00
logfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
minix fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ncpfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
nfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
nfs_common
nfsd fs: dcache scale dentry refcount 2011-01-07 17:50:21 +11:00
nilfs2 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
nls
notify fs: dcache remove dcache_lock 2011-01-07 17:50:23 +11:00
ntfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ocfs2 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
omfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
openpromfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
partitions Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block 2010-10-25 07:45:10 -07:00
proc fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
qnx4 fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
quota quota: Fix possible oops in __dquot_initialize() 2010-10-28 01:30:06 +02:00
ramfs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
reiserfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
romfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
squashfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
sysfs fs: change d_delete semantics 2011-01-07 17:50:18 +11:00
sysv fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ubifs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
udf fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
ufs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
xfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
aio.c new helper: ihold() 2010-10-25 21:26:11 -04:00
anon_inodes.c convert get_sb_pseudo() users 2010-10-29 04:16:33 -04:00
attr.c check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
bad_inode.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
binfmt_aout.c Don't dump task struct in a.out core-dumps 2010-10-14 10:57:40 -07:00
binfmt_elf_fdpic.c
binfmt_elf.c ARM: 6342/1: fix ASLR of PIE executables 2010-10-08 10:02:53 +01:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c convert get_sb_single() users 2010-10-29 04:16:28 -04:00
binfmt_script.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
binfmt_som.c
bio-integrity.c fs/bio-integrity.c: return -ENOMEM on kmalloc failure 2010-08-23 13:36:59 +02:00
bio.c bio: take care not overflow page count when mapping/copying user data 2010-11-10 14:40:43 +01:00
block_dev.c fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
buffer.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-10-26 17:58:44 -07:00
char_dev.c Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl 2010-10-22 10:52:56 -07:00
compat_binfmt_elf.c
compat_ioctl.c BKL: remove extraneous #include <smp_lock.h> 2010-11-17 08:59:32 -08:00
compat.c exec: copy-and-paste the fixes into compat_do_execve() paths 2010-11-30 17:56:38 -08:00
dcache.c fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
dcookies.c
direct-io.c fs/direct-io.c: fix truncation error in dio_complete() return 2010-10-26 16:52:13 -07:00
drop_caches.c simplify checks for I_CLEAR/I_FREEING 2010-08-09 16:47:44 -04:00
eventfd.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
eventpoll.c epoll: make epoll_wait() use the hrtimer range feature 2010-10-27 18:03:18 -07:00
exec.c install_special_mapping skips security_file_mmap check. 2010-12-15 12:30:36 -08:00
fcntl.c fasync: Fix placement of FASYNC flag comment 2010-10-27 18:17:02 -07:00
fifo.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
file_table.c fs: allow for more than 2^31 files 2010-10-26 16:52:15 -07:00
file.c vfs: use kmalloc() to allocate fdmem if possible 2010-08-11 08:59:02 -07:00
filesystems.c fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
fs_struct.c fs: fs_struct rwlock to spinlock 2010-08-18 08:35:46 -04:00
fs-writeback.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2010-10-30 09:05:48 -07:00
generic_acl.c vfs: update ctime when changing the file's permission by setfacl 2010-08-18 01:04:22 -04:00
inode.c fs: avoid inode RCU freeing for pseudo fs 2011-01-07 17:50:26 +11:00
internal.h braino in internal.h 2010-10-29 05:49:13 -04:00
ioctl.c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2010-11-19 19:46:45 -08:00
ioprio.c ioprio: grab rcu_read_lock in sys_ioprio_{set,get}() 2010-11-15 10:23:31 +01:00
Kconfig Merge 'staging-next' to Linus's tree 2010-10-28 09:44:56 -07:00
Kconfig.binfmt coredump: default CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y 2010-10-27 18:03:12 -07:00
libfs.c fs: dcache remove dcache_lock 2011-01-07 17:50:23 +11:00
locks.c fs: dcache scale dentry refcount 2011-01-07 17:50:21 +11:00
Makefile Merge 'staging-next' to Linus's tree 2010-10-28 09:44:56 -07:00
mbcache.c mbcache: Limit the maximum number of cache entries 2010-08-18 06:24:41 -04:00
mpage.c
namei.c fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
namespace.c BKL: remove extraneous #include <smp_lock.h> 2010-11-17 08:59:32 -08:00
nfsctl.c
no-block.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
open.c fix open/umount race 2010-10-29 04:14:56 -04:00
pipe.c fs: avoid inode RCU freeing for pseudo fs 2011-01-07 17:50:26 +11:00
pnode.c fs: brlock vfsmount_lock 2010-08-18 08:35:48 -04:00
pnode.h
posix_acl.c
read_write.c BKL: remove extraneous #include <smp_lock.h> 2010-11-17 08:59:32 -08:00
read_write.h
readdir.c vfs: fix warning: 'dirent' is used uninitialized in this function 2010-08-09 20:45:05 -07:00
select.c epoll: make epoll_wait() use the hrtimer range feature 2010-10-27 18:03:18 -07:00
seq_file.c fs: take dcache_lock inside __d_path 2010-10-25 21:26:12 -04:00
signalfd.c Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2010-10-26 10:13:10 -07:00
splice.c Export 'get_pipe_info()' to other users 2010-11-28 14:09:57 -08:00
stack.c
stat.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
statfs.c add f_flags to struct statfs(64) 2010-08-09 16:48:44 -04:00
super.c switch get_sb_ns() users 2010-10-29 04:17:03 -04:00
sync.c get rid of file_fsync() 2010-08-09 16:47:43 -04:00
timerfd.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
utimes.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
xattr_acl.c
xattr.c