kernel-ark/Documentation/filesystems
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
..
caching fscache: convert object to use workqueue instead of slow-work 2010-07-22 22:58:34 +02:00
configfs Documentation: make configfs example code simpler, clearer 2010-11-18 15:00:46 -08:00
nfs NFS: rename nfs.upcall -> nfs.idmap 2010-10-26 13:57:10 -04:00
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
9p.txt fs/9p: Add access = client option to opt in acl evaluation. 2010-10-28 09:08:46 -05:00
00-INDEX smbfs: move to drivers/staging 2010-10-05 09:08:21 -07:00
adfs.txt
affs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
afs.txt AFS: Documentation updates 2009-08-19 10:40:13 -07:00
autofs4-mount-control.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
automount-support.txt
befs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
bfs.txt remove mention of CONFIG_KMOD from documentation 2008-07-22 19:24:29 +10:00
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
ceph.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
cifs.txt
coda.txt
cramfs.txt
debugfs.txt Document the debugfs API 2009-06-06 10:28:14 -06:00
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking
dlmfs.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
dnotify_test.c Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
dnotify.txt Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
ecryptfs.txt eCryptfs: Move ecryptfs docs into Documentation/filesystems/ 2007-07-17 10:23:08 -07:00
exofs.txt trivial: some small fixes in exofs documentation 2009-12-10 09:59:16 +02:00
ext2.txt Doc fix: ext2 can only have 32,000 subdirs, not 32,768 2009-06-18 13:03:44 -07:00
ext3.txt ext3: make barrier options consistent with ext4 2010-05-21 19:30:41 +02:00
ext4.txt ext4: add support for lazy inode table initialization 2010-10-27 21:30:05 -04:00
fiemap.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
gfs2-glocks.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
gfs2-uevents.txt GFS2: Add a document explaining GFS2's uevents 2009-08-17 11:11:41 +01:00
gfs2.txt GFS2: docs update 2010-03-29 14:28:52 +01:00
hfs.txt
hfsplus.txt Documentation: document HFSPlus 2007-07-31 15:39:38 -07:00
hpfs.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
inotify.txt
isofs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
jfs.txt
Locking fs: dcache remove dcache_lock 2011-01-07 17:50:23 +11:00
locks.txt Documentation: move locks.txt in filesystems/ 2007-10-09 18:32:45 -04:00
logfs.txt fix "seperate" typos in comments 2010-05-10 11:56:30 +02:00
Makefile Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
mandatory-locking.txt locks: add warning about mandatory locking races 2007-10-09 18:32:45 -04:00
ncpfs.txt ncpfs: remove dead URL from documentation 2009-09-23 07:39:42 -07:00
nilfs2.txt nilfs2: add nodiscard mount option 2010-07-23 10:02:12 +09:00
ntfs.txt NTFS: update homepage 2008-09-02 19:21:37 -07:00
ocfs2.txt ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. 2010-10-11 14:14:55 -07:00
omfs.txt omfs: add filesystem documentation 2008-07-26 12:00:05 -07:00
path-lookup.txt fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
porting fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
proc.txt /proc/pid/pagemap: document in Documentation/filesystems/proc.txt 2010-10-27 18:03:13 -07:00
quota.txt quota: documentation for sending "below quota" messages via netlink and tiny doc update 2008-08-12 16:07:27 -07:00
ramfs-rootfs-initramfs.txt Trivial Documentation/filesystems/ramfs-rootfs-initramfs.txt fix 2008-11-30 11:40:56 -08:00
relay.txt relay: add buffer-only channels; useful for early logging 2008-07-26 12:00:04 -07:00
romfs.txt
seq_file.txt seq_file: use proc_create() in documentation 2009-12-16 07:20:07 -08:00
sharedsubtree.txt Documentation: Fix trivial typo in filesystems/sharedsubtree.txt 2010-10-25 21:18:21 -04:00
spufs.txt
squashfs.txt Squashfs: update Kconfig and documentation for LZO 2010-08-05 23:42:54 +01:00
sysfs-pci.txt PCI: Allow read/write access to sysfs I/O port resources 2010-07-30 09:32:08 -07:00
sysfs-tagging.txt sysfs-namespaces: add a high-level Documentation file 2010-05-21 09:37:31 -07:00
sysfs.txt sysfs: Fix one more signature discrepancy between sysfs implementation and docs. 2010-08-05 13:53:34 -07:00
sysv-fs.txt
tmpfs.txt mempolicy: document cpuset interaction with tmpfs mpol mount option 2010-05-25 08:06:57 -07:00
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt
vfat.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
vfs.txt fs: change d_hash for rcu-walk 2011-01-07 17:50:20 +11:00
xfs-delayed-logging-design.txt xfs: remove experimental tag from the delaylog option 2010-11-10 12:00:47 -06:00
xfs.txt xfs: remove obsolete osyncisosync mount option 2010-07-26 13:16:51 -05:00
xip.txt DOC: update xip method info 2008-11-12 17:17:17 -08:00