kernel-ark/fs
Davide Libenzi 6192bd536f epoll: optimizations and cleanups
Epoll is doing multiple passes over the ready set at the moment, because of
the constraints over the f_op->poll() call.  Looking at the code again, I
noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).

This is a stress application that can be used to test the new code.  It
spawns multiple threads that call epoll_wait() and epoll_ctl()
concurrently.  Stress tested on my dual Opteron 254 w/out any problems.

http://www.xmailserver.org/totalmess.c

This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:

http://www.xmailserver.org/epwbench.c

[1] Considering that epoll must be thread-safe, there are five ways we can
    be hit during an epoll_wait() transfer loop (ep_send_events()):

    1) The epoll fd going away and calling ep_free
       This just can't happen, since we did an fget() in sys_epoll_wait

    2) An epoll_ctl(EPOLL_CTL_DEL)
       This can't happen because epoll_ctl() gets ep->sem in write, and
       we're holding it in read during ep_send_events()

    3) An fd stored inside the epoll fd going away
       This can't happen because in eventpoll_release_file() we get
       ep->sem in write, and we're holding it in read during
       ep_send_events()

    4) Another epoll_wait() happening on another thread
       They both can be inside ep_send_events() at the same time. We get
       (splice) the ready-list under the spinlock, so each one will get
       its own ready list. Note that an fd cannot be inside more than one
       ready list at the same time, because ep_poll_callback() will not
       re-queue it if it sees it already linked:

       if (ep_is_linked(&epi->rdllink))
                goto is_linked;

       Another case that can happen is two concurrent epoll_wait() calls
       coming in with a userspace event buffer of size, say, ten.
       Suppose there are 50 events ready in the list. The first
       epoll_wait() will "steal" the whole list, while the second, seeing
       no events, will go to sleep. But at the end of ep_send_events() in
       the first epoll_wait(), we will re-inject surplus ready fds, and we
       will trigger the proper wake_up to the second epoll_wait().

    5) ep_poll_callback() hitting us asynchronously
       This is the tricky part. As I said above, the ep_is_linked() test
       done inside ep_poll_callback() guarantees that, as long as the
       item is linked to a list, ep_poll_callback() will not try to
       re-queue it (read, write data on any of its members). When we do
       a list_del() in ep_send_events(), the item will still satisfy the
       ep_is_linked() test (whatever data is written in prev/next, it'll
       never be its own pointer), so ep_poll_callback() will still leave
       us alone. It's only after the eventual
       smp_mb()+INIT_LIST_HEAD(&epi->rdllink) that it'll become visible
       to ep_poll_callback(), but at that point we're already past it.
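
    The invariant that cases 4 and 5 rely on can be sketched in plain
    userspace C. This is only an illustration under stated assumptions,
    not the code in fs/eventpoll.c: the list helpers are simplified
    stand-ins for the kernel's <linux/list.h>, list_del() writes NULL
    where the kernel writes its poison values, poll_callback() and
    demo() are hypothetical names, and there is no spinlock or memory
    barrier because everything runs on a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink e from its list. Its own pointers become NULL (assumption: the
 * kernel stores poison values instead), so they never point back at e. */
static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = NULL;
	e->prev = NULL;
}

/* Move every entry of src to the front of dst and leave src empty. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
	struct list_head *first = src->next, *last = src->prev;

	if (list_empty(src))
		return;
	first->prev = dst;
	last->next = dst->next;
	dst->next->prev = last;
	dst->next = first;
	init_list_head(src);
}

struct item {
	struct list_head rdllink;
	int fd;
};

/* "Linked" means: the node does not point at itself. */
static bool ep_is_linked(const struct list_head *p)
{
	return !list_empty(p);
}

/* Hypothetical stand-in for ep_poll_callback(): queue the item on the
 * ready list unless it is already linked. Returns true if it queued. */
static bool poll_callback(struct list_head *rdllist, struct item *it)
{
	if (ep_is_linked(&it->rdllink))
		return false;
	list_add_tail(&it->rdllink, rdllist);
	return true;
}

/* Walk one item through the states described in cases 4 and 5. */
static int demo(void)
{
	struct list_head rdllist, txlist;
	struct item it = { .fd = 3 };

	init_list_head(&rdllist);
	init_list_head(&txlist);
	init_list_head(&it.rdllink);

	assert(poll_callback(&rdllist, &it));   /* first event queues it */
	assert(!poll_callback(&rdllist, &it));  /* already linked: skipped */

	list_splice_init(&rdllist, &txlist);    /* waiter steals the list */
	assert(list_empty(&rdllist));
	assert(!poll_callback(&rdllist, &it));  /* linked on txlist: skipped */

	list_del(&it.rdllink);                  /* unlinked, not yet re-inited */
	assert(ep_is_linked(&it.rdllink));      /* still looks linked */
	assert(!poll_callback(&rdllist, &it));

	init_list_head(&it.rdllink);            /* now visible to the callback */
	assert(poll_callback(&rdllist, &it));
	return 0;
}
```

    Calling demo() from a main() exercises each state in turn. The
    point of the sketch is the window between list_del() and the
    re-initialization: the item belongs to no list, yet the callback
    still treats it as linked and leaves it alone.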

[akpm@osdl.org: 80 cols]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:01 -07:00
9p v9fs: don't use primary fid when removing file 2007-04-24 08:23:08 -07:00
adfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
affs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
afs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
autofs [PATCH] Mark struct super_operations const 2007-02-12 09:48:47 -08:00
autofs4 [PATCH] autofs4: fix race in unhashed dentry code 2007-04-12 15:31:42 -07:00
befs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
bfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
cifs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
coda slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
configfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
cramfs mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
debugfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
devpts devpts: add fsnotify create event 2007-05-08 11:14:59 -07:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw 2007-05-07 12:26:27 -07:00
ecryptfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
efs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
exportfs
ext2 ext2/3/4: fix file date underflow on ext2 3 filesystems on 64 bit systems 2007-05-08 11:14:58 -07:00
ext3 ext3: dirindex error pointer issues 2007-05-08 11:15:01 -07:00
ext4 ext3: dirindex error pointer issues 2007-05-08 11:15:01 -07:00
fat is_power_of_2 in fat 2007-05-08 11:14:59 -07:00
freevxfs freevxfs: possible null pointer dereference fix 2007-05-08 11:14:59 -07:00
fuse Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
gfs2 Factor outstanding I/O error handling 2007-05-08 11:14:57 -07:00
hfs is_power_of_2 in fs/hfs 2007-05-08 11:14:59 -07:00
hfsplus is_power_of_2 in fs/hfs 2007-05-08 11:14:59 -07:00
hostfs uml: hostfs style fixes 2007-05-08 11:14:57 -07:00
hpfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
hppfs [PATCH] Mark struct super_operations const 2007-02-12 09:48:47 -08:00
hugetlbfs hugetlbfs: add NULL check in hugetlb_zero_setup() 2007-05-07 12:12:57 -07:00
isofs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
jbd [PATCH] jbd: wait for already submitted t_sync_datalist buffer to complete 2006-12-22 08:55:51 -08:00
jbd2 [PATCH] jbd2: wait for already submitted t_sync_datalist buffer to complete 2006-12-07 08:39:42 -08:00
jffs2 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
jfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
lockd Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
minix slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
msdos [PATCH] mark struct inode_operations const 2 2007-02-12 09:48:46 -08:00
ncpfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
nfs Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
nfs_common [PATCH] nfs_common endianness annotations 2006-10-20 10:26:41 -07:00
nfsd Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
nls [PATCH] fs: make nls_cp936.c handle some U00XY characters and U20AC correctly 2006-12-07 08:39:46 -08:00
ntfs mm: move common segment checks to separate helper function 2007-05-08 11:14:57 -07:00
ocfs2 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
openpromfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
partitions mm: optimize acorn partition truncate 2007-05-07 12:12:55 -07:00
proc reduce size of task_struct on 64-bit machines 2007-05-08 11:14:58 -07:00
qnx4 slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
ramfs [PATCH] Mark struct super_operations const 2007-02-12 09:48:47 -08:00
reiserfs reiserfs: correct misspelled "REISERFS_PROC_INFO" to "CONFIG_REISERFS_PROC_INFO" 2007-05-08 11:15:00 -07:00
romfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
smbfs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
sysfs remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
sysv slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
udf slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
ufs slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
vfat [PATCH] mark struct inode_operations const 3 2007-02-12 09:48:46 -08:00
xfs mm: move common segment checks to separate helper function 2007-05-08 11:14:57 -07:00
aio.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
attr.c
bad_inode.c [PATCH] mark struct inode_operations const 1 2007-02-12 09:48:46 -08:00
binfmt_aout.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
binfmt_elf_fdpic.c [PATCH] fix page leak during core dump 2007-04-02 10:06:08 -07:00
binfmt_elf.c [PATCH] fix page leak during core dump 2007-04-02 10:06:08 -07:00
binfmt_em86.c
binfmt_flat.c [PATCH] uclinux: correctly remap bin_fmtflat exe allocated mem regions 2007-02-09 10:45:33 -08:00
binfmt_misc.c [PATCH] Mark struct super_operations const 2007-02-12 09:48:47 -08:00
binfmt_script.c
binfmt_som.c [PARISC] Fix fs/binfmt_som.c 2006-10-04 06:51:26 -06:00
bio.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
block_dev.c is_power_of_2 in fs/block_dev.c 2007-05-08 11:14:59 -07:00
buffer.c block_write_full_page(): report ENOSPC 2007-05-08 11:14:57 -07:00
char_dev.c [PATCH] remove protection of LANANA-reserved majors 2007-04-04 21:12:47 -07:00
compat_ioctl.c [PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems 2007-05-02 19:27:21 +02:00
compat.c [PATCH] x86-64: Print type and size correctly for unknown compat ioctls 2007-05-02 19:27:21 +02:00
dcache.c mm: shrink parent dentries when shrinking slab 2007-05-08 11:14:58 -07:00
dcookies.c [PATCH] slab: remove kmem_cache_t 2006-12-07 08:39:25 -08:00
direct-io.c [PATCH] dio: lock refcount operations 2006-12-10 09:57:21 -08:00
dnotify.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
dquot.c mm: remove destroy_dirty_buffers from invalidate_bdev() 2007-05-07 12:12:55 -07:00
drop_caches.c [PATCH] remove invalidate_inode_pages() 2007-02-11 10:51:31 -08:00
eventpoll.c epoll: optimizations and cleanups 2007-05-08 11:15:01 -07:00
exec.c exec: fix remove_arg_zero 2007-05-08 11:15:00 -07:00
fcntl.c [PATCH] fdtable: Make fdarray and fdsets equal in size 2006-12-10 09:57:22 -08:00
fifo.c
file_table.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
file.c [PATCH] fdtable: Provide free_fdtable() wrapper 2006-12-22 08:55:50 -08:00
filesystems.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
fs-writeback.c Write back inode data pages even when the inode itself is locked 2007-01-26 12:53:20 -08:00
generic_acl.c
inode.c slab allocators: Remove SLAB_DEBUG_INITIAL flag 2007-05-07 12:12:57 -07:00
inotify_user.c [PATCH] inotify: read return val fix 2007-02-12 09:48:28 -08:00
inotify.c [PATCH] severing fs.h, radix-tree.h -> sched.h 2006-12-04 02:00:24 -05:00
internal.h
ioctl.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
ioprio.c [PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task 2007-02-12 09:48:32 -08:00
Kconfig Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 2007-05-04 19:55:11 -07:00
Kconfig.binfmt blackfin architecture 2007-05-07 12:12:58 -07:00
libfs.c [PATCH] shmem and simple const super_operations 2007-03-05 07:57:51 -08:00
locks.c Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux 2007-05-07 12:34:24 -07:00
Makefile Remove JFFS (version 1), as scheduled. 2007-02-17 16:10:59 -05:00
mbcache.c [PATCH] slab: remove kmem_cache_t 2006-12-07 08:39:25 -08:00
mpage.c Factor outstanding I/O error handling 2007-05-08 11:14:57 -07:00
namei.c mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
namespace.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
nfsctl.c
no-block.c
open.c [PATCH] fdtable: Make fdarray and fdsets equal in size 2006-12-10 09:57:22 -08:00
pipe.c [PATCH] AUDIT_FD_PAIR 2007-02-17 21:30:15 -05:00
pnode.c [PATCH] rename struct namespace to struct mnt_namespace 2006-12-08 08:28:51 -08:00
pnode.h [PATCH] rename struct namespace to struct mnt_namespace 2006-12-08 08:28:51 -08:00
posix_acl.c
quota_v1.c
quota_v2.c
quota.c
read_write.c use use SEEK_MAX to validate user lseek arguments 2007-05-08 11:14:59 -07:00
read_write.h
readdir.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
select.c [PATCH] fdtable: Make fdarray and fdsets equal in size 2006-12-10 09:57:22 -08:00
seq_file.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
splice.c [PATCH] splice: partial write fix 2007-03-29 14:26:42 +02:00
stack.c [PATCH] fs/stack.c: Copy i_nlink after all other attributes are copied 2007-02-19 14:21:50 -08:00
stat.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00
super.c the overdue removal of the mount/umount uevents 2007-04-27 10:57:31 -07:00
sync.c [PATCH] Turn do_sync_file_range() into do_sync_mapping_range() 2007-04-26 15:02:26 -07:00
utimes.c [PATCH] severing fs.h, radix-tree.h -> sched.h 2006-12-04 02:00:24 -05:00
xattr_acl.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
xattr.c [PATCH] VFS: change struct file to use struct path 2006-12-08 08:28:41 -08:00