kernel-ark/fs
Quentin Barnes 6cb2a21049 aio: bad AIO race in aio_complete() leads to process hang
My group ran into a AIO process hang on a 2.6.24 kernel with the process
sleeping indefinitely in io_getevents(2) waiting for the last wakeup to come
and it never would.

We ran the tests on x86_64 SMP.  The hang only occurred on a Xeon box
("Clovertown") but not a Core2Duo ("Conroe").  On the Xeon, the L2 cache isn't
shared between all eight processors, but is L2 is shared between between all
two processors on the Core2Duo we use.

My analysis of the hang is if you go down to the second while-loop
in read_events(), what happens on processor #1:
	1) add_wait_queue_exclusive() adds thread to ctx->wait
	2) aio_read_evt() to check tail
	3) if aio_read_evt() returned 0, call [io_]schedule() and sleep

In aio_complete() with processor #2:
	A) info->tail = tail;
	B) waitqueue_active(&ctx->wait)
	C) if waitqueue_active() returned non-0, call wake_up()

The way the code is written, step 1 must be seen by all other processors
before processor 1 checks for pending events in step 2 (that were recorded by
step A) and step A by processor 2 must be seen by all other processors
(checked in step 2) before step B is done.

The race I believed I was seeing is that steps 1 and 2 were
effectively swapped due to the __list_add() being delayed by the L2
cache not shared by some of the other processors.  Imagine:
proc 2: just before step A
proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
proc 1, step 2: checks tail and sees no pending events
proc 2, step A: updates tail
proc 1, step 3: calls [io_]schedule() and sleeps
proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
                so proc 1 sleeps indefinitely

My patch adds a memory barrier between steps A and B.  It ensures that the
update in step 1 gets seen on processor 2 before continuing.  If processor 1
was just before step 1, the memory barrier makes sure that step A (update
tail) gets seen by the time processor 1 makes it to step 2 (check tail).

Before the patch our AIO process would hang virtually 100% of the time.  After
the patch, we have yet to see the process ever hang.

Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
Reviewed-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ We should probably disallow that "if (waitqueue_active()) wake_up()"
  coding pattern, because it's so often buggy wrt memory ordering ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19 18:53:35 -07:00
..
9p Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p) 2008-02-07 08:42:26 -08:00
adfs mount options: fix adfs 2008-02-08 09:22:39 -08:00
affs mount options: fix affs 2008-02-08 09:22:39 -08:00
afs Use path_put() in a few places instead of {mnt,d}put() 2008-02-14 21:13:33 -08:00
autofs mount options: fix autofs 2008-02-08 09:22:40 -08:00
autofs4 Introduce path_put() 2008-02-14 21:13:33 -08:00
befs mount options: fix befs 2008-02-08 09:22:40 -08:00
bfs iget: stop BFS from using iget() and read_inode() 2008-02-07 08:42:27 -08:00
cifs [CIFS] remove unused variable 2008-02-26 03:44:02 +00:00
coda Introduce path_put() 2008-02-14 21:13:33 -08:00
configfs Introduce path_put() 2008-02-14 21:13:33 -08:00
cramfs
debugfs debugfs: fix sparse warnings 2008-03-04 14:47:06 -08:00
devpts mount options: fix devpts 2008-02-08 09:22:40 -08:00
dlm dlm: fix rcom_names message to self 2008-02-21 15:19:54 -06:00
ecryptfs eCryptfs: make ecryptfs_prepare_write decrypt the page 2008-03-04 16:35:16 -08:00
efs efs: move headers out of include/linux/ 2008-02-23 17:12:15 -08:00
exportfs
ext2 mount options: fix ext2 2008-02-08 09:22:40 -08:00
ext3 ext3: fix mount option parsing 2008-03-04 16:35:18 -08:00
ext4 ext4: add missing ext4_journal_stop() 2008-02-25 15:37:42 -05:00
fat mount options: fix fat 2008-02-08 09:22:40 -08:00
freevxfs iget: stop FreeVXFS from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
fuse fuse: fix permission checking 2008-02-23 17:12:13 -08:00
gfs2 Introduce path_put() 2008-02-14 21:13:33 -08:00
hfs hfs_bnode_find() can fail, resulting in hfs_bnode_split() breakage 2008-03-17 09:46:55 -07:00
hfsplus fs/hfsplus/unicode.c: fix uninitialized var warning 2008-02-08 09:22:36 -08:00
hostfs uml: fix hostfs tv_usec calculations 2008-02-05 09:44:30 -08:00
hpfs mount options: fix hpfs 2008-02-08 09:22:40 -08:00
hppfs iget: stop HPPFS from using iget() and read_inode() 2008-02-07 08:42:29 -08:00
hugetlbfs hugetlb: allow sticky directory mount option 2008-02-05 09:44:14 -08:00
isofs mount options: fix isofs 2008-02-08 09:22:40 -08:00
jbd docbook: fix filesystems.tmpl source files 2008-03-03 10:47:13 -08:00
jbd2 JBD2: Clear buffer_ordered flag for barried IO request on success 2008-02-10 01:09:32 -05:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2008-02-07 10:20:31 -08:00
jfs BKL-removal: Implement a compat_ioctl handler for JFS 2008-02-07 13:45:29 -06:00
lockd Wrap buffers used for rpc debug printks into RPC_IFDEBUG 2008-02-21 18:42:29 -05:00
minix iget: stop the MINIX filesystem from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
msdos
ncpfs mount options: fix ncpfs 2008-02-08 09:22:40 -08:00
nfs Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 2008-03-07 12:08:07 -08:00
nfs_common
nfsd nfsd: fix oops on access from high-numbered ports 2008-03-14 16:49:15 -07:00
nls
ntfs is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
ocfs2 ocfs2: Fix NULL pointer dereferences in o2net 2008-03-10 15:14:19 -07:00
openpromfs iget: stop OPENPROMFS from using iget() and read_inode() 2008-02-07 08:42:29 -08:00
partitions Enhanced partition statistics: remove old partition statistics 2008-02-08 12:42:01 +01:00
proc [PATCH] export sessionid alongside the loginuid in procfs 2008-03-18 10:51:22 -04:00
qnx4 iget: stop QNX4 from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
ramfs
reiserfs fs/reiserfs/super.c: correct use of ! and & 2008-03-04 16:35:16 -08:00
romfs iget: stop ROMFS from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
smbfs fs/smbfs/inode.c: fix warning message deprecating smbfs 2008-02-13 16:21:19 -08:00
sysfs sysfs: remove BUG_ON() from sysfs_remove_group() 2008-02-07 11:31:46 -08:00
sysv iget: stop the SYSV filesystem from using iget() and read_inode() 2008-02-07 08:42:29 -08:00
udf udf: fix udf_add_free_space 2008-02-13 16:21:20 -08:00
ufs ufs: fix parenthesisation in ufs_set_fs_state() 2008-02-23 17:12:13 -08:00
vfat Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p) 2008-02-07 08:42:26 -08:00
xfs [XFS] fix inode leak in xfs_iget_core() 2008-03-06 16:38:50 +11:00
aio.c aio: bad AIO race in aio_complete() leads to process hang 2008-03-19 18:53:35 -07:00
anon_inodes.c
attr.c
bad_inode.c iget: introduce a function to register iget failure 2008-02-07 08:42:26 -08:00
binfmt_aout.c aout: suppress A.OUT library support if !CONFIG_ARCH_SUPPORTS_AOUT 2008-02-08 09:22:30 -08:00
binfmt_elf_fdpic.c
binfmt_elf.c core dump: user_regset writeback 2008-03-04 16:35:10 -08:00
binfmt_em86.c
binfmt_flat.c FLAT binaries: drop BINFMT_FLAT bad header magic warning 2008-02-14 20:58:05 -08:00
binfmt_misc.c
binfmt_script.c
binfmt_som.c aout: remove unnecessary inclusions of {asm, linux}/a.out.h 2008-02-08 09:22:30 -08:00
bio.c Revert "unexport bio_{,un}map_user" 2008-03-17 21:14:40 +01:00
block_dev.c fs/block_dev.c: remove #if 0'ed code 2008-02-19 10:04:00 +01:00
buffer.c vfs: fix NULL pointer dereference in fsync_buffers_list() 2008-03-04 16:35:10 -08:00
char_dev.c fs/char_dev.c: chrdev_open marked static and removed from fs.h 2008-02-08 09:22:42 -08:00
compat_binfmt_elf.c x86: compat_binfmt_elf 2008-01-30 13:31:46 +01:00
compat_ioctl.c d_path: Make d_path() use a struct path 2008-02-14 21:17:09 -08:00
compat.c Merge branch 'linus_origin' into hotfixes 2008-02-15 13:36:30 -05:00
dcache.c dentries: Extract common code to remove dentry from lru 2008-02-14 21:17:09 -08:00
dcookies.c d_path: Make d_path() use a struct path 2008-02-14 21:17:09 -08:00
direct-io.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
dnotify.c
dquot.c Introduce path_put() 2008-02-14 21:13:33 -08:00
drop_caches.c
eventfd.c fs/eventfd.c should #include <linux/syscalls.h> 2008-02-06 10:41:03 -08:00
eventpoll.c lockdep: annotate epoll 2008-02-05 09:44:07 -08:00
exec.c Allow ARG_MAX execve string space even with a small stack limit 2008-03-03 10:12:14 -08:00
fcntl.c fs: remove fastcall, it is always empty 2008-02-08 09:22:31 -08:00
fifo.c
file_table.c fs: remove fastcall, it is always empty 2008-02-08 09:22:31 -08:00
file.c get rid of NR_OPEN and introduce a sysctl_nr_open 2008-02-06 10:41:06 -08:00
filesystems.c
fs-writeback.c writeback: speed up writeback of big dirty files 2008-02-05 09:44:19 -08:00
generic_acl.c
inode.c iget: remove iget() and the read_inode() super op as being obsolete 2008-02-07 08:42:29 -08:00
inotify_user.c Introduce path_put() 2008-02-14 21:13:33 -08:00
inotify.c inotify: remove debug code 2008-02-06 10:41:07 -08:00
internal.h
ioctl.c fix up kerneldoc in fs/ioctl.c a little bit 2008-02-09 11:08:33 -08:00
ioprio.c cfq-iosched: relax IOPRIO_CLASS_IDLE restrictions 2008-01-28 11:38:15 +01:00
Kconfig deprecate smbfs in favour of cifs 2008-02-05 14:37:15 -08:00
Kconfig.binfmt aout: suppress A.OUT library support if !CONFIG_ARCH_SUPPORTS_AOUT 2008-02-08 09:22:30 -08:00
libfs.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
locks.c Pidns: make full use of xxx_vnr() calls 2008-02-08 09:22:29 -08:00
Makefile x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
mbcache.c
mpage.c docbook: fix filesystems.tmpl source files 2008-03-03 10:47:13 -08:00
namei.c Use struct path in fs_struct 2008-02-14 21:13:33 -08:00
namespace.c d_path: Make seq_path() use a struct path argument 2008-02-14 21:17:08 -08:00
nfsctl.c Introduce path_put() 2008-02-14 21:13:33 -08:00
no-block.c
open.c Make set_fs_{root,pwd} take a struct path 2008-02-14 21:13:33 -08:00
pipe.c kernel-doc: fix fs/pipe.c notation 2008-02-13 16:21:19 -08:00
pnode.c MNT_UNBINDABLE fix 2008-02-06 10:41:02 -08:00
pnode.h
posix_acl.c
quota_v1.c
quota_v2.c
quota.c Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p) 2008-02-07 08:42:26 -08:00
read_write.c remove the unused exports of sys_open/sys_read 2008-02-08 09:22:36 -08:00
read_write.h
readdir.c Use mutex_lock_killable in vfs_readdir 2007-12-06 17:39:54 -05:00
select.c make sys_poll() wait at least timeout ms 2008-02-06 10:41:09 -08:00
seq_file.c d_path: Make d_path() use a struct path 2008-02-14 21:17:09 -08:00
signalfd.c fs/signalfd.c should #include <linux/syscalls.h> 2008-02-06 10:41:03 -08:00
splice.c splice: only return -EAGAIN if there's hope of more data 2008-03-04 11:14:39 +01:00
stack.c
stat.c Introduce path_put() 2008-02-14 21:13:33 -08:00
super.c LSM/SELinux: Interfaces to allow FS to control mount options 2008-03-06 08:40:53 +11:00
sync.c
timerfd.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
utimes.c Introduce path_put() 2008-02-14 21:13:33 -08:00
xattr_acl.c
xattr.c Introduce path_put() 2008-02-14 21:13:33 -08:00