kernel-ark/kernel
Eric Dumazet 34f01cc1f5 FUTEX: new PRIVATE futexes
Analysis of current linux futex code :
  --------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.

Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
	(rollback is value != expected_value, returns EWOULDBLOCK)
	(various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part  may be done
   by the futex_wake())

Futexes were designed to improve scalability but current implementation has
various problems :

- Central hashtable :

  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace.  As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).  We
chose to force all futexes be 'shared'.  This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefuly implemented.  Time has come for linux
to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
prepared to fallback on old subcommands for old kernels.  Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net>
Cc: "Ulrich Drepper" <drepper@gmail.com>
Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
..
irq Fix Linuxdoc comment 2007-05-09 12:30:48 -07:00
power PM: Separate hibernation code from suspend code 2007-05-09 12:30:48 -07:00
time Fix printk format warnings in timer_list.c 2007-05-09 12:30:50 -07:00
.gitignore gitignore: ignore more generated files 2006-01-03 11:35:26 +01:00
acct.c [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_path 2006-12-08 08:28:42 -08:00
audit.c audit: add spaces on either side of case "..." operator. 2007-05-08 11:15:09 -07:00
audit.h [PATCH] audit: AUDIT_PERM support 2006-09-11 13:32:30 -04:00
auditfilter.c [PATCH] minor update to rule add/delete messages (ver 2) 2007-02-17 21:30:09 -05:00
auditsc.c [PATCH] fix deadlock in audit_log_task_context() 2007-03-14 15:27:48 -07:00
capability.c [PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task 2007-02-12 09:48:32 -08:00
compat.c [PATCH] Common compat_sys_sysinfo 2007-02-11 10:51:32 -08:00
configs.c use simple_read_from_buffer in kernel/ 2007-05-09 12:30:49 -07:00
cpu.c Remove kthread_bind() call from _cpu_down() 2007-05-09 12:30:53 -07:00
cpuset.c use simple_read_from_buffer in kernel/ 2007-05-09 12:30:49 -07:00
delayacct.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
die_notifier.c move die notifier handling to common code 2007-05-08 11:15:04 -07:00
dma.c [PATCH] struct seq_operations and struct file_operations constification 2006-12-07 08:39:46 -08:00
exec_domain.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
exit.c change kernel threads to ignore signals instead of blocking them 2007-05-09 12:30:53 -07:00
extable.c [PATCH] symbol_put_addr() locks kernel 2006-05-15 11:20:55 -07:00
fork.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
futex_compat.c futex_requeue_pi optimization 2007-05-09 12:30:55 -07:00
futex.c FUTEX: new PRIVATE futexes 2007-05-09 12:30:55 -07:00
hrtimer.c export hrtimer_forward 2007-05-08 11:15:15 -07:00
itimer.c The scheduled -EINVAL for invalid timevals in setitimer 2007-05-08 11:15:13 -07:00
kallsyms.c kallsyms: cleanup: use seq_release_private() where appropriate 2007-05-08 11:15:09 -07:00
Kconfig.hz [PATCH] HZ: 300Hz support 2006-12-07 08:39:36 -08:00
Kconfig.preempt [PATCH] sched: voluntary kernel preemption 2005-06-25 16:24:45 -07:00
kexec.c kdump/kexec: calculate note size at compile time 2007-05-08 11:15:07 -07:00
kfifo.c [PATCH] Numerous fixes to kernel-doc info in source files. 2007-02-11 10:51:32 -08:00
kmod.c wait_for_helper: remove unneeded do_sigaction() 2007-05-09 12:30:53 -07:00
kprobes.c Kprobes: The ON/OFF knob thru debugfs 2007-05-08 11:15:19 -07:00
ksysfs.c remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
kthread.c change kernel threads to ignore signals instead of blocking them 2007-05-09 12:30:53 -07:00
latency.c [PATCH] severing module.h->sched.h 2006-12-04 02:00:22 -05:00
lockdep_internals.h [PATCH] lockdep: more chains 2006-12-07 08:39:43 -08:00
lockdep_proc.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
lockdep.c lockdep: removed unused ip argument in mark_lock & mark_held_locks 2007-05-08 11:15:13 -07:00
Makefile move die notifier handling to common code 2007-05-08 11:15:04 -07:00
module.c Fix race between cat /proc/slab_allocators and rmmod 2007-05-08 11:15:08 -07:00
mutex-debug.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
mutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
mutex.c [PATCH] lockdep: avoid lockdep warning in md 2006-12-08 08:28:39 -08:00
mutex.h [PATCH] lockdep: prove mutex locking correctness 2006-07-03 15:27:04 -07:00
nsproxy.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
panic.c [PATCH] Add TAINT_USER and ability to set taint flags from userspace 2007-02-11 10:51:29 -08:00
params.c kernel/params.c: fix lying comment for param_array() 2007-05-08 11:15:08 -07:00
pid.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
posix-cpu-timers.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
posix-timers.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
printk.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
profile.c [PATCH] proc: remove useless (and buggy) ->nlink settings 2007-02-11 10:51:32 -08:00
ptrace.c [PATCH] pidspace: is_init() 2006-09-29 09:18:12 -07:00
rcupdate.c [PATCH] rcu: add a prefetch() in rcu_do_batch() 2006-12-07 08:39:40 -08:00
rcutorture.c rcutorture: Remove redundant assignment to cur_ops in for loop 2007-05-08 11:15:17 -07:00
relay.c relay: use plain timer instead of delayed work 2007-05-09 12:30:51 -07:00
resource.c libata/IDE: remove combined mode quirk 2007-04-28 14:15:59 -04:00
rtmutex_common.h futex_requeue_pi optimization 2007-05-09 12:30:55 -07:00
rtmutex-debug.c Remove all inclusions of <linux/config.h> 2006-10-04 03:38:54 -04:00
rtmutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rtmutex-tester.c [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
rtmutex.c futex_requeue_pi optimization 2007-05-09 12:30:55 -07:00
rtmutex.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rwsem.c Lockdep treats down_write_trylock like regular down_write 2007-05-08 11:15:09 -07:00
sched.c Eliminate lock_cpu_hotplug in kernel/schedc 2007-05-09 12:30:51 -07:00
seccomp.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
signal.c change kernel threads to ignore signals instead of blocking them 2007-05-09 12:30:53 -07:00
softirq.c [PATCH] tick-management: dyntick / highres functionality 2007-02-16 08:13:59 -08:00
softlockup.c add touch_all_softlockup_watchdogs() 2007-05-08 11:15:06 -07:00
spinlock.c [PATCH] lockdep: spin_lock_irqsave_nested() 2006-11-25 13:28:34 -08:00
srcu.c [PATCH] SRCU: report out-of-memory errors 2006-10-04 07:55:30 -07:00
stacktrace.c [PATCH] lockdep: stacktrace subsystem, core 2006-07-03 15:27:02 -07:00
stop_machine.c Use stop_machine_run in the Intel RNG driver 2007-05-08 11:15:00 -07:00
sys_ni.c [PATCH] Create compat_sys_migrate_pages 2006-11-03 12:27:59 -08:00
sys.c Extend notifier_call_chain to count nr_calls made 2007-05-09 12:30:51 -07:00
sysctl.c proc: maps protection 2007-05-08 11:15:02 -07:00
taskstats.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
time.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
timer.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
tsacct.c [PATCH] time: x86_64: split x86_64/kernel/time.c up 2007-02-16 08:14:00 -08:00
uid16.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
user.c [PATCH] slab: remove kmem_cache_t 2006-12-07 08:39:25 -08:00
utsname_sysctl.c [PATCH] sysctl: remove insert_at_head from register_sysctl 2007-02-14 08:09:59 -08:00
utsname.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
wait.c [PATCH] uninline init_waitqueue_head() 2006-07-10 13:24:25 -07:00
workqueue.c make cancel_rearming_delayed_work() reliable 2007-05-09 12:30:53 -07:00