kernel-ark/kernel
Steven Rostedt c92211d9b7 sched/cpupri: Remove the vec->lock

The cpupri vec->lock has been showing up as a top source of contention
lately. This is because the RT push/pull logic takes an
aggressive approach to migrating RT tasks. The cpupri logic is
in place to improve the performance of the push/pull when dealing
with machines that have a large number of CPUs.

The problem though is a vec->lock is required, where a vec is a
global per RT priority structure. That is, if there are lots of
RT tasks at the same priority, every time they are added or removed
from the RT queue, this global vec->lock is taken. Now that more
kernel threads are becoming RT (RCU boost and threaded interrupts)
this is becoming much more of an issue.

There are two variables that are being synced by the vec->lock:
the cpupri bitmask and the vec->count. The cpupri bitmask
is one bit per priority. If an RT priority vec has a process queued,
then the vec->count is > 0 and the cpupri bitmask is set for that
RT priority.

If the cpupri bitmask gets out of sync with the vec->count, we could
end up pushing a low priority RT task to a high priority queue.
That RT task, which could have run immediately, could be queued on a
run queue with a higher priority task indefinitely.

The solution is not to use the cpupri bitmask and just look at the
vec->count directly when doing a pull. The cpupri bitmask is just
a fast way to scan the RT priorities when a pull is made. If, instead
of using the bitmask, we just examine all RT priorities and
look at the vec->counts, we can eliminate the vec->lock. The
scan of RT tasks is to find a run queue that we can push an RT task
to, and we do not push to a high priority queue, thus the scan only
needs to go from 1 to RT task->prio, and not all 100 RT priorities.
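
The bounded scan described above can be sketched in user-space C11. This is
only an illustration, not the kernel code: the names prio_vec, vecs, NR_PRIO
and find_lower_prio are hypothetical, the convention that a higher index
means a higher priority is an assumption of the sketch, and it ignores
everything else a real lookup has to do (such as CPU affinity checks).

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_PRIO 102	/* hypothetical number of priority levels */

/* Hypothetical per-priority vector; only the counter is modeled. */
struct prio_vec {
	atomic_int count;	/* > 0 when something is queued at this level */
};

struct prio_vec vecs[NR_PRIO];

/*
 * Walk the levels below task_prio, lowest first, and return the first
 * level whose count is non-zero, or -1 if there is none. No bitmask and
 * no lock are involved; each level is a single atomic read.
 */
int find_lower_prio(int task_prio)
{
	assert(task_prio >= 0 && task_prio <= NR_PRIO);

	for (int idx = 0; idx < task_prio; idx++) {
		if (atomic_load(&vecs[idx].count) > 0)
			return idx;
	}
	return -1;
}
```

The loop never looks at levels at or above task_prio, which is what keeps
the scan short for low priority tasks.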

The push algorithm, which does the scan of RT priorities (and
scan of the bitmask) only happens when we have an overloaded RT run
queue (more than one RT task queued). The grabbing of the vec->lock
happens every time any RT task is queued or dequeued on the run
queue for that priority. The slowdown of the scan from not using
a bitmask is negligible compared to the speedup from removing the vec->lock
contention and replacing it with an atomic counter and memory barrier.
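
As a rough user-space sketch of "an atomic counter and memory barrier" in
place of a lock, the update on queue/dequeue could look like the following
C11 code. The names (prio_vec, vecs, set_queue_prio, PRIO_INVALID) are
hypothetical, and the ordering chosen here, raising the new level's count
before dropping the old one with a full fence in between, is an assumption
of the sketch rather than a statement about the patch itself.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_PRIO		102	/* hypothetical number of priority levels */
#define PRIO_INVALID	(-1)	/* hypothetical "not accounted anywhere" */

struct prio_vec {
	atomic_int count;
};

struct prio_vec vecs[NR_PRIO];

/*
 * Move one run queue's accounting from *cur_prio to new_prio without
 * taking any lock. The new level is incremented first so that a
 * concurrent scanner does not see the queue missing from both levels.
 */
void set_queue_prio(int *cur_prio, int new_prio)
{
	int old_prio = *cur_prio;

	assert(new_prio >= PRIO_INVALID && new_prio < NR_PRIO);
	if (new_prio == old_prio)
		return;

	if (new_prio != PRIO_INVALID)
		atomic_fetch_add_explicit(&vecs[new_prio].count, 1,
					  memory_order_relaxed);

	/* Full barrier between the two counter updates. */
	atomic_thread_fence(memory_order_seq_cst);

	if (old_prio != PRIO_INVALID)
		atomic_fetch_sub_explicit(&vecs[old_prio].count, 1,
					  memory_order_relaxed);

	*cur_prio = new_prio;
}
```

Two CPUs changing to the same priority now touch the same counter with
atomic operations instead of serializing on a shared lock.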

To prove this, I wrote a patch that times both the loop and the code
that grabs the vec->locks. I passed the patches to various people
(and companies) to test and show the results. I let everyone choose
their own load to test, giving different loads on the system,
for various different setups.

Here are some of the results (snipped to a few CPUs to keep
this change log from becoming huge, but the results were consistent
across the entire system).

System 1 (24 CPUs)

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
[...]
cpu 20: loop    3057    1.766   0.061   0.642   1963.170
        vec     6782949 90.469  0.089   0.414   2811760.503
cpu 21: loop    2617    1.723   0.062   0.641   1679.074
        vec     6782810 90.499  0.089   0.291   1978499.900
cpu 22: loop    2212    1.863   0.063   0.699   1547.160
        vec     6767244 85.685  0.089   0.435   2949676.898
cpu 23: loop    2320    2.013   0.062   0.594   1380.265
        vec     6781694 87.923  0.088   0.431   2928538.224

After patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 20: loop    2078    1.579   0.061   0.533   1108.006
        vec     6164555 5.704   0.060   0.143   885185.809
cpu 21: loop    2268    1.712   0.065   0.575   1305.248
        vec     6153376 5.558   0.060   0.187   1154960.469
cpu 22: loop    1542    1.639   0.095   0.533   823.249
        vec     6156510 5.720   0.060   0.190   1172727.232
cpu 23: loop    1650    1.733   0.068   0.545   900.781
        vec     6170784 5.533   0.060   0.167   1034287.953

All times are in microseconds. The 'loop' is the amount of time spent
doing the loop across the priorities (before the patch, this uses the
bitmask). The 'vec' is the amount of time in the code that requires
grabbing the vec->lock. After the patch, that code simply does not take
the vec->lock, but the measurement encompasses the same code.

Amazingly the loop code even went down on average. The vec code went
from .5 down to .18, cutting the time spent by more than half!

Note, more than one test was run, but they all had the same results.

System 2 (64 CPUs)

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 60: loop    0       0       0       0       0
        vec     5410840 277.954 0.084   0.782   4232895.727
cpu 61: loop    0       0       0       0       0
        vec     4915648 188.399 0.084   0.570   2803220.301
cpu 62: loop    0       0       0       0       0
        vec     5356076 276.417 0.085   0.786   4214544.548
cpu 63: loop    0       0       0       0       0
        vec     4891837 170.531 0.085   0.799   3910948.833

After patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 60: loop    0       0       0       0       0
        vec     5365118 5.080   0.021   0.063   340490.267
cpu 61: loop    0       0       0       0       0
        vec     4898590 1.757   0.019   0.071   347903.615
cpu 62: loop    0       0       0       0       0
        vec     5737130 3.067   0.021   0.119   687108.734
cpu 63: loop    0       0       0       0       0
        vec     4903228 1.822   0.021   0.071   348506.477

The test run during the measurement had almost no RT tasks pushing
(very few, from other CPUs). But this shows that the patch helped
out tremendously with the contention, since the vec->lock is taken
whenever a task is queued at an RT priority, even when no push occurs,
and different CPUs that queue tasks at the same priority contend
for it.

I tested on my own 4 CPU machine with the following results:

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 0:  loop    2377    1.489   0.158   0.588   1398.395
        vec     4484    770.146 2.301   4.396   19711.755
cpu 1:  loop    2169    1.962   0.160   0.576   1250.110
        vec     4425    152.769 2.297   4.030   17834.228
cpu 2:  loop    2324    1.749   0.155   0.559   1299.799
        vec     4368    779.632 2.325   4.665   20379.268
cpu 3:  loop    2325    1.629   0.157   0.561   1306.113
        vec     4650    408.782 2.394   4.348   20222.577

After patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 0:  loop    2121    1.616   0.113   0.636   1349.189
        vec     4303    1.151   0.225   0.421   1811.966
cpu 1:  loop    2130    1.638   0.178   0.644   1372.927
        vec     4627    1.379   0.235   0.428   1983.648
cpu 2:  loop    2056    1.464   0.165   0.637   1310.141
        vec     4471    1.311   0.217   0.433   1937.927
cpu 3:  loop    2154    1.481   0.162   0.601   1295.083
        vec     4236    1.253   0.230   0.425   1803.008

This was running my migrate.c code that can be found at:
http://lwn.net/Articles/425763/

The migrate code does stress the RT tasks a bit. This shows that
the loop did increase a little after the patch, but not by much.
The vec code dropped dramatically. From 4.3us down to .42us.
That's a 10x improvement!
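
The 10x figure can be rechecked from the Count and Total columns of the
4 CPU tables above. The helper below is a small sketch (the names sample,
weighted_average_us, vec_before and vec_after are made up for this
illustration); it computes a count-weighted average of the 'vec' rows
across all four CPUs.

```c
#include <assert.h>
#include <stddef.h>

/* One 'vec' row: number of samples and total time in microseconds. */
struct sample {
	double count;
	double total_us;
};

/* Count-weighted average: sum of totals divided by sum of counts. */
double weighted_average_us(const struct sample *s, size_t n)
{
	double count = 0.0, total = 0.0;

	assert(s != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		count += s[i].count;
		total += s[i].total_us;
	}
	return total / count;
}

/* 'vec' rows for cpu 0-3, copied from the 4 CPU tables. */
const struct sample vec_before[4] = {
	{ 4484, 19711.755 }, { 4425, 17834.228 },
	{ 4368, 20379.268 }, { 4650, 20222.577 },
};

const struct sample vec_after[4] = {
	{ 4303, 1811.966 }, { 4627, 1983.648 },
	{ 4471, 1937.927 }, { 4236, 1803.008 },
};
```

Calling weighted_average_us() on the two arrays gives roughly 4.36us
before and 0.43us after, a ratio of a little over 10.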

Tested-by: Mike Galbraith <mgalbraith@suse.de>
Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
Tested-by: Matthew Hank Sabins <msabins@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14 12:01:03 +02:00
debug Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb 2011-08-01 13:39:40 -10:00
events perf: Remove perf_event_attr::type check 2011-07-21 20:41:55 +02:00
gcov gcov: disable CONSTRUCTORS for UML 2011-07-26 16:49:45 -07:00
irq Merge branch 'imx/dt' into next/dt 2011-07-28 15:25:46 +00:00
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-07-25 13:56:39 -07:00
time Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-07-22 16:52:18 -07:00
trace Merge branch 'linus' into perf/urgent 2011-08-05 10:33:55 +02:00
.gitignore
acct.c
async.c async: Fixed an include coding style issue 2011-06-14 22:48:46 -04:00
audit_tree.c audit_tree,rcu: Convert call_rcu(__put_tree) to kfree_rcu() 2011-07-20 14:10:11 -07:00
audit_watch.c kill path_lookup() 2011-03-14 09:15:23 -04:00
audit.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
audit.h
auditfilter.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
auditsc.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c Merge branch 'master' into next 2011-05-19 18:51:57 +10:00
cgroup_freezer.c cgroups: add per-thread subsystem callbacks 2011-05-26 17:12:34 -07:00
cgroup.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2011-07-27 19:26:38 -07:00
compat.c Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2011-07-30 00:08:53 -07:00
configs.c kernel/configs.c: include MODULE_*() when CONFIG_IKCONFIG_PROC=n 2011-07-25 20:57:15 -07:00
cpu.c Fix common misspellings 2011-03-31 11:26:23 -03:00
cpuset.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
crash_dump.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
cred.c move RLIMIT_NPROC check from set_user() to do_execve_common() 2011-08-11 11:24:42 -07:00
delayacct.c KVM: Steal time implementation 2011-07-14 12:59:14 +03:00
dma.c
elfcore.c
exec_domain.c
exit.c ipc: introduce shm_rmid_forced sysctl 2011-07-26 16:49:44 -07:00
extable.c extable, core_kernel_data(): Make sure all archs define _sdata 2011-05-20 08:56:56 +02:00
fork.c move RLIMIT_NPROC check from set_user() to do_execve_common() 2011-08-11 11:24:42 -07:00
freezer.c Freezer: Use SMP barriers 2011-05-17 23:19:17 +02:00
futex_compat.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
futex.c Merge branch 'linus' into core/urgent 2011-08-04 09:09:27 +02:00
groups.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
hrtimer.c hrtimers: Fix typo causing erratic timers 2011-05-25 15:31:58 -07:00
hung_task.c watchdog, hung_task_timeout: Add Kconfig configurable default 2011-04-28 09:13:17 +02:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump_label: Fix jump_label update for modules 2011-06-29 09:59:17 -04:00
kallsyms.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-25 17:52:22 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks arch:Kconfig.locks Remove unused config option. 2011-04-10 17:01:05 +02:00
Kconfig.preempt sched: Isolate preempt counting in its own config option 2011-06-10 15:15:40 +02:00
kexec.c treewide: Convert uses of struct resource to resource_size(ptr) 2011-06-10 14:55:36 +02:00
kfifo.c
kmod.c Boot up with usermodehelper disabled 2011-08-03 22:03:29 -10:00
kprobes.c kprobes: Return -ENOENT if probe point doesn't exist 2011-07-15 15:11:47 -04:00
ksysfs.c kernel/ksysfs.c: expose file_caps_enabled in sysfs 2011-04-19 16:45:51 -07:00
kthread.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
latencytop.c Fix common misspellings 2011-03-31 11:26:23 -03:00
lockdep_internals.h
lockdep_proc.c lockdep: Remove unused 'factor' variable from lockdep_stats_show() 2011-03-23 13:54:47 +01:00
lockdep_states.h
lockdep.c lockdep: Clear whole lockdep_map on initialization 2011-08-04 10:17:56 +02:00
Makefile jump label: Reduce the cycle count by changing the link order 2011-08-05 23:57:33 +02:00
module.c module: add /sys/module/<name>/uevent files 2011-07-24 22:06:04 +09:30
mutex-debug.c mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c lockdep, mutex: provide mutex_lock_nest_lock 2011-05-25 08:39:17 -07:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c notifiers: sys: move reboot notifiers into reboot.h 2011-07-25 20:57:14 -07:00
nsproxy.c make sure that nsproxy_cache is initialized early enough 2011-07-20 01:44:07 -04:00
padata.c Fix common misspellings 2011-03-31 11:26:23 -03:00
panic.c panic: panic=-1 for immediate reboot 2011-07-26 16:49:45 -07:00
params.c module: add /sys/module/<name>/uevent files 2011-07-24 22:06:04 +09:30
pid_namespace.c pidns: call pid_ns_prepare_proc() from create_pid_namespace() 2011-03-23 19:46:58 -07:00
pid.c rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check 2011-07-08 22:21:58 +02:00
pm_qos_params.c plist: Remove the need to supply locks to plist heads 2011-07-08 14:02:53 +02:00
posix-cpu-timers.c hrtimers: Avoid touching inactive timer bases 2011-05-23 13:59:54 +02:00
posix-timers.c posix-timers: RCU conversion 2011-05-24 12:10:51 +02:00
printk.c cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warning 2011-08-09 18:22:22 -07:00
profile.c kernel/profile.c: remove some duplicate code from profile_hits() 2011-05-26 17:12:37 -07:00
ptrace.c connector: add an event for monitoring process tracers 2011-07-18 21:38:33 +02:00
range.c
rcupdate.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
rcutiny_plugin.h rcu: Converge TINY_RCU expedited and normal boosting 2011-05-05 23:16:58 -07:00
rcutiny.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
rcutorture.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
rcutree_plugin.h softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
rcutree_trace.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
rcutree.c rcu: Prevent RCU callbacks from executing before scheduler initialized 2011-07-13 08:17:56 -07:00
rcutree.h rcu: Move RCU_BOOST #ifdefs to header file 2011-06-16 16:12:05 -07:00
relay.c
res_counter.c memcg: res_counter_read_u64(): fix potential races on 32-bit machines 2011-03-23 19:46:22 -07:00
resource.c resources: Add lookup_resource() 2011-07-30 21:21:39 +02:00
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.h
rtmutex-tester.c rtmutex: tester: Remove the remaining BKL leftovers 2011-02-22 22:07:22 +01:00
rtmutex.c plist: Remove the need to supply locks to plist heads 2011-07-08 14:02:53 +02:00
rtmutex.h
rwsem.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
sched_autogroup.c Fix common misspellings 2011-03-31 11:26:23 -03:00
sched_autogroup.h sched: Skip autogroup when looking for all rt sched groups 2011-07-01 10:39:08 +02:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c sched/cpupri: Remove the vec->lock 2011-08-14 12:01:03 +02:00
sched_cpupri.h sched/cpupri: Remove the vec->lock 2011-08-14 12:01:03 +02:00
sched_debug.c sched: Get rid of lock_depth 2011-04-24 13:18:38 +02:00
sched_fair.c sched: Remove noop in lowest_flag_domain() 2011-08-14 12:00:46 +02:00
sched_features.h sched: Kill WAKEUP_PREEMPT 2011-08-14 12:00:41 +02:00
sched_idletask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
sched_rt.c sched: Use pushable_tasks to determine next highest prio 2011-08-14 12:00:55 +02:00
sched_stats.h sched: More sched_domain iterations fixes 2011-05-28 17:02:54 +02:00
sched_stoptask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
sched.c sched: fix broken SCHED_RESET_ON_FORK handling 2011-08-14 12:00:43 +02:00
seccomp.c
semaphore.c
signal.c signals: sys_ssetmask/sys_rt_sigsuspend should use set_current_blocked() 2011-07-27 12:53:36 -07:00
smp.c generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts 2011-06-17 10:17:12 +02:00
softirq.c softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
spinlock.c
srcu.c rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status 2011-01-14 04:56:49 -08:00
stacktrace.c stack_trace: Add weak save_stack_trace_regs() 2011-06-14 22:48:52 -04:00
stop_machine.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
sys_ni.c ipc: Add missing sys_ni entries for ipc/compat.c functions 2011-05-20 13:53:02 -07:00
sys.c move RLIMIT_NPROC check from set_user() to do_execve_common() 2011-08-11 11:24:42 -07:00
sysctl_binary.c open-style analog of vfs_path_lookup() 2011-03-14 09:15:28 -04:00
sysctl_check.c sysctl_check: drop dead code 2011-03-23 19:46:51 -07:00
sysctl.c sysctl,rcu: Convert call_rcu(free_head) to kfree 2011-07-20 14:10:18 -07:00
taskstats.c taskstats: add_del_listener() should ignore !valid listeners 2011-08-03 14:25:20 -10:00
test_kprobes.c
time.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-15 18:53:35 -07:00
timeconst.pl
timer.c timers: Consider slack value in mod_timer() 2011-06-03 15:02:32 +02:00
tracepoint.c jump label: Introduce static_branch() interface 2011-04-04 12:48:08 -04:00
tsacct.c
uid16.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
up.c
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
user-return-notifier.c Fix common misspellings 2011-03-31 11:26:23 -03:00
user.c userns: add a user_namespace as creator/owner of uts_namespace 2011-03-23 19:46:59 -07:00
utsname_sysctl.c
utsname.c ns proc: Add support for the uts namespace 2011-05-10 14:35:35 -07:00
wait.c Fix common misspellings 2011-03-31 11:26:23 -03:00
watchdog.c perf, x86: P4 PMU - Introduce event alias feature 2011-07-14 17:25:04 -04:00
workqueue_sched.h
workqueue.c Merge branch 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2011-07-22 15:07:15 -07:00