In UP systems, the idle task is initialized using the init_task
structure from which the command name is taken (currently "swapper").
In SMP systems, one idle task per CPU is forked by the worker thread
from which the task structure is copied. The command name is, therefore,
"kworker/0:0" or "kworker/0:1", if not updated. Since such update was
lacking, all idle tasks in SMP systems were incorrectly named. This
longtime bug was not discovered immediately, because there is no /proc/0
entry - the bug only becomes apparent when tracing is enabled.
This patch sets the command name of the idle tasks in SMP systems to the
name that is used in the INIT_TASK structure suffixed by a slash and the
number of the CPU.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111026211708.768925506@osadl.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The return-value convention for these functions varies depending on
whether they're interruptible or can timeout. It can be a little
confusing--document it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111006192246.GB28026@fieldses.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
powerpc/p3060qds: Add support for P3060QDS board
powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
powerpc/85xx: Make kexec to interate over online cpus
powerpc/fsl_booke: Fix comment in head_fsl_booke.S
powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
powerpc/86xx: Correct Gianfar support for GE boards
powerpc/cpm: Clear muram before it is in use.
drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
powerpc/fsl_msi: add support for "msi-address-64" property
powerpc/85xx: Setup secondary cores PIR with hard SMP id
powerpc/fsl-booke: Fix settlbcam for 64-bit
powerpc/85xx: Adding DCSR node to dtsi device trees
powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
powerpc/85xx: fix PHYS_64BIT selection for P1022DS
powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
powerpc: respect mem= setting for early memory limit setup
powerpc: Update corenet64_smp_defconfig
powerpc: Update mpc85xx/corenet 32-bit defconfigs
...
Fix up trivial conflicts in:
- arch/powerpc/configs/40x/hcu4_defconfig
removed stale file, edited elsewhere
- arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
added opal and gelic drivers vs added ePAPR driver
- drivers/tty/serial/8250.c
moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
llist: Add back llist_add_batch() and llist_del_first() prototypes
sched: Don't use tasklist_lock for debug prints
sched: Warn on rt throttling
sched: Unify the ->cpus_allowed mask copy
sched: Wrap scheduler p->cpus_allowed access
sched: Request for idle balance during nohz idle load balance
sched: Use resched IPI to kick off the nohz idle balance
sched: Fix idle_cpu()
llist: Remove cpu_relax() usage in cmpxchg loops
sched: Convert to struct llist
llist: Add llist_next()
irq_work: Use llist in the struct irq_work logic
llist: Return whether list is empty before adding in llist_add()
llist: Move cpu_relax() to after the cmpxchg()
llist: Remove the platform-dependent NMI checks
llist: Make some llist functions inline
sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
sched: Remove redundant test in check_preempt_tick()
sched: Add documentation for bandwidth control
sched: Return unused runtime on group dequeue
...
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
rcu: Wire up RCU_BOOST_PRIO for rcutree
rcu: Make rcu_torture_boost() exit loops at end of test
rcu: Make rcu_torture_fqs() exit loops at end of test
rcu: Permit rt_mutex_unlock() with irqs disabled
rcu: Avoid having just-onlined CPU resched itself when RCU is idle
rcu: Suppress NMI backtraces when stall ends before dump
rcu: Prohibit grace periods during early boot
rcu: Simplify unboosting checks
rcu: Prevent early boot set_need_resched() from __rcu_pending()
rcu: Dump local stack if cannot dump all CPUs' stacks
rcu: Move __rcu_read_unlock()'s barrier() within if-statement
rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
rcu: Make rcu_implicit_dynticks_qs() locals be correct size
rcu: Eliminate in_irq() checks in rcu_enter_nohz()
nohz: Remove nohz_cpu_mask
rcu: Document interpretation of RCU-lockdep splats
rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
Avoid taking locks from debug prints, this avoids latencies on -rt,
and improves reliability of the debug code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently every sched_class::set_cpus_allowed() implementation has to
copy the cpumask into task_struct::cpus_allowed, this is pointless,
put this copy in the generic code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This task is preparatory for the migrate_disable() implementation, but
stands on its own and provides a cleanup.
It currently only converts those sites required for task-placement.
Kosaki-san once mentioned replacing cpus_allowed with a proper
cpumask_t instead of the NR_CPUS sized array it currently is, that
would also require something like this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
rq's idle_at_tick is set to idle/busy during the timer tick
depending on the cpu was idle or not. This will be used later in the load
balance that will be done in the softirq context (which is a process
context in -RT kernels).
For nohz kernels, for the cpu doing nohz idle load balance on behalf of
all the idle cpu's, its rq->idle_at_tick might have a stale value (which is
recorded when it got the timer tick presumably when it is busy).
As the nohz idle load balancing is also being done at the same place
as the regular load balancing, nohz idle load balancing was bailing out
when it sees rq's idle_at_tick not set.
Thus leading to poor system utilization.
Rename rq's idle_at_tick to idle_balance and set it when someone requests
for nohz idle balance on an idle cpu.
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current use of smp call function to kick the nohz idle balance can deadlock
in this scenario.
1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
data (csd) to the call single queue, cpu-A took a timer interrupt. Actual IPI
to cpu-B to process the call single queue is not yet sent.
2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
and __smp_call_function_single() with nowait will queue the csd to the
cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
as the call single queue was not empty.
3. cpu-A is busy with lot of interrupts
4. Meanwhile cpu-B is entering and exiting idle and noticed that it has
it's rq->nohz_balance_kick set to '1'. So it will go ahead and do the
idle load balancer and clear its rq->nohz_balance_kick.
5. At this point, csd queued as part of the step-2 above is still locked
and waiting to be serviced on cpu-B.
6. cpu-A is still busy with interrupt load and now it got another timer
interrupt and as part of it decided to kick cpu-B for another idle load
balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
above) and does __smp_call_function_single() with the same csd that is
still locked.
7. And we get a deadlock waiting for the csd_lock() in the
__smp_call_function_single().
Main issue here is that cpu-B can service the idle load balancer kick
request from cpu-A even with out receiving the IPI and this lead to
doing multiple __smp_call_function_single() on the same csd leading to
deadlock.
To kick a cpu, scheduler already has the reschedule vector reserved. Use
that mechanism (kick_process()) instead of using the generic smp call function
mechanism to kick off the nohz idle load balancing and avoid the deadlock.
[ This issue is present from 2.6.35+ kernels, but marking it -stable
only from v3.0+ as the proposed fix depends on the scheduler_ipi()
that is introduced recently. ]
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org # v3.0+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On -rt we observed hackbench waking all 400 tasks to a single cpu.
This is because of select_idle_sibling()'s interaction with the new
ipi based wakeup scheme.
The existing idle_cpu() test only checks to see if the current task on
that cpu is the idle task, it does not take already queued tasks into
account, nor does it take queued to be woken tasks into account.
If the remote wakeup IPIs come hard enough, there won't be time to
schedule away from the idle task, and would thus keep thinking the cpu
was in fact idle, regardless of the fact that there were already
several hundred tasks runnable.
We couldn't reproduce on mainline, but there's no reason it couldn't
happen.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the generic llist primitives.
We had a private lockless list implementation in the scheduler in the wake-list
code, now that we have a generic llist implementation that provides all required
operations, switch to it.
This patch is not expected to change any behavior.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David reported:
Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.
Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.
I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).
For example:
[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread: before(0.000081692) after(0.498316431) diff(498234739)
self: before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$
The diff of 'process' should always be >= the diff of 'thread'.
I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.
---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
static pthread_barrier_t barrier;
static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1)
__asm__ __volatile__("" : : : "memory");
return NULL;
}
int main(void)
{
clockid_t process_clock, my_thread_clock, th_clock;
struct timespec process_before, process_after;
struct timespec me_before, me_after;
struct timespec th_before, th_after;
struct timespec sleeptime;
unsigned long diff;
pthread_t th;
int err;
err = clock_getcpuclockid(0, &process_clock);
if (err)
return 1;
err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
if (err)
return 1;
pthread_barrier_init(&barrier, NULL, 2);
err = pthread_create(&th, NULL, chew_cpu, NULL);
if (err)
return 1;
err = pthread_getcpuclockid(th, &th_clock);
if (err)
return 1;
pthread_barrier_wait(&barrier);
err = clock_gettime(process_clock, &process_before);
if (err)
return 1;
err = clock_gettime(my_thread_clock, &me_before);
if (err)
return 1;
err = clock_gettime(th_clock, &th_before);
if (err)
return 1;
sleeptime.tv_sec = 0;
sleeptime.tv_nsec = 500000000;
nanosleep(&sleeptime, NULL);
err = clock_gettime(th_clock, &th_after);
if (err)
return 1;
err = clock_gettime(my_thread_clock, &me_after);
if (err)
return 1;
err = clock_gettime(process_clock, &process_after);
if (err)
return 1;
diff = process_after.tv_nsec - process_before.tv_nsec;
printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
process_before.tv_sec, process_before.tv_nsec,
process_after.tv_sec, process_after.tv_nsec, diff);
diff = th_after.tv_nsec - th_before.tv_nsec;
printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
th_before.tv_sec, th_before.tv_nsec,
th_after.tv_sec, th_after.tv_nsec, diff);
diff = me_after.tv_nsec - me_before.tv_nsec;
printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
me_before.tv_sec, me_before.tv_nsec,
me_after.tv_sec, me_after.tv_nsec, diff);
return 0;
}
This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.
This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().
Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
RCU no longer uses this global variable, nor does anyone else. This
commit therefore removes this variable. This reduces memory footprint
and also removes some atomic instructions and memory barriers from
the dyntick-idle path.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Long ago, using TREE_RCU with PREEMPT would result in "scheduling
while atomic" diagnostics if you blocked in an RCU read-side critical
section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
this diagnostic. This commit therefore adds a replacement diagnostic
based on PROVE_RCU.
Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
used for things that have nothing to do with rcu_dereference(), rename
lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
argument that is a string indicating what is suspicious. This third
argument is passed in from a new third argument to rcu_lockdep_assert().
Update all calls to rcu_lockdep_assert() to add an informative third
argument.
Also, add a pair of rcu_lockdep_assert() calls from within
rcu_note_context_switch(), one complaining if a context switch occurs
in an RCU-bh read-side critical section and another complaining if a
context switch occurs in an RCU-sched read-side critical section.
These are present only if the PROVE_RCU kernel parameter is enabled.
Finally, fix some checkpatch whitespace complaints in lockdep.c.
Again, you must enable PROVE_RCU to see these new diagnostics. But you
are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit c259e01a1e ("sched: Separate the scheduler entry for
preemption") contained a boo-boo wrecking wchan output. It forgot to
put the new schedule() function in the __sched section and thereby
doesn't get properly ignored for things like wchan.
Tested-by: Simon Kirby <sim@hostway.ca>
Cc: stable@kernel.org # 2.6.39+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We want to override the default value of SD_NODES_PER_DOMAIN on ppc64,
so move it into linux/topology.h.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
sched: Fix a memory leak in __sdt_free()
sched: Move blk_schedule_flush_plug() out of __schedule()
sched: Separate the scheduler entry for preemption
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:
$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s
Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s
That's a 37% penalty.
Note that pong is not even in the monitored cgroup.
The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
Performance counter stats for 'sleep 100':
CPU1 <not counted> cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz
The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.
The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.
With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s
Start perf stat with cgroup:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Run pong outside the cgroup:
$ /pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s
The penalty is now less than 2%.
And the results for perf stat are correct:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 <not counted> cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
Now perf stat reports the correct counts for
for the non cgroup event.
If we run pong inside the cgroup, then we also get the
correct counts:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
10.001457237 seconds time elapsed
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no real reason to run blk_schedule_flush_plug() with
interrupts and preemption disabled.
Move it into schedule() and call it when the task is going voluntarily
to sleep. There might be false positives when the task is woken
between that call and actually scheduling, but that's not really
different from being woken immediately after switching away.
This fixes a deadlock in the scheduler where the
blk_schedule_flush_plug() callchain enables interrupts and thereby
allows a wakeup to happen of the task that's going to sleep.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.
To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntary
goes to sleep.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.
We do this only when the quota is current and there is more than
min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.
In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:
no quota generations:
197.47 ms /cgroup/a/cpuacct.usage
199.46 ms /cgroup/a/cpuacct.usage
205.46 ms /cgroup/a/cpuacct.usage
198.46 ms /cgroup/a/cpuacct.usage
208.39 ms /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.
with quota generations [1s increments]:
119.58 ms /cgroup/a/cpuacct.usage
119.65 ms /cgroup/a/cpuacct.usage
119.64 ms /cgroup/a/cpuacct.usage
119.63 ms /cgroup/a/cpuacct.usage
119.60 ms /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from now using previously stranded slack quota. The cost is that this quota
becomes unavailable.
with quota generations and quota return:
200.09 ms /cgroup/a/cpuacct.usage
200.09 ms /cgroup/a/cpuacct.usage
198.09 ms /cgroup/a/cpuacct.usage
200.09 ms /cgroup/a/cpuacct.usage
200.06 ms /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from
previous quota periods (especially on a large machine).
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This change introduces statistics exports for the cpu sub-system, these are
added through the use of a stat file similar to that exported by other
subsystems.
The following exports are included:
nr_periods: number of periods in which execution occurred
nr_throttled: the number of periods above in which execution was throttle
throttled_time: cumulative wall-time that any cpus have been throttled for
this group
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Throttled tasks are invisisble to cpu-offline since they are not eligible for
selection by pick_next_task(). The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq, however this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.
Resolve this by unthrottling offline cpus so that threads can be migrated.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From the perspective of load-balance and shares distribution, throttled
entities should be invisible.
However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present. In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.
Instead, track hierarchicaal throttled state at time of transition. This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.
Also, when an entity leaves a throttled hierarchy we need to advance its
time averaging for shares averaging so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.
We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At the start of each period we refresh the global bandwidth pool. At this time
we must also unthrottle any cfs_rq entities who are now within bandwidth once
more (as quota permits).
Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where this is no
run-time remaining.
Throttled entities are dequeued to prevent scheduling, additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled. A list of a task_group's throttled
entities are maintained on the cfs_bandwidth structure.
Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since quota is managed using a global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is state (from a previous period) should not be
locally consumable.
We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary above.
One catch is that the direction of spread on sched_clock is undefined,
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.
Fortunately we can differentiate these by considering whether the
global deadline has advanced. If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.
Since the RT pool is using a similar timer there's some small refactoring to
share this support.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.
cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.
This patch only assigns and tracks quota, no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add constraints validation for CFS bandwidth hierarchies.
Validate that:
max(child bandwidth) <= parent_bandwidth
In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
This constraint is chosen over sum(child_bandwidth) as notion of over-commit is
valuable within SCHED_OTHER. Some basic code from the RT case is re-factored
for reuse.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.
- The global bandwidth is per task_group, it represents a pool of unclaimed
bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq, this represents allotments from
the global pool bandwidth assigned to a specific cpu.
Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
to consume over period above.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.
The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations. This in turn leads to incorrect idle and
weight-per-task load balance decisions.
This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.
Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
been handled for an RT parent gives birth to a deranged mutant child with
non-RT policy, but RT prio and sched_class.
Move PI leakage protection up, always set priorities and weight, and if the
child is leaving RT class, reset rt_priority to the proper value.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since commit a2d47777 ("sched: fix stale value in average load per task")
the variable rq->avg_load_per_task is no longer required. Remove it.
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
sched: Replace use of entity_key()
sched: Separate group-scheduling code more clearly
sched: Reorder root_domain to remove 64 bit alignment padding
sched: Do not attempt to destroy uninitialized rt_bandwidth
sched: Remove unused function cpu_cfs_rq()
sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
sched, cgroup: Optimize load_balance_fair()
sched: Don't update shares twice on on_rq parent
sched: update correct entity's runtime in check_preempt_wakeup()
xtensa: Use generic config PREEMPT definition
h8300: Use generic config PREEMPT definition
m32r: Use generic PREEMPT config
sched: Skip autogroup when looking for all rt sched groups
sched: Simplify mutex_spin_on_owner()
sched: Remove rcu_read_lock() from wake_affine()
sched: Generalize sleep inside spinlock detection
sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
sched: Isolate preempt counting in its own config option
sched: Remove pointless in_atomic() definition check
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
perf: Remove the nmi parameter from the oprofile_perf backend
x86, perf: Make copy_from_user_nmi() a library function
perf: Remove perf_event_attr::type check
x86, perf: P4 PMU - Fix typos in comments and style cleanup
perf tools: Make test use the preset debugfs path
perf tools: Add automated tests for events parsing
perf tools: De-opt the parse_events function
perf script: Fix display of IP address for non-callchain path
perf tools: Fix endian conversion reading event attr from file header
perf tools: Add missing 'node' alias to the hw_cache[] array
perf probe: Support adding probes on offline kernel modules
perf probe: Add probed module in front of function
perf probe: Introduce debuginfo to encapsulate dwarf information
perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
perf probe: Remove redundant dwarf functions
perf probe: Move strtailcmp to string.c
perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
tracing/kprobe: Update symbol reference when loading module
tracing/kprobes: Support module init function probing
kprobes: Return -ENOENT if probe point doesn't exist
...
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix lockdep_no_validate against IRQ states
mutex: Make mutex_destroy() an inline function
plist: Remove the need to supply locks to plist heads
lockup detector: Fix reference to the non-existent CONFIG_DETECT_SOFTLOCKUP option
Clean up cfs/rt runqueue initialization by moving group scheduling
related code into the corresponding functions.
Also, keep group scheduling as an add-on, so that things are only done
additionally, i. e. remove the init_*_rq() calls from init_tg_*_entry().
(This removes a redundant initalization during sched_init()).
In case of group scheduling rt_rq->highest_prio.curr is now initialized
twice, but adding another #ifdef seems not worth it.
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1310661163-16606-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reorder root_domain to remove 8 bytes of alignment padding on 64 bit
builds, this shrinks the size from 1736 to 1728 bytes, therefore using
one fewer cachelines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1310726492.1977.5.camel@castor.rsk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a task group is to be created and alloc_fair_sched_group() fails,
then the rt_bandwidth of the corresponding task group is not yet
initialized. The caller, sched_create_group(), starts a clean up
procedure which calls free_rt_sched_group() which unconditionally
destroys the not yet initialized rt_bandwidth.
This crashes or hangs the system in lock_hrtimer_base(): UP systems
dereference a NULL pointer, while SMP systems loop endlessly on a
condition that cannot become true.
This patch simply avoids the destruction of rt_bandwidth when the
initialization code path was not reached.
(This was discovered by accident with a custom kernel modification.)
Signed-off-by: Bianca Lutz <sowilo@cs.tu-berlin.de>
Signed-off-by: Jan Schoenherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1310580816-10861-7-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>