Move documentation from semaphore.h to semaphore.c as requested by
Andrew Morton. Also reformat to kernel-doc style and add some more
notes about the implementation.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
By removing the negative values of 'count' and relying on the wait_list to
indicate whether we have any waiters, we can simplify the implementation
by removing the protection against an unlikely race condition. Thanks to
David Howells for his suggestions.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
ACPI currently emulates a timeout for semaphores with calls to
down_trylock and sleep. This produces horrible behaviour in terms of
fairness and excessive wakeups. Now that we have a unified semaphore
implementation, adding a real down_trylock is almost trivial.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Semaphores are no longer performance-critical, so a generic C
implementation is better for maintainability, debuggability and
extensibility. Thanks to Peter Zijlstra for fixing the lockdep
warning. Thanks to Harvey Harrison for pointing out that the
unlikely() was unnecessary.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
This way it checks if the clocks are synchronized between CPUs too.
This might be able to detect slowly drifting TSCs which only
go wrong over longer time.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Call
ts = &per_cpu(tick_cpu_sched, cpu);
and
cpu = smp_processor_id();
once instead of twice.
No functional change done, as changed code runs with local irq off.
Reduces source lines and text size (20bytes on x86_64).
[ akpm@linux-foundation.org: Build fix ]
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_hrtimers() uses double_spin_lock().
This is overcomplicated: except for migrate_hrtimers() we never take 2 locks
at once, and migrate_hrtimers() can use spin_lock_nested().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_timers() uses double_spin_lock().
This all is overcomplicated: except for migrate_timers() we never take 2 locks
at once, and migrate_timers() can use spin_lock_nested().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix sparse warnings like this:
kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one
kernel/posix-cpu-timers.c:1058:21: originally declared here
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add timer list annotations to workqueue.c so we can see the call site
in the timer stats.
Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert all the nanosleep related users of restart_block to the
new nanosleep specific restart_block fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Generic code is not supposed to include irq.h. Replace this include
> by linux/hardirq.h instead and add/replace an include of linux/irq.h
> in asm header files where necessary.
> This change should only matter for architectures that make use of
> GENERIC_CLOCKEVENTS.
> Architectures in question are mips, x86, arm, sh, powerpc, uml and sparc64.
>
> I did some cross compile tests for mips, x86_64, arm, powerpc and sparc64.
> This patch fixes also build breakages caused by the include replacement in
> tick-common.h.
I generally dislike adding optional linux/* includes in asm/* includes -
I'm nervous about this causing include loops.
However, there's a separate point to be discussed here.
That is, what interfaces are expected of every architecture in the kernel.
If generic code wants to be able to set the affinity of interrupts, then
that needs to become part of the interfaces listed in linux/interrupt.h
rather than linux/irq.h.
So what I suggest is this approach instead (against Linus' tree of a
couple of days ago) - we move irq_set_affinity() and irq_can_set_affinity()
to linux/interrupt.h, change the linux/irq.h includes to linux/interrupt.h
and include asm/irq_regs.h where needed (asm/irq_regs.h is supposed to be
rarely used include since not much touches the stacked parent context
registers.)
Build tested on ARM PXA family kernels and ARM's Realview platform
kernels which both use genirq.
[ tglx@linutronix.de: add GENERIC_HARDIRQ dependencies ]
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
When I cleaned up printk() and split up the printk locking logic in
commit 266c2e0abe ("Make printk() console
semaphore accesses sensible") I had incorrectly moved the call to
have_callable_console() outside of the console semaphore.
That was buggy. The console semaphore protects the console_drivers list
that is used by have_callable_console().
Thanks go to Bongani Hlope who saw this as a hang on shutdown and reboot
and bisected the bug to the right commit, and tested this patch. See
http://lkml.org/lkml/2008/4/11/315
Bisected-and-tested-by: Bongani Hlope <bonganilinux@mweb.co.za>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
revert "sched: fix fair sleepers" (e22ecef1d2),
because it is causing audio skipping, see:
http://bugzilla.kernel.org/show_bug.cgi?id=10428
the patch is correct and the real cause of the skipping is not
understood (tracing makes it go away), but time has run out so we'll
revert it and re-try in 2.6.26.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on
each line.
Currently this ID isn't really needed since a hierarchy can be completely
identified by the set of subsystems bound to it, but this is likely to change
in the near future in order to support stateless subsystems and
merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the
need for an API change later.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs. The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.
Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.
More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function. This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else. It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in a single hierarchy
- foo isn't visible as an individually mountable subsystem
As a result there will only ever be one call to foo->create(), at init time;
all processes will stay in this group, and the group will never be mounted on
a visible hierarchy. Any additional effects (e.g. not allocating metadata)
are up to the foo subsystem.
This doesn't handle early_init subsystems (their "disabled" bit isn't set be,
but it could easily be extended to do so if any of the early_init systems
wanted it - I think it would just involve some nastier parameter processing
since it would occur before the command-line argument parser had been run.
Hugh said:
Ballpark figures, I'm trying to get this question out rather than
processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
to the affected paths, booting with cgroup_disable=memory cuts that back to
1% overhead (due to slightly bigger struct page).
I'm no expert on distros, they may have no interest whatever in
CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
without it, or apply the cgroup_disable=memory patches.
Unix bench's execl test result on x86_64 was
== just after boot without mounting any cgroup fs.==
mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6
mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0
==
[lizf@cn.fujitsu.com: fix boot option parsing]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Markers do not mix well with CONFIG_PREEMPT_RCU because it uses
preempt_disable/enable() and not rcu_read_lock/unlock for minimal
intrusiveness. We would need call_sched and sched_barrier primitives.
Currently, the modification (connection and disconnection) of probes
from markers requires changes to the data structure done in RCU-style :
a new data structure is created, the pointer is changed atomically, a
quiescent state is reached and then the old data structure is freed.
The quiescent state is reached once all the currently running
preempt_disable regions are done running. We use the call_rcu mechanism
to execute kfree() after such quiescent state has been reached.
However, the new CONFIG_PREEMPT_RCU version of call_rcu and rcu_barrier
does not guarantee that all preempt_disable code regions have finished,
hence the race.
The "proper" way to do this is to use rcu_read_lock/unlock, but we don't
want to use it to minimize intrusiveness on the traced system. (we do
not want the marker code to call into much of the OS code, because it
would quickly restrict what can and cannot be instrumented, such as the
scheduler).
The temporary fix, until we get call_rcu_sched and rcu_barrier_sched in
mainline, is to use synchronize_sched before each call_rcu calls, so we
wait for the quiescent state in the system call code path. It will slow
down batch marker enable/disable, but will make sure the race is gone.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Silence two kerneldoc warnings.
Warning(kernel/audit.c:1276): No description found for parameter 'string'
Warning(kernel/audit.c:1276): No description found for parameter 'len'
[also fix a typo for bonus points]
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The futex init function is called init(). This is a pain in the neck
when debugging when you code dies in ... init :-)
This renames it to futex_init().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
NOHZ: reevaluate idle sleep length after add_timer_on()
clocksource: revert: use init_timer_deferrable for clocksource_watchdog
relay doesn't reference the pages it adds, however we need a non-NULL
hook or splice_to_pipe() can oops.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I found that relay files can be read by pread(2). I fix it,
for relay files are not capable of seeking.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.
To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.
Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.
Call this function from add_timer_on().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
--
include/linux/sched.h | 6 ++++++
kernel/sched.c | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/timer.c | 10 +++++++++-
3 files changed, 58 insertions(+), 1 deletion(-)
Revert
commit 1077f5a917
Author: Parag Warudkar <parag.warudkar@gmail.com>
Date: Wed Jan 30 13:30:01 2008 +0100
clocksource.c: use init_timer_deferrable for clocksource_watchdog
clocksource_watchdog can use a deferrable timer - reduces wakeups from
idle per second.
The watchdog timer needs to run with the specified interval. Otherwise
it will miss the possible wrap of the watchdog clocksource.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
The printk() logic on when/how to get the console semaphore was
unreadable, this splits the code up into a few helper functions and
makes it easier to follow what is going on.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case we're accounting from a sub-namespace, the tgids reported will not
refer to the right namespace.
Save the pid_namespace we're accounting in on the acct_glbs and use it in
do_acct_process.
Two less :) places using the task_struct.tgid member.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is minor, but dereferencing even current real_parent is not safe on debug
kernels, since the memory, this points to, can be unmapped - RCU protection is
required.
Besides, the tgid field is deprecated and is to be replaced with task_tgid_xxx
call (the 2nd patch), so RCU will be required anyway.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Paul pointed out, the ACCESS_ONCE are not needed because we already have
the explicit surrounding memory barriers.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add comments requested by Andrew.
Updated comments about synchronize_sched(). Since we use call_rcu and
rcu_barrier now, these comments were out of sync with the code.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The printk() can deadlock because it can wake up klogd(), and
task enqueueing will try to read the time in order to set a hrtimer.
Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Debugged-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will be called each time the scheduling domains are rebuild.
Needed for architectures that don't have a static cpu topology.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Needed so it can be called from outside of sched.c.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
TREE_AVG and APPROX_AVG are initial task placement policies that have been
disabled for a long while.. time to remove them.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Pavel Emelyanov <xemul@openvz.org>
This patch is based on the one from Thomas.
The kauditd_thread() calls the netlink_unicast() and passes
the audit_pid to it. The audit_pid, in turn, is received from
the user space and the tool (I've checked the audit v1.6.9)
uses getpid() to pass one in the kernel. Besides, this tool
doesn't bind the netlink socket to this id, but simply creates
it allowing the kernel to auto-bind one.
That's the preamble.
The problem is that netlink_autobind() _does_not_ guarantees
that the socket will be auto-bound to the current pid. Instead
it uses the current pid as a hint to start looking for a free
id. So, in case of conflict, the audit messages can be sent
to a wrong socket. This can happen (it's unlikely, but can be)
in case some task opens more than one netlink sockets and then
the audit one starts - in this case the audit's pid can be busy
and its socket will be bound to another id.
The proposal is to introduce an audit_nlk_pid in audit subsys,
that will point to the netlink socket to send packets to. It
will most often be equal to audit_pid. The socket id can be
got from the skb's netlink CB right in the audit_receive_msg.
The audit_nlk_pid reset to 0 is not required, since all the
decisions are taken based on audit_pid value only.
Later, if the audit tools will bind the socket themselves, the
kernel will have to provide a way to setup the audit_nlk_pid
as well.
A good side effect of this patch is that audit_pid can later
be converted to struct pid, as it is not longer safe to use
pid_t-s in the presence of pid namespaces. But audit code still
uses the tgid from task_struct in the audit_signal_info and in
the audit_filter_syscall.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert commit 1ada5cba6a ("clocksource:
make clocksource watchdog cycle through online CPUs") due to the
regression reported by Gabriel C at
http://lkml.org/lkml/2008/2/24/281
(short vesion: it makes TSC be marked as always unstable on his
machine).
Cc: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Robert Hancock <hancockr@shaw.ca>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Gabriel C <nix.or.die@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently assymetric)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.
Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.
( Also slightly move the preempt_check within try_to_wake_up() - this has
no effect on functionality but allows 'early wakeups' (for still-on-rq
tasks) to be correctly accounted as well.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
split out the affine-wakeup bits.
No code changed:
kernel/sched.o:
text data bss dec hex filename
42521 2858 232 45611 b22b sched.o.before
42521 2858 232 45611 b22b sched.o.after
md5:
9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm
09b31c44e9aff8666f72773dc433e2df sched.o.after.asm
(the md5's changed because stack slots changed and some registers
get scheduled by gcc in a different order - but otherwise the before
and after assembly is instruction for instruction equivalent.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If subbuf_pages was larger than the max number of pages the pipe
buffer will hold, subbuf_splice_actor() would happily go beyond
the array size.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Use the existing calc_delta_mine() calculation for sched_slice(). This
saves a divide and simplifies the code because we share it with the
other /cfs_rq->load users.
It also improves code size:
text data bss dec hex filename
42659 2740 144 45543 b1e7 sched.o.before
42093 2740 144 44977 afb1 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fair sleepers need to scale their latency target down by runqueue
weight. Otherwise busy systems will gain ever larger sleep bonus.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks trashing the cache.
Reduce cache trashing by keeping dependent tasks together by running
newly woken tasks first. However, by not running the leftmost task first
we could starve tasks because the wakee can gain unlimited runtime.
Therefore we only run the wakee if its within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but does alternate server/client task groups.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clear the cached inverse value when updating load. This is needed for
calc_delta_mine() to work correctly when using the rq load.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Current min_vruntime tracking is incorrect and will cause serious
problems when we don't run the leftmost task for some reason.
min_vruntime does two things; 1) it's used to determine a forward
direction when the u64 vruntime wraps, 2) it's used to track the
leftmost vruntime to position newly enqueued tasks from.
The current logic advances min_vruntime whenever the current task's
vruntime advance. Because the current task may pass the leftmost task
still waiting we're failing the second goal. This causes new tasks to be
placed too far ahead and thus penalizes their runtime.
Fix this by making min_vruntime the min_vruntime of the waiting tasks by
tracking it in enqueue/dequeue, and compare against current's vruntime
to obtain the absolute minimum when placing new tasks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.
There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().
When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.
Here is a possible scenario:
CPU0 CPU1
| schedule()
| ->deactivate_task()
| ->idle_balance()
| -->load_balance_newidle()
rt_mutex_setprio() |
| --->double_lock_balance()
*get lock *rel lock
* on_rq=0, ruuning=1 |
* sched_class is changed |
*rel lock *get lock
: |
:
->put_prev_task_rt()
->pick_next_task_fair()
=> panic
The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move 00-INDEX entries to power/00-INDEX (and add entry for
pm_qos_interface.txt).
Update references to moved filenames.
Fix some trailing whitespace.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
There is a problem in the hibernation code that triggers on some NUMA
systems on which pfn_valid() returns 'true' for some PFNs that don't
belong to any zone. Namely, there is a BUG_ON() in
memory_bm_find_bit() that triggers for PFNs not belonging to any
zone and passing the pfn_valid() test. On the affected systems it
triggers when we mark PFNs reported by the platform as not saveable,
because the PFNs in question belong to a region mapped directly using
iorepam() (i.e. the ACPI data area) and they pass the pfn_valid()
test.
Modify memory_bm_find_bit() so that it returns an error if given PFN
doesn't belong to any zone instead of crashing the kernel and ignore
the result returned by it in mark_nosave_pages(), while marking the
"nosave" memory regions.
This doesn't affect the hibernation functionality, as we won't touch
the PFNs in question anyway.
http://bugzilla.kernel.org/show_bug.cgi?id=9966 .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map. This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.
The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.
So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original preemptible-RCU patch put the choice between classic and
preemptible RCU into kernel/Kconfig.preempt, which resulted in build failures
on machines not supporting CONFIG_PREEMPT. This choice was therefore moved to
init/Kconfig, which worked, but placed the choice between classic and
preemptible RCU at the top level, a very obtuse choice indeed.
This patch changes from the Kconfig "choice" mechanism to a pair of booleans,
only one of which (CONFIG_PREEMPT_RCU) is user-visible, and is located in
kernel/Kconfig.preempt, where one would expect it to be. The other
(CONFIG_CLASSIC_RCU) is in init/Kconfig so that it is available to all
architectures, hopefully avoiding build breakage. Thanks to Roman Zippel for
suggesting this approach.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return value convention of module's init functions is 0/-E. Sometimes,
e.g. during forward-porting mistakes happen and buggy module created,
where result of comparison "workqueue != NULL" is propagated all the way up
to sys_init_module. What happens is that some other module created
workqueue in question, our module created it again and module was
successfully loaded.
Or it could be some other bug.
Let's make such mistakes much more visible. In retrospective, such
messages would noticeably shorten some of my head-scratching sessions.
Note, that dump_stack() is just a way to get attention from user. Sample
message:
sys_init_module: 'foo'->init suspiciously returned 1, it should follow 0/-E convention
sys_init_module: loading module anyway...
Pid: 4223, comm: modprobe Not tainted 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #5
Call Trace:
[<ffffffff80254b05>] sys_init_module+0xe5/0x1d0
[<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c9a3ba55 (module: wait for dependent modules doing init.) didn't quite
work because the waiter holds the module lock, meaning that the state of the
module it's waiting for cannot change.
Fortunately, it's fairly simple to update the state outside the lock and do
the wakeup.
Thanks to Jan Glauber for tracking this down and testing (qdio and qeth).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
time: remove obsolete CLOCK_TICK_ADJUST
time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer()
time: prevent the loop in timespec_add_ns() from being optimised away
ntp: use unsigned input for do_div()
We currently set the root-domain online span automatically when the
domain is added to the cpu if the cpu is already a member of
cpu_online_map.
This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning. The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.
This solves the hung reboot regression reported by Andrew Morton and
others.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first version of the ntp_interval/tick_length inconsistent usage patch was
recently merged as bbe4d18ac2http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bbe4d18ac2e058c56adb0cd71f49d9ed3216a405
While the fix did greatly improve the situation, it was correctly pointed out
by Roman that it does have a small bug: If the users change clocksources after
the system has been running and NTP has made corrections, the correctoins made
against the old clocksource will be applied against the new clocksource,
causing error.
The second attempt, which corrects the issue in the NTP_INTERVAL_LENGTH
definition has also made it up-stream as commit
e13a2e61ddhttp://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e13a2e61dd5152f5499d2003470acf9c838eab84
Roman has correctly pointed out that CLOCK_TICK_ADJUST is calculated
based on the PIT's frequency, and isn't really relevant to non-PIT
driven clocksources (that is, clocksources other then jiffies and pit).
This patch reverts both of those changes, and simply removes
CLOCK_TICK_ADJUST.
This does remove the granularity error correction for users of PIT and Jiffies
clocksource users, but the granularity error but for the majority of users, it
should be within the 500ppm range NTP can accommodate for.
For systems that have granularity errors greater then 500ppm, the
"ntp_tick_adj=" boot option can be used to compensate.
[johnstul@us.ibm.com: provided changelog]
[mattilinnanvuori@yahoo.com: maek ntp_tick_adj static]
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Silences WARN_ONs in rcu_enter_nohz() and rcu_exit_nohz(), which appeared
before caused by (repeated) calls to:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 1 > /sys/devices/system/cpu/cpu1/online
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: johnstul@us.ibm.com
Cc: Rafael Wysocki <rjw@sisk.pl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel NTP code shouldn't hand 64-bit *signed* values to do_div(). Make it
instead hand 64-bit unsigned values. This gets rid of a couple of warnings.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In commit ee7c82da83 ("wait_task_stopped:
simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
cast when storing si_code was lost in wait_task_stopped. This leaks the
in-kernel CLD_* values that do not match what userland expects.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch checks if we can set the rt_runtime_us to 0. If there is a
realtime task in the group, we don't want to set the rt_runtime_us as 0
or bad things will happen. (that task wont get any CPU time despite
being TASK_RUNNNG)
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
it was only possible to configure the rt-group scheduling parameters
beyond the default value in a very small range.
that's because div64_64() has a different calling convention than
do_div() :/
fix a few untidies while we are here; sysctl_sched_rt_period may overflow
due to that multiplication, so cast to u64 first. Also that RUNTIME_INF
juggling makes little sense although its an effective NOP.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Function sys_sched_rr_get_interval returns wrong time slice value for
SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sripathi Kodi reported a crash in the -rt kernel:
https://bugzilla.redhat.com/show_bug.cgi?id=435674
this is due to a place that can reschedule a task without holding
the tasks runqueue lock. This was caused by the RT balancing code
that pulls RT tasks to the current run queue and will reschedule the
current task.
There's a slight chance that the pulling of the RT tasks will release
the current runqueue's lock and retake it (in the double_lock_balance).
During this time that the runqueue is released, the current task can
migrate to another runqueue.
In the prio_changed_rt code, after the pull, if the current task is of
lesser priority than one of the RT tasks pulled, resched_task is called
on the current task. If the current task had migrated in that small
window, resched_task will be called without holding the runqueue lock
for the runqueue that the task is on.
This race condition also exists in the mainline kernel and this patch
adds a check to make sure the task hasn't migrated before calling
resched_task.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tested-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kei Tokunaga reported an interactivity problem when moving tasks
between control groups.
Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.
Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mm migration is no longer done in cpuset_update_task_memory_state() so it
can no longer take current->mm->mmap_sem, so fix the obsolete comment.
[ This changed in commit 04c19fa6f1
("cpuset: migrate all tasks in cpuset at once") when the mm migration
was moved from cpuset_update_task_memory_state() to update_nodemask() ]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A change after 2.6.24 broke ndiswrapper by accidentally removing its
access to GPL-only symbols. Revert that change and add comments about
the reasons why ndiswrapper and driverloader are treated in a special
way.
Signed-off-by: Pavel Roskin <proski@gnu.org>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug in regiseter_kretprobe() which does not check rp->kp.symbol_name ==
NULL before calling kprobe_lookup_name.
For maintainability, this introduces kprobe_addr helper function which
resolves addr field. It is used by register_kprobe and register_kretprobe.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_marker() may return NULL, so test for it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant
architectures with kprobes support. This facilitates easy handling of
in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on
kretprobes being present in the kernel.
Thanks to Sam Ravnborg for helping make the patch more lean.
Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory controller has a requirement that while writing values, we need
to use echo -n. This patch fixes the problem and makes the UI more consistent.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation says the default value of notify_on_release of a child
cgroup is inherited from its parent, which is reasonable, but the
implementation just sets the flag disabled.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following commits cause a number of regressions:
commit 58e2d4ca58
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date: Fri Jan 25 21:08:00 2008 +0100
sched: group scheduling, change how cpu load is calculated
commit 6b2d770026
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date: Fri Jan 25 21:08:00 2008 +0100
sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
Namely:
- very frequent wakeups on SMP, reported by PowerTop users.
- cacheline trashing on (large) SMP
- some latencies larger than 500ms
While there is a mergeable patch to fix the latter, the former issues
are not fixable in a manner suitable for .25 (we're at -rc3 now).
Hence we revert them and try again in v2.6.26.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This changes the "freezer" code used by suspend/hibernate in its treatment
of tasks in TASK_STOPPED (job control stop) and TASK_TRACED (ptrace) states.
As I understand it, the intent of the "freezer" is to hold all tasks
from doing anything significant. For this purpose, TASK_STOPPED and
TASK_TRACED are "frozen enough". It's possible the tasks might resume
from ptrace calls (if the tracer were unfrozen) or from signals
(including ones that could come via timer interrupts, etc). But this
doesn't matter as long as they quickly block again while "freezing" is
in effect. Some minor adjustments to the signal.c code make sure that
try_to_freeze() very shortly follows all wakeups from both kinds of
stop. This lets the freezer code safely leave stopped tasks unmolested.
Changing this fixes the longstanding bug of seeing after resuming from
suspend/hibernate your shell report "[1] Stopped" and the like for all
your jobs stopped by ^Z et al, as if you had freshly fg'd and ^Z'd them.
It also removes from the freezer the arcane special case treatment for
ptrace'd tasks, which relied on intimate knowledge of ptrace internals.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
should do this only when the whole process exits.
2. exit_notify() uses "current" as "ignored_task", obviously wrong.
Use ->group_leader instead.
Test case:
void hup(int sig)
{
printf("HUP received\n");
}
void *tfunc(void *arg)
{
sleep(2);
printf("sub-thread exited\n");
return NULL;
}
int main(int argc, char *argv[])
{
if (!fork()) {
signal(SIGHUP, hup);
kill(getpid(), SIGSTOP);
exit(0);
}
pthread_t thr;
pthread_create(&thr, NULL, tfunc, NULL);
sleep(1);
printf("main thread exited\n");
syscall(__NR_exit, 0);
return 0;
}
output:
main thread exited
HUP received
Hangup
With this patch the output is:
main thread exited
sub-thread exited
HUP received
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
p->exit_state != 0 doesn't mean this process is dead, it may have
sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)"
instead.
Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
if the main thread has exited.
However, the new check is not perfect either. There is a window when
exit_notify() drops tasklist and before release_task(). Suppose that
the last (non-leader) thread exits. This means that entire group exits,
but thread_group_empty() is not true yet.
As Eric pointed out, is_global_init() is wrong as well, but I did not
dare to do other changes.
Just for the record, has_stopped_jobs() is absolutely wrong too. But we
can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.
Even with this patch ^Z doesn't play well with the dead main thread.
The task is stopped correctly but do_wait(WSTOPPED) won't see it. This
is another unrelated issue, will be (hopefully) fixed separately.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out the common code in reparent_thread() and exit_notify().
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hi,
While we are looking at the printk issue, I see that its printk'ing the EOE
(end of event) records which is really not something that we need in syslog.
Its really intended for the realtime audit event stream handled by the audit
daemon. So, lets avoid printk'ing that record type.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On the latest kernels if one was to load about 15 rules, set the failure
state to panic, and then run service auditd stop the kernel will panic.
This is because auditd stops, then the script deletes all of the rules.
These deletions are sent as audit messages out of the printk kernel
interface which is already known to be lossy. These will overun the
default kernel rate limiting (10 really fast messages) and will call
audit_panic(). The same effect can happen if a slew of avc's come
through while auditd is stopped.
This can be fixed a number of ways but this patch fixes the problem by
just not panicing if auditd is not running. We know printk is lossy and
if the user chooses to set the failure mode to panic and tries to use
printk we can't make any promises no matter how hard we try, so why try?
At least in this way we continue to get lost message accounting and will
eventually know that things went bad.
The other change is to add a new call to audit_log_lost() if auditd
disappears. We already pulled the skb off the queue and couldn't send
it so that message is lost. At least this way we will account for the
last message and panic if the machine is configured to panic. This code
path should only be run if auditd dies for unforeseen reasons. If
auditd closes correctly audit_pid will get set to 0 and we won't walk
this code path.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix the following compiler warning by using "%zu" as defined in C99.
CC kernel/auditsc.o
kernel/auditsc.c: In function 'audit_log_single_execve_arg':
kernel/auditsc.c:1074: warning: format '%ld' expects type 'long int', but
argument 4 has type 'size_t'
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes a potentially invalid access to a per-CPU variable in
rcu_process_callbacks().
This per-CPU access needs to be done in such a way as to guarantee that
the code using it cannot move to some other CPU before all uses of the
value accessed have completed. Even though this code is currently only
invoked from softirq context, which currrently cannot migrate to some
other CPU, life would be better if this code did not silently make such
an assumption.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes a oops encountered when doing hibernate/resume in presence of
PREEMPT_RCU.
The problem was that the code failed to disable preemption when
accessing a per-CPU variable. This is OK when called from code that
already has preemption disabled, but such is not the case from the
suspend/resume code path.
Reported-by: Dave Young <hidave.darkstar@gmail.com>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kthread_stop() can be called when a 'watchdog' thread is executing after
kthread_should_stop() but before set_task_state(TASK_INTERRUPTIBLE).
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
idle CPU will not progress the RCU through its grace period and a
synchronize_rcu my get stuck. Without this patch I have a box that will
not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine
with this patch.
This patch comes from the -rt kernel where it has been tested for
several months.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
Subject: lockdep: include all lock classes in all_lock_classes
lockdep: increase MAX_LOCK_DEPTH
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
latencytop: change /proc task_struct access method
latencytop: fix memory leak on latency proc file
latencytop: fix kernel panic while reading latency proc file
sched: add declaration of sched_tail to sched.h
sched: fix signedness warnings in sched.c
sched: clean up __pick_last_entity() a bit
sched: remove duplicate code from sched_fair.c
sched: make early bootup sched_clock() use safer
printk recursion detection prepends message to printk_buf and offsets
printk_buf when actual message is printed but it forgets to trim buffer
length accordingly. This can result in overrun in extreme cases. Fix it.
[ mingo@elte.hu:
bug was introduced by me via:
commit 32a7600668
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 25 21:07:58 2008 +0100
printk: make printk more robust by not allowing recursion
]
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add each lock class to the all_lock_classes list when it is
first registered.
Previously, lock classes were added to all_lock_classes when
the lock class was first used. Since one of the uses of the
list is to find unused locks, this didn't work well.
Signed-off-by: Dale Farnsworth <dale@farnsworth.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unsigned long values are always assigned to switch_count,
make it unsigned long.
kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3897:15: expected long *switch_count
kernel/sched.c:3897:15: got unsigned long *<noident>
kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3921:16: expected long *switch_count
kernel/sched.c:3921:16: got unsigned long *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pick_task_entity() duplicates existing code. This functionality can be
easily obtained using rb_last(). Avoid code duplication by using rb_last().
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do not call sched_clock() too early. Not only might rq->idle
not be set up - but pure per-cpu data might not be accessible
either.
this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y.
Tested-by: Tony Luck <tony.luck@gmail.com>
Acked-by: Tony Luck <tony.luck@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg Nesterov and others have pointed out that on some architectures,
the traditional sequence of
set_current_state(TASK_INTERRUPTIBLE);
if (CONDITION)
return;
schedule();
is racy wrt another CPU doing
CONDITION = 1;
wake_up_process(p);
because while set_current_state() has a memory barrier separating
setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION
variable, there is no such memory barrier on the wakeup side.
Now, wake_up_process() does actually take a spinlock before it reads and
sets the task state on the waking side, and on x86 (and many other
architectures) that spinlock is in fact equivalent to a memory barrier,
but that is not generally guaranteed. The write that sets CONDITION
could move into the critical region protected by the runqueue spinlock.
However, adding a smp_wmb() to before the spinlock should now order the
writing of CONDITION wrt the lock itself, which in turn is ordered wrt
the accesses within the spinlock (which includes the reading of the old
state).
This should thus close the race (which probably has never been seen in
practice, but since smp_wmb() is a no-op on x86, it's not like this will
make anything worse either on the most common architecture where the
spinlock already gave the required protection).
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The list head res->tasks gets initialized twice in find_css_set().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cgroup uses unsigned long for subsys bitops, not unsigned long long.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
opts.release_agent is not kfree()ed in all necessary places.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fix:
- comments about need_forkexit_callback
- comments about release agent
- typo and comment style, etc.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these
functions inturn call add/sub_preempt_count(). So we need to refuse user from
inserting probe in to these functions.
This patch disallows user from probing add/sub_preempt_count().
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not all architectures implement futex_atomic_cmpxchg_inatomic(). The default
implementation returns -ENOSYS, which is currently not handled inside of the
futex guts.
Futex PI calls and robust list exits with a held futex result in an endless
loop in the futex code on architectures which have no support.
Fixing up every place where futex_atomic_cmpxchg_inatomic() is called would
add a fair amount of extra if/else constructs to the already complex code. It
is also not possible to disable the robust feature before user space tries to
register robust lists.
Compile time disabling is not a good idea either, as there are already
architectures with runtime detection of futex_atomic_cmpxchg_inatomic support.
Detect the functionality at runtime instead by calling
cmpxchg_futex_value_locked() with a NULL pointer from the futex initialization
code. This is guaranteed to fail, but the call of
futex_atomic_cmpxchg_inatomic() happens with pagefaults disabled.
On architectures, which use the asm-generic implementation or have a runtime
CPU feature detection, a -ENOSYS return value disables the PI/robust features.
On architectures with a working implementation the call returns -EFAULT and
the PI/robust features are enabled.
The relevant syscalls return -ENOSYS and the robust list exit code is blocked,
when the detection fails.
Fixes http://lkml.org/lkml/2008/2/11/149
Originally reported by: Lennart Buytenhek
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the futex init code fails to initialize the futex pseudo file system it
returns early without initializing the hash queues. Should the boot succeed
then a futex syscall which tries to enqueue a waiter on the hashqueue will
crash due to the unitilialized plist heads.
Initialize the hash queues before the filesystem.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the last step of hibernation in the "platform" mode (with the
help of ACPI) we use the suspend code, including the devices'
->suspend() methods, to prepare the system for entering the ACPI S4
system sleep state.
But at least for some devices the operations performed by the
->suspend() callback in that case must be different from its operations
during regular suspend.
For this reason, introduce the new PM event type PM_EVENT_HIBERNATE and
pass it to the device drivers' ->suspend() methods during the last phase
of hibernation, so that they can distinguish this case and handle it as
appropriate. Modify the drivers that handle PM_EVENT_SUSPEND in a
special way and need to handle PM_EVENT_HIBERNATE in the same way.
These changes are necessary to fix a hibernation regression related
to the i915 driver (ref. http://lkml.org/lkml/2008/2/22/488).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to Alexey for the testing and the fix of the fix.
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by
checking if the pages to be copied are marked as present in the
kernel mapping and temporarily marking them as present if that's not
the case. No functional modifications are introduced if
CONFIG_DEBUG_PAGEALLOC is unset.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
The default_disable() function was changed in commit:
76d2160147
genirq: do not mask interrupts by default
It removed the mask function in favour of the default delayed
interrupt disabling. Unfortunately this also broke the shutdown in
free_irq() when the last handler is removed from the interrupt for
those architectures which rely on the default implementations. Now we
can end up with a enabled interrupt line after the last handler was
removed, which can result in spurious interrupts.
Fix this by adding a default_shutdown function, which is only
installed, when the irqchip implementation does provide neither a
shutdown nor a disable function.
[@stable: affected versions: .21 - .24 ]
Pointed-out-by: Michael Hennerich <Michael.Hennerich@analog.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Tested-by: Michael Hennerich <Michael.Hennerich@analog.com>
The functions time_before, time_before_eq, time_after, and
time_after_eq are more robust for comparing jiffies against other
values.
So following patch implements usage of the time_after() macro, defined
at linux/jiffies.h, which deals with wrapping correctly
Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Clearly this was supposed to be an == not an = in the if statement.
This patch also causes us to stop processing execve args once we have
failed rather than continuing to loop on failure over and over and over.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Relative expiry time can get negative, so it should be signed.
Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to
reflect this.
[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
audit_log_d_path() is a d_path() wrapper that is used by the audit code. To
use a struct path in audit_log_d_path() I need to embed it into struct
avc_audit_data.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Use struct path in fs_struct.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This test seems to be unnecessary since we always have rootfs mounted before
calling a usermodehelper.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A CLOCK_REALTIME timer, which has an absolute expiry time less than
the clock realtime offset calls with a negative delta into the clock
events code and triggers the WARN_ON() there.
This is a false positive and needs to be prevented. Check the result
of timer->expires - timer->base->offset right away and return -ETIME
right away.
Thanks to Frans Pop, who reported the problem and tested the fixes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Frans Pop <elendil@planet.nl>
Various user space callers ask for relative timeouts. While we fixed
that overflow issue in hrtimer_start(), the sites which convert
relative user space values to absolute timeouts themself were uncovered.
Instead of putting overflow checks into each place add a function
which does the sanity checking and convert all affected callers to use
it.
Thanks to Frans Pop, who reported the problem and tested the fixes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Frans Pop <elendil@planet.nl>
RCU style multiple probes support for the Linux Kernel Markers. Common case
(one probe) is still fast and does not require dynamic allocation or a
supplementary pointer dereference on the fast path.
- Move preempt disable from the marker site to the callback.
Since we now have an internal callback, move the preempt disable/enable to the
callback instead of the marker site.
Since the callback change is done asynchronously (passing from a handler that
supports arguments to a handler that does not setup the arguments is no
arguments are passed), we can safely update it even if it is outside the
preempt disable section.
- Move probe arm to probe connection. Now, a connected probe is automatically
armed.
Remove MARK_MAX_FORMAT_LEN, unused.
This patch modifies the Linux Kernel Markers API : it removes the probe
"arm/disarm" and changes the probe function prototype : it now expects a
va_list * instead of a "...".
If we want to have more than one probe connected to a marker at a given
time (LTTng, or blktrace, ssytemtap) then we need this patch. Without it,
connecting a second probe handler to a marker will fail.
It allow us, for instance, to do interesting combinations :
Do standard tracing with LTTng and, eventually, to compute statistics
with SystemTAP, or to have a special trigger on an event that would call
a systemtap script which would stop flight recorder tracing.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fastcall always expands to empty, remove it.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This comment caused some consternation during fastcall removal. Make it
truthful.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
sched: rt-group: refure unrunnable tasks
sched: rt-group: clean up the ifdeffery
sched: rt-group: make rt groups scheduling configurable
sched: rt-group: interface
sched: rt-group: deal with PI
sched: fix incorrect irq lock usage in normalize_rt_tasks()
sched: fair-group: separate tg->shares from task_group_lock
hrtimer: more hrtimer_init_sleeper() fallout.
Refuse to accept or create RT tasks in groups that can't run them.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up some of the excessive ifdeffery introduces in the last patch.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the rt group scheduler compile time configurable.
Keep it experimental for now.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
This avoids picking a granularity for the ratio.
Extend the /sys/kernel/uids/<uid>/ interface to allow setting
the group's rt_runtime.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven mentioned the fun case where a lock holding task will be throttled.
Simple fix: allow groups that have boosted tasks to run anyway.
If a runnable task in a throttled group gets boosted the dequeue/enqueue
done by rt_mutex_setprio() is enough to unthrottle the group.
This is ofcourse not quite correct. Two possible ways forward are:
- second prio array for boosted tasks
- boost to a prio ceiling (this would also work for deadline scheduling)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called
from hardirq context through sysrq-n
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The USEC_TO_HZ and HZ_TO_USEC constant sets were mislabelled, with
seriously incorrect results. This among other things manifested
itself as cpufreq not working when a tickless kernel was configured.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Tested-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hrtimer_nanosleep_restart() clears/restores restart_block->fn. This is
pointless and complicates its usage. Note that if sys_restart_syscall()
doesn't actually happen, we have a bogus "pending" restart->fn anyway,
this is harmless.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Spotted by Pavel Emelyanov and Alexey Dobriyan.
compat_sys_nanosleep() implicitly uses hrtimer_nanosleep_restart(), this can't
work. Make a suitable compat_nanosleep_restart() helper.
Introduced by commit c70878b4e0
hrtimer: hook compat_sys_nanosleep up to high res timer code
Also, set ->addr_limit = KERNEL_DS before doing hrtimer_nanosleep(), this func
was changed by the previous patch and now takes the "__user *" parameter.
Thanks to Ingo Molnar for fixing the bug in this patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Spotted by Pavel Emelyanov and Alexey Dobriyan.
hrtimer_nanosleep() sets restart_block->arg1 = rmtp, but this rmtp points to
the local variable which lives in the caller's stack frame. This means that
if sys_restart_syscall() actually happens and it is interrupted as well, we
don't update the user-space variable, but write into the already dead stack
frame.
Introduced by commit 04c227140f
hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier
Change the callers to pass "__user *rmtp" to hrtimer_nanosleep(), and change
hrtimer_nanosleep() to use copy_to_user() to actually update *rmtp.
Small problem remains. man 2 nanosleep states that *rtmp should be written if
nanosleep() was interrupted (it says nothing whether it is OK to update *rmtp
if nanosleep returns 0), but (with or without this patch) we can dirty *rem
even if nanosleep() returns 0.
NOTE: this patch doesn't change compat_sys_nanosleep(), because it has other
bugs. Fixed by the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
include/linux/hrtimer.h | 2 -
kernel/hrtimer.c | 51 +++++++++++++++++++++++++-----------------------
kernel/posix-timers.c | 14 +------------
3 files changed, 30 insertions(+), 37 deletions(-)
clocksource initialization and error accumulation. This corrects a 280ppm
drift seen on some systems using acpi_pm, and affects other clocksources as
well (likely to a lesser degree).
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>