kernel-ark/kernel
Hidetoshi Seto 0cf55e1ec0 sched, cputime: Introduce thread_group_times()
This is a real fix for problem of utime/stime values decreasing
described in the thread:

   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time when the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after adjusted by task_times().

 - When all threads in a thread_group exits, accumulated {u,s}time
   (and also c{u,s}time) in signal struct are added to c{u,s}time
   in signal struct of the group's parent.

So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.

And accounted values are used by:

 - task_times(), to get cputime of a thread:
   This function returns adjusted values that originates from raw
   {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

 - thread_group_cputime(), to get cputime of a thread group:
   This function returns sum of all {u,s}time of living threads in
   the group, plus {u,s}time in the signal struct that is sum of
   adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for
every thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread's exit (= __exit_signal()) not to use task_times().
   It means {u,s}time in signal struct accumulates raw values instead
   of adjusted values.  As the result it makes thread_group_cputime()
   to return pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts "raw" values of thread_group_cputime() to "adjusted"
   values, in same calculation procedure as task_times().

 - Modify group's exit (= wait_task_zombie()) to use this introduced
   thread_group_times().  It make c{u,s}time in signal struct to
   have adjusted values like before this patch.

 - Replace some thread_group_cputime() by thread_group_times().
   This replacements are only applied where conveys the "adjusted"
   cputime to users, and where already uses task_times() near by it.
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch have a positive side effect:

 - Before this patch, if a group contains many short-life threads
   (e.g. runs 0.9ms and not interrupted by ticks), the group's
   cputime could be invisible since thread's cputime was accumulated
   after adjusted: imagine adjustment function as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch it will not happen because the adjustment is
   applied after accumulated.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 17:32:40 +01:00
..
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq genirq: try_one_irq() must be called with irq disabled 2009-11-07 21:44:45 +01:00
power PM / Hibernate: Add newline to load_image() fail path 2009-11-03 11:03:09 +01:00
time headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
trace ftrace: Fix unmatched locking in ftrace_regex_write() 2009-11-04 01:42:10 -05:00
.gitignore
acct.c bsdacct: switch credentials for writing to the accounting file 2009-08-24 11:33:40 +10:00
async.c
audit_tree.c
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h
auditfilter.c
auditsc.c Audit: rearrange audit_context to save 16 bytes per struct 2009-09-24 03:50:26 -04:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
cgroup.c cgroup: fix strstrip() misuse 2009-10-29 07:39:25 -07:00
compat.c
configs.c
cpu.c Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-15 09:19:38 -07:00
cpuset.c cpumask: Partition_sched_domains takes array of cpumask_var_t 2009-11-04 13:16:40 +01:00
cred-internals.h
cred.c creds_are_invalid() needs to be exported for use by modules: 2009-09-23 11:02:26 -07:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c sched, cputime: Introduce thread_group_times() 2009-12-02 17:32:40 +01:00
extable.c
fork.c sched, cputime: Introduce thread_group_times() 2009-12-02 17:32:40 +01:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
futex.c futex: Fix spurious wakeup for requeue_pi really 2009-10-28 20:34:34 +01:00
groups.c
hrtimer.c Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-05 12:04:16 -07:00
hung_task.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
itimer.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
kallsyms.c kallsyms: use new arch_is_kernel_text() 2009-09-23 07:39:30 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.preempt
kexec.c kexec: fix omitting offset in extended crashkernel syntax 2009-07-29 19:10:34 -07:00
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c sched: Remove unused __schedule() declaration 2009-11-04 11:43:42 +01:00
kmod.c Revert "kmod: fix race in usermodehelper code" 2009-09-23 18:12:10 -07:00
kprobes.c const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
ksysfs.c
kthread.c sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c 2009-11-03 07:25:00 +01:00
latencytop.c
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c lockdep: Use cpu_clock() for lockstat 2009-10-09 15:56:44 +02:00
Makefile cgroups: move the cgroup debug subsys into cgroup.c to access internal state 2009-09-24 07:20:57 -07:00
module.c module: fix up CONFIG_KALLSYMS=n build. 2009-10-01 16:11:11 -07:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h
mutex.c
mutex.h
notifier.c
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c
panic.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-08 12:16:35 -07:00
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c perf events: Don't generate events for the idle task when exclude_idle is set 2009-10-23 09:35:02 +02:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c mm: also use alloc_large_system_hash() for the PID hash table 2009-09-22 07:17:38 -07:00
pm_qos_params.c
posix-cpu-timers.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
posix-timers.c time: Introduce CLOCK_REALTIME_COARSE 2009-08-21 21:43:46 +02:00
printk.c printk: add printk_delay to make messages readable for some scenarios 2009-09-23 07:39:28 -07:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee 2009-09-24 07:20:59 -07:00
rcupdate.c rcu: Move rcu_barrier() to rcutree 2009-10-07 08:11:20 +02:00
rcutorture.c rcu: Clean up code to address Ingo's checkpatch feedback 2009-09-23 19:46:30 +02:00
rcutree_plugin.h rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang 2009-10-15 20:33:01 +02:00
rcutree_trace.c rcu: Make hot-unplugged CPU relinquish its own RCU callbacks 2009-10-07 08:11:20 +02:00
rcutree.c rcu: Fix long-grace-period race between forcing and initialization 2009-11-02 16:06:21 +01:00
rcutree.h rcu: Fix long-grace-period race between forcing and initialization 2009-11-02 16:06:21 +01:00
relay.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c walk system ram range 2009-09-23 07:39:41 -07:00
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() 2009-08-06 05:50:21 +02:00
rtmutex.h
rwsem.c
sched_clock.c sched_clock: Fix atomicity/continuity bug by using cmpxchg64() 2009-09-30 22:56:10 +02:00
sched_cpupri.c sched: Add new prio to cpupri before removing old prio 2009-08-02 14:26:09 +02:00
sched_cpupri.h
sched_debug.c sched: Rate-limit newidle 2009-11-04 19:13:48 +01:00
sched_fair.c Merge branch 'sched/urgent' into sched/core 2009-11-26 10:50:42 +01:00
sched_features.h sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_idletask.c sched: Simplify sys_sched_rr_get_interval() system call 2009-09-21 09:53:55 +02:00
sched_rt.c cpumask: Simplify sched_rt.c 2009-11-04 13:16:38 +01:00
sched_stats.h
sched.c sched, cputime: Introduce thread_group_times() 2009-12-02 17:32:40 +01:00
seccomp.c
semaphore.c
signal.c signals: inline __fatal_signal_pending 2009-09-24 07:21:01 -07:00
slow-work.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
smp.c cpumask: remove arch_send_call_function_ipi 2009-09-24 09:34:47 +09:30
softirq.c softirq: add BLOCK_IOPOLL to softirq_to_name 2009-09-17 15:53:44 -04:00
softlockup.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
spinlock.c locking: Allow arch-inlined spinlocks 2009-08-31 18:08:50 +02:00
srcu.c
stacktrace.c
stop_machine.c
sys_ni.c Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-09-24 15:13:11 -07:00
sys.c sched, cputime: Introduce thread_group_times() 2009-12-02 17:32:40 +01:00
sysctl_check.c sysctl: fix false positives when PROC_SYSCTL=n 2009-10-29 07:39:30 -07:00
sysctl.c Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-09-24 07:53:22 -07:00
taskstats.c genetlink: make netns aware 2009-07-12 14:03:27 -07:00
test_kprobes.c
time.c sched, time: Define nsecs_to_jiffies() 2009-11-26 12:59:20 +01:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-23 09:46:15 -07:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c
user.c uids: Prevent tear down race 2009-11-02 16:02:39 +01:00
utsname_sysctl.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
utsname.c
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c Merge branch 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-10-29 08:20:00 -07:00