Commit Graph

16039 Commits

Author SHA1 Message Date
Juri Lelli
0c1061733a rtmutex: Document rt_mutex_adjust_prio_chain()
Parameters and usage of rt_mutex_adjust_prio_chain() are already
documented in Documentation/rt-mutex-design.txt. However, since this
function is called from several paths with different semantics (related
to the arguments), it is handy to have a quick reference directly in
the code.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1368608650-7935-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 09:23:52 +02:00
Stephane Eranian
2b923c8f5d perf/x86: Check branch sampling priv level in generic code
This patch moves commit 7cc23cd to the generic code:

 perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL

The check is now implemented in generic code instead of x86 specific
code. That way we do not have to repeat the test in each arch
supporting branch sampling.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20130521105337.GA2879@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 09:13:54 +02:00
Stephane Eranian
62b8563979 perf: Add sysfs entry to adjust multiplexing interval per PMU
This patch adds /sys/device/xxx/perf_event_mux_interval_ms to ajust
the multiplexing interval per PMU. The unit is milliseconds. Value has
to be >= 1.

In the 4th version, we renamed the sysfs file to be more consistent
with the other /proc/sys/kernel entries for perf_events.

In the 5th version, we handle the reprogramming of the hrtimer using
hrtimer_forward_now(). That way, we sync up to new timer value quickly
(suggested by Jiri Olsa).

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 09:13:51 +02:00
Stephane Eranian
9e6302056f perf: Use hrtimers for event multiplexing
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.

This patch changes the timer source for multiplexing.  It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.

The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.

In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated.  It is stopped when the PMU is not overcommitted.

In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().

The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 09:07:10 +02:00
Jiri Olsa
ab573844e3 perf: Fix hw breakpoints overflow period sampling
The hw breakpoint pmu 'add' function is missing the
period_left update needed for SW events.

The perf HW breakpoint events use the SW events framework
to process the overflow, so it needs to be properly initialized
in the PMU 'add' method.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1367421944-19082-5-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 08:59:54 +02:00
Paul Bolle
4eedb77a9c locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH"
The Kconfig symbols ARCH_INLINE_READ_UNLOCK_IRQ,
ARCH_INLINE_SPIN_UNLOCK_IRQ, and ARCH_INLINE_WRITE_UNLOCK_IRQ were added
in v2.6.33, but have never actually been used. Ingo Molnar spotted that
this is caused by three identical copy/paste erros. Eg, the Kconfig
entry for

    INLINE_READ_UNLOCK_IRQ

has an (optional) dependency on:

    ARCH_INLINE_READ_UNLOCK_BH

were it apparently should depend on:

    ARCH_INLINE_READ_UNLOCK_IRQ

instead. Likewise for the Kconfig entries for INLINE_SPIN_UNLOCK_IRQ and
INLINE_WRITE_UNLOCK_IRQ. Fix these three errors.

This never really caused any real problems as these symbols are set (or
unset) in a group - but it's worth fixing it nevertheless.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1368780693.1350.228.camel@x61.thuisdomein
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 08:50:00 +02:00
Ingo Molnar
d07e75a6e0 Merge branch 'sched/cleanups' into sched/core
Merge reason: these bits, formerly in sched/urgent, are too late for v3.10.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 08:16:02 +02:00
Randy Dunlap
387b8b3e37 auditfilter.c: fix kernel-doc warnings
Fix kernel-doc warnings in kernel/auditfilter.c:

  Warning(kernel/auditfilter.c:1029): Excess function parameter 'loginuid' description in 'audit_receive_filter'
  Warning(kernel/auditfilter.c:1029): Excess function parameter 'sessionid' description in 'audit_receive_filter'
  Warning(kernel/auditfilter.c:1029): Excess function parameter 'sid' description in 'audit_receive_filter'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Linus Torvalds
17fdfd0851 Masami Hiramatsu fixed another bug. This time returning a proper
result in event_enable_func(). After checking the return status
 of try_module_get(), it returned the status of try_module_get(). But
 try_module_get() returns 0 on failure, which is success for
 event_enable_func().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRnQn/AAoJEOdOSU1xswtMVDUH/3rOPX2/sbc817NN+KjXJiQi
 1O8tmiOcaqMh742Df2YWSqXeM5IjARjl/xSZqazpGaDVu6HnMbEeb3Frx9hpzOPu
 VEtBapasrPK6TOYSDfLaUuRsxuzSEsXR4dUexSh3o7f0/b1dY8x0BwiYxz3tz5BS
 x6HX9OptUXUKDrloNC0qlX7ymuWmaeGULsTgCYYORfMe2FRFfvJhoCZFgC6dLw5x
 YTubQuhVyNOD/X5jXM5h9kkUSw70VjGMhlqilyp0YLcnrhFL/QhCi7WR3b3hDwcp
 MUpJyMAaPXlQHs2Q/gh46XldyhULPXamrujx8ISsDDdMQlWsWPTsQgfJ8e4/zsQ=
 =n9S7
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Masami Hiramatsu fixed another bug.  This time returning a proper
  result in event_enable_func().  After checking the return status of
  try_module_get(), it returned the status of try_module_get().

  But try_module_get() returns 0 on failure, which is success for
  event_enable_func()"

* tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Return -EBUSY when event_enable_func() fails to get module
2013-05-24 10:46:55 -07:00
Tejun Heo
75501a6d59 cgroup: update iterators to use cgroup_next_sibling()
This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.

The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped.  Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle.  Comments are updated
accordingly.

This makes the iterators easier to use and will simplify controllers.

Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
2013-05-24 10:55:38 +09:00
Tejun Heo
53fa526174 cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.

It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.

This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.

This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.

Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.

v2: Typo fix as per Serge.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 10:55:38 +09:00
Tejun Heo
bdc7119f1b cgroup: make cgroup_is_removed() static
cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state.  Make it static.

While at it, make it return bool.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-24 10:55:38 +09:00
Tejun Heo
3f33e64f4a Merge branch 'for-3.10-fixes' into for-3.11
Merging to receive 7805d000db ("cgroup: fix a subtle bug in descendant
pre-order walk") so that further iterator updates can build upon it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-24 10:53:09 +09:00
Tejun Heo
7805d000db cgroup: fix a subtle bug in descendant pre-order walk
When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty.  This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.

There's no reason to have the early exit path.  Remove it along with
the later assumption that the subtree isn't empty.  This simplifies
the code a bit and fixes the subtle bug.

While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
2013-05-24 10:50:24 +09:00
Steven Rostedt (Red Hat)
ca1643186d tracing: Fix crash when ftrace=nop on the kernel command line
If ftrace=<tracer> is on the kernel command line, when that tracer is
registered, it will be initiated by tracing_set_tracer() to execute that
tracer.

The nop tracer is just a stub tracer that is used to have no tracer
enabled. It is assigned at early bootup as it is the default tracer.

But if ftrace=nop is on the kernel command line, the registering of the
nop tracer will call tracing_set_tracer() which will try to execute
the nop tracer. But it expects tr->current_trace to be assigned something
as it usually is assigned to the nop tracer. As it hasn't been assigned
to anything yet, it causes the system to crash.

The simple fix is to move the tr->current_trace = nop before registering
the nop tracer. The functionality is still the same as the nop tracer
doesn't do anything anyway.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-23 11:57:25 -04:00
Linus Torvalds
8f05bde9bd Kmemleak now scans all the writable and non-executable module sections
to avoid false positives (previously it was only scanning specific
 sections and missing .ref.data).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.9 (GNU/Linux)
 
 iQIcBAABAgAGBQJRlnWwAAoJEGvWsS0AyF7x5EEP/Rb1yjDKgw6RdBdHCAa4CokX
 gmOwTW2PVLuNqxu3JUE+GQHTAkZnLHFLzT4uSI4CAs0YdrdI+aCnJvy2jK6vZ0WC
 bD66hMxOoiJ4sZSA+mU19Zjwb1pRFWi9+sOrYgrC+AbYd45Y6psshn0kog+HslXa
 y2fv/VKqfSUMRJ+lB2p6jXcwxB1bFm4jcYM8OleKhdbb7QUkAZjftpg83hkTSq3n
 +eHQZxWTaeVubwFDmRQf2nPkixNrSI0ZTbOKgHUBJLvNAsxY2/eE3cvJY17NbfdH
 Hq9o7FmWPyRYrVHUzo5S0LbFev3tGUxLzc53G8DfajXRQbNwtAVhkpfX9vV2toH4
 ze/3dIMboUC+yRR9oH0pdRVwndq8oPtWAAKxfOrXKcm+jue2obtQDuswvEmtaufF
 ez10vF02doPZgjDeXKZY6hO2LeyjSh82opk4oSMmgsBjTBlsXxrelNAbMxHIiSnx
 SClCJUm+0PcAhxyehmOb2N95CmGi0sZd2Nwo0QAOBK/B3gyWxtz5qGQ0iADT5wsH
 fI2KLduzyEKNv5phF5Ct8BdA3p89J64/K9HDnV5dA1W8aRudE8fdEpOHlIb3mbn5
 NsMg4ahhKPOJM1IX+YBSCXxJapgBAumlXwfXzQi3CzUF0iBmRS2enybbLtR/fxU6
 5+o92idKeq9NQwhmGH7v
 =LeOH
 -----END PGP SIGNATURE-----

Merge tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64

Pull kmemleak patches from Catalin Marinas:
 "Kmemleak now scans all the writable and non-executable module sections
  to avoid false positives (previously it was only scanning specific
  sections and missing .ref.data)."

* tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64:
  kmemleak: No need for scanning specific module sections
  kmemleak: Scan all allocated, writeable and not executable module sections
2013-05-18 10:21:32 -07:00
Yinghai Lu
fbe06b7bae x86, range: fix missing merge during add range
Christian found v3.9 does not work with E350 with EFI is enabled.

[    1.658832] Trying to unpack rootfs image as initramfs...
[    1.679935] BUG: unable to handle kernel paging request at ffff88006e3fd000
[    1.686940] IP: [<ffffffff813661df>] memset+0x1f/0xb0
[    1.692010] PGD 1f77067 PUD 1f7a067 PMD 61420067 PTE 0

but early memtest report all memory could be accessed without problem.

early page table is set in following sequence:
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000] init_memory_mapping: [mem 0x6e600000-0x6e7fffff]
[    0.000000] init_memory_mapping: [mem 0x6c000000-0x6e5fffff]
[    0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff]
[    0.000000] init_memory_mapping: [mem 0x6e800000-0x6ea07fff]
but later efi_enter_virtual_mode try set mapping again wrongly.
[    0.010644] pid_max: default: 32768 minimum: 301
[    0.015302] init_memory_mapping: [mem 0x640c5000-0x6e3fcfff]
that means it fails with pfn_range_is_mapped.

It turns out that we have a bug in add_range_with_merge and it does not
merge range properly when new add one fill the hole between two exsiting
ranges. In the case when [mem 0x00100000-0x6bffffff] is the hole between
[mem 0x00000000-0x000fffff] and [mem 0x6c000000-0x6e7fffff].

Fix the add_range_with_merge by calling itself recursively.

Reported-by: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQVofGoSk7q5-0irjkBxemqK729cND4hov-1QCBJDhxpgQ@mail.gmail.com
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-17 11:49:10 -07:00
Steven Rostedt
89c837351d kmemleak: No need for scanning specific module sections
As kmemleak now scans all module sections that are allocated, writable
and non executable, there's no need to scan individual sections that
might reference data.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
2013-05-17 09:53:36 +01:00
Steven Rostedt
06c9494c0e kmemleak: Scan all allocated, writeable and not executable module sections
Instead of just picking data sections by name (names that start
with .data, .bss or .ref.data), use the section flags and scan all
sections that are allocated, writable and not executable. Which should
cover all sections of a module that might reference data.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[catalin.marinas@arm.com: removed unused 'name' variable]
[catalin.marinas@arm.com: collapsed 'if' blocks]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
2013-05-17 09:53:07 +01:00
Linus Torvalds
4a007ed926 Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Three more workqueue regression fixes.

   - Fix unbalanced unlock in trylock failure path of manage_workers().
     This shouldn't happen often in the wild but is possible.

   - While making schedule_work() and friends inline, they become
     unavailable to !GPL modules.  Allow !GPL modules to access basic
     stuff - system_wq and queue_*work_on() - so that schedule_work()
     and friends can be used.

   - During boot, the unbound NUMA support code allocates a cpumask for
     each possible node using alloc_cpumask_var_node(), which ends up
     trying to allocate node-specific memory even for offline nodes
     triggering BUG in the memory alloc code.  Use NUMA_NO_NODE for
     offline nodes."

* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init()
  workqueue: Make schedule_work() available again to non GPL modules
  workqueue: correct handling of the pool spin_lock
2013-05-16 12:03:28 -07:00
Linus Torvalds
ff89acc563 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
 "A couple of fixes for RCU regressions:

   - A boneheaded boolean-logic bug that resulted in excessive delays on
     boot, hibernation and suspend that was reported by Borislav Petkov,
     Bjørn Mork, and Joerg Roedel.  The fix inserts a single "!".

   - A fix for a boot-time splat due to allocating from bootmem too late
     in boot, fix courtesy of Sasha Levin with additional help from
     Yinghai Lu."

* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Don't allocate bootmem from rcu_init()
  rcu: Fix comparison sense in rcu_needs_cpu()
2013-05-16 12:02:07 -07:00
Oleg Nesterov
264b83c07a usermodehelper: check subprocess_info->path != NULL
argv_split(empty_or_all_spaces) happily succeeds, it simply returns
argc == 0 and argv[0] == NULL. Change call_usermodehelper_exec() to
check sub_info->path != NULL to avoid the crash.

This is the minimal fix, todo:

 - perhaps we should change argv_split() to return NULL or change the
   callers.

 - kill or justify ->path[0] check

 - narrow the scope of helper_lock()

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-By: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-16 12:01:11 -07:00
Masami Hiramatsu
6ed0106667 tracing: Return -EBUSY when event_enable_func() fails to get module
Since try_module_get() returns false( = 0) when it fails to
pindown a module, event_enable_func() returns 0 which means
"succeed". This can cause a kernel panic when the entry
is removed, because the event is already released.

This fixes the bug by returning -EBUSY, because the reason
why it fails is that the module is being removed at that time.

Link: http://lkml.kernel.org/r/20130516114848.13508.97899.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-16 11:01:16 -04:00
Tejun Heo
1be0c25da5 workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init()
wq_numa_init() builds per-node cpumasks which are later used to make
unbound workqueues NUMA-aware.  The cpumasks are allocated using
alloc_cpumask_var_node() for all possible nodes.  Unfortunately, on
machines with off-line nodes, this leads to NUMA-aware allocations on
existing bug offline nodes, which in turn triggers BUG in the memory
allocation code.

Fix it by using NUMA_NO_NODE for cpumask allocations for offline
nodes.

  kernel BUG at include/linux/gfp.h:323!
  invalid opcode: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
  Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
  task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
  RIP: 0010:[<ffffffff8117495d>]  [<ffffffff8117495d>] new_slab+0x2ad/0x340
  RSP: 0000:ffff880234603bf8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
  RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
  R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
  FS:  0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Stack:
   ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
   ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
   ffff880237816140 0000000000000000 ffff88023740e1c0
  Call Trace:
   [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2
   [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200
   [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90
   [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be
   [<ffffffff81a0bec8>] init_workqueues+0x64/0x341
   [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0
   [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec
   [<ffffffff815d50de>] kernel_init+0xe/0xf0
   [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0
  Code: 45  84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>
2013-05-15 14:24:24 -07:00
Linus Torvalds
c240a539df This includes a fix to a memory leak when adding filters to traces.
Also, Masami Hiramatsu fixed up some minor bugs that were discovered
 by sparse.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRk+PYAAoJEOdOSU1xswtMgO4H/AlePQ4IEOfXgy43Xx8S7Uew
 FmvHYmUi5l6UuFRLxcVZBsnSqo3i43UGyj9Lm+g1wBwNP3IDEf+t+fVaN4UP18KW
 C+5OoEWyJLe4BlQoVsdBV1+lmivacG3uczvAPY6fibyTqbJN67uzufPLw8ruBMOo
 dIIXWUR1sKRyy9+SF23q0rwf5ZfuavlMnmeTZ6omk0evbVT2q0lbpUeN/V1Rrxmc
 ZUObTyoFvVMvPYnwutGh7+QB/o0LR9C6MPyyFTvz/o9bD9CDXdbTvSQatoYrbEm8
 9/0hpetm+KJpa2M9mf4djeXWCIFX3gBuQ156LEFhS68Ug+HDWQ7Yv9iJQmmR6OE=
 =ZToX
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes a fix to a memory leak when adding filters to traces.

  Also, Masami Hiramatsu fixed up some minor bugs that were discovered
  by sparse."

* tag 'trace-fixes-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Make print_*probe_event static
  tracing/kprobes: Fix a sparse warning for incorrect type in assignment
  tracing/kprobes: Use rcu_dereference_raw for tp->files
  tracing: Fix leaks of filter preds
2013-05-15 14:08:57 -07:00
Linus Torvalds
652df602f8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:

 - Fix for a task exit cleanup race caused by a missing a preempt
   disable

 - Cleanup of the event notification functions with a massive reduction
   of duplicated code

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Factor out auxiliary events notification
  perf: Fix EXIT event notification
2013-05-15 14:07:02 -07:00
Linus Torvalds
cc51bf6e6d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:

 - Cure for not using zalloc in the first place, which leads to random
   crashes with CPUMASK_OFF_STACK.

 - Revert a user space visible change which broke udev

 - Add a missing cpu_online early return introduced by the new full
   dyntick conversions

 - Plug a long standing race in the timer wheel cpu hotplug code.
   Sigh...

 - Cleanup NOHZ per cpu data on cpu down to prevent stale data on cpu
   up.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons
  timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
  tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
  tick: Cleanup NOHZ per cpu data on cpu down
  tick: Use zalloc_cpumask_var for allocating offstack cpumasks
2013-05-15 14:05:17 -07:00
Linus Torvalds
37cae5e249 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:

 - Two fixlets for the fallout of the generic idle task conversion

 - Documentation update

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu/idle: Wrap cpu-idle poll mode within rcu_idle_enter/exit
  idle: Fix hlt/nohlt command-line handling in new generic idle
  kthread: Document ways of reducing OS jitter due to per-CPU kthreads
2013-05-15 14:04:00 -07:00
Masami Hiramatsu
b62fdd97fc tracing/kprobes: Make print_*probe_event static
According to sparse warning, print_*probe_event static because
those functions are not directly called from outside.

Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-15 13:50:24 -04:00
Masami Hiramatsu
3d1fc7b088 tracing/kprobes: Fix a sparse warning for incorrect type in assignment
Fix a sparse warning about the rcu operated pointer is
defined without __rcu address space.

Link: http://lkml.kernel.org/r/20130513115837.6545.23322.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-15 13:50:23 -04:00
Masami Hiramatsu
c02c7e65d9 tracing/kprobes: Use rcu_dereference_raw for tp->files
Use rcu_dereference_raw() for accessing tp->files. Because the
write-side uses rcu_assign_pointer() for memory barrier,
the read-side also has to use rcu_dereference_raw() with
read memory barrier.

Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-15 13:50:22 -04:00
Steven Rostedt (Red Hat)
60705c8946 tracing: Fix leaks of filter preds
Special preds are created when folding a series of preds that
can be done in serial. These are allocated in an ops field of
the pred structure. But they were never freed, causing memory
leaks.

This was discovered using the kmemleak checker:

unreferenced object 0xffff8800797fd5e0 (size 32):
  comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
  hex dump (first 32 bytes):
    00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
    [<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff81120e68>] __kmalloc+0xd7/0x125
    [<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
    [<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
    [<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
    [<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
    [<ffffffff810d50b5>] create_filter+0x4e/0x8b
    [<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
    [<ffffffff8100028d>] do_one_initcall+0xa0/0x137
    [<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
    [<ffffffff814b24b7>] kernel_init+0xe/0xdb
    [<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-15 13:49:18 -04:00
Sasha Levin
615ee5443f rcu: Don't allocate bootmem from rcu_init()
When rcu_init() is called we already have slab working, allocating
bootmem at that point results in warnings and an allocation from
slab.  This commit therefore changes alloc_bootmem_cpumask_var() to
alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
from rcu_init().

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Robin Holt <holt@sgi.com>

[paulmck: convert to zalloc_cpumask_var(), as suggested by Yinghai Lu.]
2013-05-15 10:41:12 -07:00
David Howells
cb65537ee1 Add wait_on_atomic_t() and wake_up_atomic_t()
Add wait_on_atomic_t() and wake_up_atomic_t() to indicate became-zero events on
atomic_t types.  This uses the bit-wake waitqueue table.  The key is set to a
value outside of the number of bits in a long so that wait_on_bit() won't be
woken up accidentally.

What I'm using this for is: in a following patch I add a counter to struct
fscache_cookie to count the number of outstanding operations that need access
to netfs data.  The way this works is:

 (1) When a cookie is allocated, the counter is initialised to 1.

 (2) When an operation wants to access netfs data, it calls atomic_inc_unless()
     to increment the counter before it does so.  If it was 0, then the counter
     isn't incremented, the operation isn't permitted to access the netfs data
     (which might by this point no longer exist) and the operation aborts in
     some appropriate manner.

 (3) When an operation finishes with the netfs data, it decrements the counter
     and if it reaches 0, calls wake_up_atomic_t() on it - the assumption being
     that it was the last blocker.

 (4) When a cookie is released, the counter is decremented and the releaser
     uses wait_on_atomic_t() to wait for the counter to become 0 - which should
     indicate no one is using the netfs data any longer.  The netfs data can
     then be destroyed.

There are some alternatives that I have thought of and that have been suggested
by Tejun Heo:

 (A) Using wait_on_bit() to wait on a bit in the counter.  This doesn't work
     because if that bit happens to be 0 then the wait won't happen - even if
     the counter is non-zero.

 (B) Using wait_on_bit() to wait on a flag elsewhere which is cleared when the
     counter reaches 0.  Such a flag would be redundant and would add
     complexity.

 (C) Adding a waitqueue to fscache_cookie - this would expand that struct by
     several words for an event that happens just once in each cookie's
     lifetime.  Further, cookies are generally per-file so there are likely to
     be a lot of them.

 (D) Similar to (C), but add a pointer to a waitqueue in the cookie instead of
     a waitqueue.  This would add single word per cookie and so would be less
     of an expansion - but still an expansion.

 (E) Adding a static waitqueue to the fscache module.  Generally this would be
     fine, but under certain circumstances many cookies will all get added at
     the same time (eg. NFS umount, cache withdrawal) thereby presenting
     scaling issues.  Note that the wait may be significant as disk I/O may be
     in progress.

So, I think reusing the wait_on_bit() waitqueue set is reasonable.  I don't
make much use of the waitqueue I need on a per-cookie basis, but sometimes I
have a huge flood of the cookies to deal with.

I also don't want to add a whole new set of global waitqueue tables
specifically for the dec-to-0 event if I can reuse the bit tables.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-05-15 13:50:38 +01:00
John Stultz
b4f711ee03 time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons
Kay Sievers noted that the ALWAYS_USE_PERSISTENT_CLOCK config,
which enables some minor compile time optimization to avoid
uncessary code in mostly the suspend/resume path could cause
problems for userland.

In particular, the dependency for RTC_HCTOSYS on
!ALWAYS_USE_PERSISTENT_CLOCK, which avoids setting the time
twice and simplifies suspend/resume, has the side effect
of causing the /sys/class/rtc/rtcN/hctosys flag to always be
zero, and this flag is commonly used by udev to setup the
/dev/rtc symlink to /dev/rtcN, which can cause pain for
older applications.

While the udev rules could use some work to be less fragile,
breaking userland should strongly be avoided. Additionally
the compile time optimizations are fairly minor, and the code
being optimized is likely to be reworked in the future, so
lets revert this change.

Reported-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: stable <stable@vger.kernel.org> #3.9
Cc: Feng Tang <feng.tang@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Link: http://lkml.kernel.org/r/1366828376-18124-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-14 20:54:06 +02:00
Marc Dionne
ad7b1f841f workqueue: Make schedule_work() available again to non GPL modules
Commit 8425e3d5bd ("workqueue: inline trivial wrappers") changed
schedule_work() and schedule_delayed_work() to inline wrappers,
but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
the original functions were EXPORT_SYMBOL.  This has the effect of
changing the licensing requirement for these functions and making
them unavailable to non GPL modules.

Make them available again by removing the restriction on the
required symbols.

Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 11:52:51 -07:00
Joonsoo Kim
8f174b1175 workqueue: correct handling of the pool spin_lock
When we fail to mutex_trylock(), we release the pool spin_lock and do
mutex_lock(). After that, we should regrab the pool spin_lock, but,
regrabbing is missed in current code. So correct it.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 11:48:15 -07:00
Tejun Heo
857a2beb09 cgroup: implement task_cgroup_path_from_hierarchy()
kdbus folks want a sane way to determine the cgroup path that a given
task belongs to on a given hierarchy, which is a reasonble thing to
expect from cgroup core.

Implement task_cgroup_path_from_hierarchy().

v2: Dropped unnecessary NULL check on the return value of
    task_cgroup_from_root() as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <greg@kroah.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <daniel@zonque.org>
2013-05-14 11:42:07 -07:00
Tejun Heo
1a57423166 cgroup: make hierarchy_id use cyclic idr
We want to be able to lookup a hierarchy from its id and cyclic
allocation is a whole lot simpler with idr.  Convert to idr and use
idr_alloc_cyclc().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-05-14 11:42:07 -07:00
Tejun Heo
54e7b4eb15 cgroup: drop hierarchy_id_lock
Now that hierarchy_id alloc / free are protected by the cgroup
mutexes, there's no need for this separate lock.  Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-05-14 11:42:06 -07:00
Tejun Heo
fa3ca07e96 cgroup: refactor hierarchy_id handling
We're planning to converting hierarchy_ida to an idr and use it to
look up hierarchy from its id.  As we want the mapping to happen
atomically with cgroupfs_root registration, this patch refactors
hierarchy_id init / exit so that ida operations happen inside
cgroup_[root_]mutex.

* s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or
  -errno like a normal function.

* Move hierarchy_id initialization from cgroup_root_from_opts() into
  cgroup_mount() block where the root is confirmed to be used and
  being registered while holding both mutexes.

* Split cgroup_drop_id() into cgroup_exit_root_id() and
  cgroup_free_root(), so that ID release can happen before dropping
  the mutexes in cgroup_kill_sb().  The latter expects hierarchy_id to
  be exited before being invoked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-05-14 11:42:05 -07:00
Paul E. McKenney
6faf72834d rcu: Fix comparison sense in rcu_needs_cpu()
Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
callbacks) introduced a bug that can result in excessively long grace
periods.  This bug reverse the senes of the "if" statement checking
for lazy callbacks, so that RCU takes a lazy approach when there are
in fact non-lazy callbacks.  This can result in excessive boot, suspend,
and resume times.

This commit therefore fixes the sense of this "if" statement.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Bjørn Mork <bjorn@mork.no>
Reported-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Bjørn Mork <bjorn@mork.no>
Tested-by: Joerg Roedel <joro@8bytes.org>
2013-05-14 10:53:41 -07:00
Viresh Kumar
0668106ca3 workqueue: Add system wide power_efficient workqueues
This patch adds system wide workqueues aligned towards power saving. This is
done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to
'true'.

tj: updated comments a bit.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 10:50:06 -07:00
Viresh Kumar
cee22a1505 workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
Workqueues can be performance or power-oriented. Currently, most workqueues are
bound to the CPU they were created on. This gives good performance (due to cache
effects) at the cost of potentially waking up otherwise idle cores (Idle from
scheduler's perspective. Which may or may not be physically idle) just to
process some work. To save power, we can allow the work to be rescheduled on a
core that is already awake.

Workqueues created with the WQ_UNBOUND flag will allow some power savings.
However, we don't change the default behaviour of the system.  To enable
power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to
be turned on. This option can also be overridden by the
workqueue.power_efficient boot parameter.

tj: Updated config description and comments.  Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 10:50:06 -07:00
Linus Torvalds
7fb30d2b60 Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "A fix for a workqueue_congested() regression that broke fscache"

* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number
2013-05-14 09:06:29 -07:00
Tirupathi Reddy
42a5cf46cd timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
An inactive timer's base can refer to a offline cpu's base.

In the current code, cpu_base's lock is blindly reinitialized each
time a CPU is brought up. If a CPU is brought online during the period
that another thread is trying to modify an inactive timer on that CPU
with holding its timer base lock, then the lock will be reinitialized
under its feet. This leads to following SPIN_BUG().

<0> BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466
<0> lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1
<4> [<c0013dc4>] (unwind_backtrace+0x0/0x11c) from [<c026e794>] (do_raw_spin_unlock+0x40/0xcc)
<4> [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) from [<c076c160>] (_raw_spin_unlock+0x8/0x30)
<4> [<c076c160>] (_raw_spin_unlock+0x8/0x30) from [<c009b858>] (mod_timer+0x294/0x310)
<4> [<c009b858>] (mod_timer+0x294/0x310) from [<c00a5e04>] (queue_delayed_work_on+0x104/0x120)
<4> [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) from [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c)
<4> [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) from [<c04d8780>] (sdhci_disable+0x40/0x48)
<4> [<c04d8780>] (sdhci_disable+0x40/0x48) from [<c04bf300>] (mmc_release_host+0x4c/0xb0)
<4> [<c04bf300>] (mmc_release_host+0x4c/0xb0) from [<c04c7aac>] (mmc_sd_detect+0x90/0xfc)
<4> [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) from [<c04c2504>] (mmc_rescan+0x7c/0x2c4)
<4> [<c04c2504>] (mmc_rescan+0x7c/0x2c4) from [<c00a6a7c>] (process_one_work+0x27c/0x484)
<4> [<c00a6a7c>] (process_one_work+0x27c/0x484) from [<c00a6e94>] (worker_thread+0x210/0x3b0)
<4> [<c00a6e94>] (worker_thread+0x210/0x3b0) from [<c00aad9c>] (kthread+0x80/0x8c)
<4> [<c00aad9c>] (kthread+0x80/0x8c) from [<c000ea80>] (kernel_thread_exit+0x0/0x8)

As an example, this particular crash occurred when CPU #3 is executing
mod_timer() on an inactive timer whose base is refered to offlined CPU
#2.  The code locked the timer_base corresponding to CPU #2. Before it
could proceed, CPU #2 came online and reinitialized the spinlock
corresponding to its base. Thus now CPU #3 held a lock which was
reinitialized. When CPU #3 finally ended up unlocking the old cpu_base
corresponding to CPU #2, we hit the above SPIN_BUG().

CPU #0		CPU #3				       CPU #2
------		-------				       -------
.....		 ......				      <Offline>
		mod_timer()
		 lock_timer_base
		   spin_lock_irqsave(&base->lock)

cpu_up(2)	 .....				        ......
							init_timers_cpu()
....		 .....				    	spin_lock_init(&base->lock)
.....		   spin_unlock_irqrestore(&base->lock)  ......
		   <spin_bug>

Allocation of per_cpu timer vector bases is done only once under
"tvec_base_done[]" check. In the current code, spinlock_initialization
of base->lock isn't under this check. When a CPU is up each time the
base lock is reinitialized. Move base spinlock initialization under
the check.

Signed-off-by: Tirupathi Reddy <tirupath@codeaurora.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-14 17:59:18 +02:00
Srivatsa S. Bhat
b47430d3ad rcu/idle: Wrap cpu-idle poll mode within rcu_idle_enter/exit
Bjørn Mork reported the following warning when running powertop.

[   49.289034] ------------[ cut here ]------------
[   49.289055] WARNING: at kernel/rcutree.c:502 rcu_eqs_exit_common.isra.48+0x3d/0x125()
[   49.289244] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-bisect-rcu-warn+ #107
[   49.289251]  ffffffff8157d8c8 ffffffff81801e28 ffffffff8137e4e3 ffffffff81801e68
[   49.289260]  ffffffff8103094f ffffffff81801e68 0000000000000000 ffff88023afcd9b0
[   49.289268]  0000000000000000 0140000000000000 ffff88023bee7700 ffffffff81801e78
[   49.289276] Call Trace:
[   49.289285]  [<ffffffff8137e4e3>] dump_stack+0x19/0x1b
[   49.289293]  [<ffffffff8103094f>] warn_slowpath_common+0x62/0x7b
[   49.289300]  [<ffffffff8103097d>] warn_slowpath_null+0x15/0x17
[   49.289306]  [<ffffffff810a9006>] rcu_eqs_exit_common.isra.48+0x3d/0x125
[   49.289314]  [<ffffffff81079b49>] ? trace_hardirqs_off_caller+0x37/0xa6
[   49.289320]  [<ffffffff810a9692>] rcu_idle_exit+0x85/0xa8
[   49.289327]  [<ffffffff8107076e>] trace_cpu_idle_rcuidle+0xae/0xff
[   49.289334]  [<ffffffff810708b1>] cpu_startup_entry+0x72/0x115
[   49.289341]  [<ffffffff813689e5>] rest_init+0x149/0x150
[   49.289347]  [<ffffffff8136889c>] ? csum_partial_copy_generic+0x16c/0x16c
[   49.289355]  [<ffffffff81a82d34>] start_kernel+0x3f0/0x3fd
[   49.289362]  [<ffffffff81a8274c>] ? repair_env_string+0x5a/0x5a
[   49.289368]  [<ffffffff81a82481>] x86_64_start_reservations+0x2a/0x2c
[   49.289375]  [<ffffffff81a82550>] x86_64_start_kernel+0xcd/0xd1
[   49.289379] ---[ end trace 07a1cc95e29e9036 ]---

The warning is that 'rdtp->dynticks' has an unexpected value, which roughly
translates to - the calls to rcu_idle_enter() and rcu_idle_exit() were not
made in the correct order, or otherwise messed up.

And Bjørn's painstaking debugging indicated that this happens when the idle
loop enters the poll mode. Looking at the poll function cpu_idle_poll(), and
the implementation of trace_cpu_idle_rcuidle(), the problem becomes very clear:
cpu_idle_poll() lacks calls to rcu_idle_enter/exit(), and trace_cpu_idle_rcuidle()
calls them in the reverse order - first rcu_idle_exit(), and then rcu_idle_enter().
Hence the even/odd alternative sequencing of rdtp->dynticks goes for a toss.

And powertop readily triggers this because powertop uses the idle-tracing
infrastructure extensively.

So, to fix this, wrap the code in cpu_idle_poll() within rcu_idle_enter/exit(),
so that it blends properly with the calls inside trace_cpu_idle_rcuidle() and
thus get the function ordering right.

Reported-and-tested-by: Bjørn Mork <bjorn@mork.no>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/519169BF.4080208@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-14 17:43:29 +02:00
Thomas Gleixner
f7ea0fd639 tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
commit 5b39939a4 (nohz: Move ts->idle_calls incrementation into strict
idle logic) moved code out of tick_nohz_stop_sched_tick() and missed
to bail out when the cpu is offline. That's causing subsequent
failures as an offline CPU is supposed to die and not to fiddle with
nohz magic.

Return false in can_stop_idle_tick() if the cpu is offline.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305132138160.2863@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-14 17:40:31 +02:00
Li Zefan
d6cbf35dac cgroup: initialize xattr before calling d_instantiate()
cgroup_create_file() calls d_instantiate(), which may decide to look
at the xattrs on the file. Smack always does this and SELinux can be
configured to do so.

But cgroup_add_file() didn't initialize xattrs before calling
cgroup_create_file(), which finally leads to dereferencing NULL
dentry->d_fsdata.

This bug has been there since cgroup xattr was introduced.

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14 08:24:15 -07:00
Colin Cross
a2d5f1f5d9 sigtimedwait: use freezable blocking call
Avoid waking up every thread sleeping in a sigtimedwait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:23 +02:00
Colin Cross
b0f8c44f30 nanosleep: use freezable blocking call
Avoid waking up every thread sleeping in a nanosleep call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:23 +02:00
Colin Cross
56467c7697 futex: use freezable blocking call
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Colin Cross
613f5d13b5 freezer: skip waking up tasks with PF_FREEZER_SKIP set
Android goes through suspend/resume very often (every few seconds when
on a busy wifi network with the screen off), and a significant portion
of the energy used to go in and out of suspend is spent in the
freezer.  If a task has called freezer_do_not_count(), don't bother
waking it up.  If it happens to wake up later it will call
freezer_count() and immediately enter the refrigerator.

Combined with patches to convert freezable helpers to use
freezer_do_not_count() and convert common sites where idle userspace
tasks are blocked to use the freezable helpers, this reduces the
time and energy required to suspend and resume.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Colin Cross
18ad0c6297 freezer: shorten freezer sleep time using exponential backoff
All tasks can easily be frozen in under 10 ms, switch to using
an initial 1 ms sleep followed by exponential backoff until
8 ms.  Also convert the printed time to ms instead of centiseconds.

Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Colin Cross
1b1d2fb444 lockdep: remove task argument from debug_check_no_locks_held
The only existing caller to debug_check_no_locks_held calls it
with 'current' as the task, and the freezer needs to call
debug_check_no_locks_held but doesn't already have a current
task pointer, so remove the argument.  It is already assuming
that the current task is relevant by dumping the current stack
trace as part of the warning.

This was originally part of 6aa9707099 (lockdep: check that
no locks held at freeze time) which was reverted in
dbf520a9d7.

Original-author: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:21 +02:00
Thomas Gleixner
4b0c0f294f tick: Cleanup NOHZ per cpu data on cpu down
Prarit reported a crash on CPU offline/online. The reason is that on
CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
up. If at cpu online an interrupt happens before the per cpu tick
device is registered the irq_enter() check potentially sees stale data
and dereferences a NULL pointer.

Cleanup the data after the cpu is dead.

Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Galbraith <bitbucket@online.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-12 12:20:09 +02:00
Linus Torvalds
26b840ae5d The majority of these changes are from Masami Hiramatsu bringing
kprobes up to par with the latest changes to ftrace (multi buffering
 and the new function probes).
 
 He also discovered and fixed some bugs in doing so. When pulling in his
 patches, I also found a few minor bugs as well and fixed them.
 
 This also includes a compile fix for some archs that select the ring buffer
 but not tracing.
 
 I based this off of the last patch you took from me that fixed the merge
 conflict error, as that was the commit that had all the changes I needed
 for this set of changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRjYnJAAoJEOdOSU1xswtMg9EH/iFs438FgrNMk2ZdQftmqcqA
 cqcactHo1mmoHjAoLZT/oDBjEThhVUuqzMXrFRutSYcTh4PsQEC3arX0mpsC+T12
 UEEV/tZS3TXH+GXEyrOit/O3kzntQcDHwJDV4+0n80IrJmw4IDZbnV3R8DWjS6wp
 so+dq0A1pwehcG/upgpw1oTKsGv1G/p6vyf968B6W44icHEClLiph4JE2kzE6D3r
 fzSpOLaQoBEvwIRf6xRKxi240VqIItXwfG7pwNpPpSC37gRLzm74zGr+Sj93/k1y
 pARbZ/5XO7/pcVYQYupErRAoV5in+QMZ67k5G1vQIvyOS9r039catbQf/7PkzcI=
 =EZCE
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing/kprobes update from Steven Rostedt:
 "The majority of these changes are from Masami Hiramatsu bringing
  kprobes up to par with the latest changes to ftrace (multi buffering
  and the new function probes).

  He also discovered and fixed some bugs in doing so.  When pulling in
  his patches, I also found a few minor bugs as well and fixed them.

  This also includes a compile fix for some archs that select the ring
  buffer but not tracing.

  I based this off of the last patch you took from me that fixed the
  merge conflict error, as that was the commit that had all the changes
  I needed for this set of changes."

* tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Support soft-mode disabling
  tracing/kprobes: Support ftrace_event_file base multibuffer
  tracing/kprobes: Pass trace_probe directly from dispatcher
  tracing/kprobes: Increment probe hit-count even if it is used by perf
  tracing/kprobes: Use bool for retprobe checker
  ftrace: Fix function probe when more than one probe is added
  ftrace: Fix the output of enabled_functions debug file
  ftrace: Fix locking in register_ftrace_function_probe()
  tracing: Add helper function trace_create_new_event() to remove duplicate code
  tracing: Modify soft-mode only if there's no other referrer
  tracing: Indicate enabled soft-mode in enable file
  tracing/kprobes: Fix to increment return event probe hit-count
  ftrace: Cleanup regex_lock and ftrace_lock around hash updating
  ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
  ftrace: Have ftrace_regex_write() return either read or error
  tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
  tracing: Don't succeed if event_enable_func did not register anything
  ring-buffer: Select IRQ_WORK
2013-05-11 17:04:59 -07:00
Linus Torvalds
c4cc75c332 Merge git://git.infradead.org/users/eparis/audit
Pull audit changes from Eric Paris:
 "Al used to send pull requests every couple of years but he told me to
  just start pushing them to you directly.

  Our touching outside of core audit code is pretty straight forward.  A
  couple of interface changes which hit net/.  A simple argument bug
  calling audit functions in namei.c and the removal of some assembly
  branch prediction code on ppc"

* git://git.infradead.org/users/eparis/audit: (31 commits)
  audit: fix message spacing printing auid
  Revert "audit: move kaudit thread start from auditd registration to kaudit init"
  audit: vfs: fix audit_inode call in O_CREAT case of do_last
  audit: Make testing for a valid loginuid explicit.
  audit: fix event coverage of AUDIT_ANOM_LINK
  audit: use spin_lock in audit_receive_msg to process tty logging
  audit: do not needlessly take a lock in tty_audit_exit
  audit: do not needlessly take a spinlock in copy_signal
  audit: add an option to control logging of passwords with pam_tty_audit
  audit: use spin_lock_irqsave/restore in audit tty code
  helper for some session id stuff
  audit: use a consistent audit helper to log lsm information
  audit: push loginuid and sessionid processing down
  audit: stop pushing loginid, uid, sessionid as arguments
  audit: remove the old depricated kernel interface
  audit: make validity checking generic
  audit: allow checking the type of audit message in the user filter
  audit: fix build break when AUDIT_DEBUG == 2
  audit: remove duplicate export of audit_enabled
  Audit: do not print error when LSMs disabled
  ...
2013-05-11 14:29:11 -07:00
Tejun Heo
d325185916 workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number
df2d5ae499 ("workqueue: map an unbound workqueues to multiple per-node
pool_workqueues") made unbound workqueues to map to multiple per-node
pool_workqueues and accordingly updated workqueue_contested() so that,
for unbound workqueues, it maps the specified @cpu to the NUMA node
number to obtain the matching pool_workqueue to query the congested
state.

Before this change, workqueue_congested() ignored @cpu for unbound
workqueues as there was only one pool_workqueue and some users
(fscache) called it with WORK_CPU_UNBOUND.  After the commit, this
causes the following oops as WORK_CPU_UNBOUND gets translated to
garbage by cpu_to_node().

  BUG: unable to handle kernel paging request at ffff8803598d98b8
  IP: [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
  PGD 2421067 PUD 0
  Oops: 0000 [#1] SMP
  CPU: 1 PID: 2689 Comm: cat Tainted: GF            3.9.0-fsdevel+ #4
  task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000
  RIP: 0010:[<ffffffff81043b7e>]  [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
  RSP: 0018:ffff880025807ad8  EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003
  RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0
  RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898
  R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97
  R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0
  FS:  00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Stack:
   ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce
   ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0
   ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8
  Call Trace:
   [<ffffffff810442ce>] workqueue_congested+0xb7/0x12c
   [<ffffffffa00810b0>] fscache_enqueue_object+0xb2/0xe8 [fscache]
   [<ffffffffa007facd>] __fscache_acquire_cookie+0x3b9/0x56c [fscache]
   [<ffffffffa00ad8fe>] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs]
   [<ffffffffa009e112>] do_open+0x9/0xd [nfs]
   [<ffffffff810e804a>] do_dentry_open+0x175/0x24b
   [<ffffffff810e8298>] finish_open+0x41/0x51

Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Howells <dhowells@redhat.com>
Tested-and-Acked-by: David Howells <dhowells@redhat.com>
2013-05-10 11:10:17 -07:00
Linus Torvalds
3644bc2ec7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull stray syscall bits from Al Viro:
 "Several syscall-related commits that were missing from the original"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
  unicore32: just use mmap_pgoff()...
  unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
  x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
2013-05-10 09:21:05 -07:00
Linus Torvalds
f741df1f3a - Artem's removal of dead code continues (RPX, MBX860)
- Two krealloc() abuse fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iEYEABECAAYFAlGLydsACgkQdwG7hYl686O3ZwCfYIJUdE/VADo53j3SwDbCEa15
 pMgAnA13t6dKgC74Xqk5dd65KKaw64WB
 =nbL9
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6

Pull misc fixes from David Woodhouse:
 "This is some miscellaneous cleanups that don't really belong anywhere
  else (or were ignored), that have been sitting in linux-next for some
  time.  Two of them are fixes resulting from my audit of krealloc()
  usage that don't seem to have elicited any response when I posted
  them, and the other three are patches from Artem removing dead code."

* tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6:
  pcmcia: remove RPX board stuff
  m68k: remove rpxlite stuff
  pcmcia: remove Motorola MBX860 support
  params: Fix potential memory leak in add_sysfs_param()
  dell-laptop: Fix krealloc() misuse in parse_da_table()
2013-05-10 09:09:47 -07:00
Nathan Zimmer
424c93fe4c sched: Use this_rq() helper
It is a few instructions more efficent to and slightly more
readable to use this_rq()-> instead of cpu_rq(smp_processor_id())-> .

Size comparison of kernel/sched/fair.o:

   text    data     bss     dec     hex filename
  27972     122      26   28120    6dd8 fair.o.before
  27956     122      26   28104    6dc8 fair.o.after

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1368116643-87971-1-git-send-email-nzimmer@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-10 10:35:56 +02:00
Masami Hiramatsu
b8820084f2 tracing/kprobes: Support soft-mode disabling
Support soft-mode disabling on kprobe-based dynamic events.
Soft-disabling is just ignoring recording if the soft disabled
flag is set.

Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:22:16 -04:00
Masami Hiramatsu
41a7dd420c tracing/kprobes: Support ftrace_event_file base multibuffer
Support multi-buffer on kprobe-based dynamic events by
using ftrace_event_file.

Link: http://lkml.kernel.org/r/20130509054449.30398.88343.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:21:47 -04:00
Masami Hiramatsu
2b106aabe6 tracing/kprobes: Pass trace_probe directly from dispatcher
Pass the pointer of struct trace_probe directly from probe
dispatcher to handlers. This removes redundant container_of
macro uses. Same thing has already done in trace_uprobe.

Link: http://lkml.kernel.org/r/20130509054441.30398.69112.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:19:48 -04:00
Masami Hiramatsu
48182bd226 tracing/kprobes: Increment probe hit-count even if it is used by perf
Increment probe hit-count for profiling even if it is used
by perf tool. Same thing has already done in trace_uprobe.

Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:18:44 -04:00
Masami Hiramatsu
db02038f4e tracing/kprobes: Use bool for retprobe checker
Use bool instead of int for kretprobe checker.

Link: http://lkml.kernel.org/r/20130509054431.30398.38561.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:17:35 -04:00
Steven Rostedt (Red Hat)
19dd603e45 ftrace: Fix function probe when more than one probe is added
When the first function probe is added and the function tracer
is updated the functions are modified to call the probe.
But when a second function is added, it updates the function
records to have the second function also update, but it fails
to update the actual function itself.

This prevents the second (or third or forth and so on) probes
from having their functions called.

  # echo vfs_symlink:enable_event:sched:sched_switch > set_ftrace_filter
  # echo vfs_unlink:enable_event:sched:sched_switch > set_ftrace_filter
  # cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
  # touch /tmp/a
  # rm /tmp/a
  # cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
  # ln -s /tmp/a
  # cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 414/414   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
           <idle>-0     [000] d..3  2847.923031: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=2786 next_prio=120
            <...>-3114  [001] d..4  2847.923035: sched_switch: prev_comm=ln prev_pid=3114 prev_prio=120 prev_state=x ==> next_comm=swapper/1 next_pid=0 next_prio=120
             bash-2786  [000] d..3  2847.923535: sched_switch: prev_comm=bash prev_pid=2786 prev_prio=120 prev_state=S ==> next_comm=kworker/0:1 next_pid=34 next_prio=120
      kworker/0:1-34    [000] d..3  2847.923552: sched_switch: prev_comm=kworker/0:1 prev_pid=34 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
           <idle>-0     [002] d..3  2847.923554: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=2783 next_prio=120
             sshd-2783  [002] d..3  2847.923660: sched_switch: prev_comm=sshd prev_pid=2783 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120

Still need to update the functions even though the probe itself
does not need to be registered again when added a new probe.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:16:27 -04:00
Steven Rostedt (Red Hat)
23ea9c4dda ftrace: Fix the output of enabled_functions debug file
The enabled_functions debugfs file was created to be able to see
what functions have been modified from nops to calling a tracer.

The current method uses the counter in the function record.
As when a ftrace_ops is registered to a function, its count
increases. But that doesn't mean that the function is actively
being traced. /proc/sys/kernel/ftrace_enabled can be set to zero
which would disable it, as well as something can go wrong and
we can think its enabled when only the counter is set.

The record's FTRACE_FL_ENABLED flag is set or cleared when its
function is modified. That is a much more accurate way of knowing
what function is enabled or not.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:16:16 -04:00
Steven Rostedt (Red Hat)
5ae0bf5972 ftrace: Fix locking in register_ftrace_function_probe()
The iteration of the ftrace function list and the call to
ftrace_match_record() need to be protected by the ftrace_lock.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:15:30 -04:00
Steven Rostedt (Red Hat)
da511bf33e tracing: Add helper function trace_create_new_event() to remove duplicate code
Both __trace_add_new_event() and __trace_early_add_new_event() do
basically the same thing, except that __trace_add_new_event() does
a little more.

Instead of having duplicate code between the two functions, add
a helper function trace_create_new_event() that both can use.
This will help against having bugs fixed in one function but not
the other.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:14:49 -04:00
Masami Hiramatsu
1cf4c0732d tracing: Modify soft-mode only if there's no other referrer
Modify soft-mode flag only if no other soft-mode referrer
(currently only the ftrace triggers) by using a reference
counter in each ftrace_event_file.

Without this fix, adding and removing several different
enable/disable_event triggers on the same event clear
soft-mode bit from the ftrace_event_file. This also
happens with a typo of glob on setting triggers.

e.g.

 # echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter
 # cat events/net/netif_rx/enable
 0*
 # echo typo_func:enable_event:net:netif_rx > set_ftrace_filter
 # cat events/net/netif_rx/enable
 0
 # cat set_ftrace_filter
 #### all functions enabled ####
 vfs_symlink:enable_event:net:netif_rx:unlimited

As above, we still have a trigger, but soft-mode is gone.

Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:14:25 -04:00
Masami Hiramatsu
30052170dc tracing: Indicate enabled soft-mode in enable file
Indicate enabled soft-mode event as "1*" in "enable" file
for each event, because it can be soft-disabled when disable_event
trigger is hit.

Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:14:07 -04:00
Masami Hiramatsu
cce2c8f267 tracing/kprobes: Fix to increment return event probe hit-count
Fix to increment probe hit-count for function return event.

Link: http://lkml.kernel.org/r/20130509054424.30398.34058.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:13:51 -04:00
Masami Hiramatsu
3f2367ba7c ftrace: Cleanup regex_lock and ftrace_lock around hash updating
Cleanup regex_lock and ftrace_lock locking points around
ftrace_ops hash update code.

The new rule is that regex_lock protects ops->*_hash
read-update-write code for each ftrace_ops. Usually,
hash update is done by following sequence.

1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move(actually, copy) back the local hash to ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.

This makes regex_lock protect #1-#4, and ftrace_lock
to protect #3, #4 and adding and removing ftrace_ops from the
ftrace_ops_list. The ftrace_lock protects #3 as well because
the move functions update the entries too.

Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:11:48 -04:00
Masami Hiramatsu
f04f24fb7e ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
Fix a deadlock on ftrace_regex_lock which happens when setting
an enable_event trigger on dynamic kprobe event as below.

----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter

=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
 (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0

but task is already holding lock:
 (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(ftrace_regex_lock);
  lock(ftrace_regex_lock);

 *** DEADLOCK ***
----

To fix that, this introduces a finer regex_lock for each ftrace_ops.
ftrace_regex_lock is too big of a lock which protects all
filter/notrace_hash operations, but it doesn't need to be a global
lock after supporting multiple ftrace_ops because each ftrace_ops
has its own filter/notrace_hash.

Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added initialization flag and automate mutex initialization for
  non ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:10:22 -04:00
Al Viro
c5ddd2024a switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 14:53:20 -04:00
Al Viro
91c2e0bcae unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 13:46:38 -04:00
Steven Rostedt (Red Hat)
7c088b5120 ftrace: Have ftrace_regex_write() return either read or error
As ftrace_regex_write() reads the result of ftrace_process_regex()
which can sometimes return a positive number, only consider a
failure if the return is negative. Otherwise, it will skip possible
other registered probes and by returning a positive number that
wasn't read, it will confuse the user processes doing the writing.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:35:12 -04:00
Steven Rostedt (Red Hat)
ff305ded9f tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
register_ftrace_function_probe() returns the number of functions
it registered, which can be zero, it can also return a negative number
if something went wrong. But event_enable_func() only checks for
the case that it didn't register anything, it needs to also check
for the case that something went wrong and return that error code
as well.

Added some comments about the code as well, to make it more
understandable.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:30:26 -04:00
Masami Hiramatsu
a5b85bd155 tracing: Don't succeed if event_enable_func did not register anything
Return 0 instead of the number of activated ftrace function probes if
event_enable_func succeeded and return an error code if it failed or
did not register any functions. But it currently returns the number
of registered functions and if it didn't register anything, it returns 0,
but that is considered success.

This also fixes the return value. As if it succeeds, it returns the
number of functions that were enabled, which is returned back to
the user in ftrace_regex_write (the write() return code). If only
one function is enabled, then the return code of the write is one,
and this can confuse the user program in thinking it only wrote 1
byte.

Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:26:01 -04:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Eric Paris
2a0b4be6dd audit: fix message spacing printing auid
The helper function didn't include a leading space, so it was jammed
against the previous text in the audit record.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-08 00:02:19 -04:00
Linus Torvalds
5af43c24ca Merge branch 'akpm' (incoming from Andrew)
Merge more incoming from Andrew Morton:

 - Various fixes which were stalled or which I picked up recently

 - A large rotorooting of the AIO code.  Allegedly to improve
   performance but I don't really have good performance numbers (I might
   have lost the email) and I can't raise Kent today.  I held this out
   of 3.9 and we could give it another cycle if it's all too late/scary.

I ended up taking only the first two thirds of the AIO rotorooting.  I
left the percpu parts and the batch completion for later.  - Linus

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
  aio: don't include aio.h in sched.h
  aio: kill ki_retry
  aio: kill ki_key
  aio: give shared kioctx fields their own cachelines
  aio: kill struct aio_ring_info
  aio: kill batch allocation
  aio: change reqs_active to include unreaped completions
  aio: use cancellation list lazily
  aio: use flush_dcache_page()
  aio: make aio_read_evt() more efficient, convert to hrtimers
  wait: add wait_event_hrtimeout()
  aio: refcounting cleanup
  aio: make aio_put_req() lockless
  aio: do fget() after aio_get_req()
  aio: dprintk() -> pr_debug()
  aio: move private stuff out of aio.h
  aio: add kiocb_cancel()
  aio: kill return value of aio_complete()
  char: add aio_{read,write} to /dev/{null,zero}
  aio: remove retry-based AIO
  ...
2013-05-07 20:49:51 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Eric Paris
82d8da0d46 Revert "audit: move kaudit thread start from auditd registration to kaudit init"
This reverts commit 6ff5e45985.

Conflicts:
	kernel/audit.c

This patch was starting a kthread for all the time.  Since the follow on
patches that required it didn't get finished in 3.10 time, we shouldn't
ship this change in 3.10.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:21 -04:00
Eric W. Biederman
780a7654ce audit: Make testing for a valid loginuid explicit.
audit rule additions containing "-F auid!=4294967295" were failing
with EINVAL because of a regression caused by e1760bd.

Apparently some userland audit rule sets want to know if loginuid uid
has been set and are using a test for auid != 4294967295 to determine
that.

In practice that is a horrible way to ask if a value has been set,
because it relies on subtle implementation details and will break
every time the uid implementation in the kernel changes.

So add a clean way to test if the audit loginuid has been set, and
silently convert the old idiom to the cleaner and more comprehensible
new idiom.

Cc: <stable@vger.kernel.org> # 3.7
Reported-By: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:15 -04:00
Jiri Olsa
52d857a878 perf: Factor out auxiliary events notification
Add perf_event_aux() function to send out all types of
auxiliary events - mmap, task, comm events. For each type
there's match and output functions defined and used as
callbacks during perf_event_aux processing.

This way we can centralize the pmu/context iterating and
event matching logic. Also since lot of the code was
duplicated, this patch reduces the .text size about 2kB
on my setup:

  snipped output from 'objdump -x kernel/events/core.o'

  before:
  Idx Name          Size
    0 .text         0000d313

  after:
  Idx Name          Size
    0 .text         0000cad3

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:17:29 +02:00
Jiri Olsa
524eff183f perf: Fix EXIT event notification
The perf_event_task_ctx() function needs to be called with
preemption disabled, since it's checking for currently
scheduled cpu against event cpu.

We disable preemption for task related perf event context
if there's one defined, leaving up to the chance which cpu
it gets scheduled in.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:17:28 +02:00
Paul Gortmaker
8527632dc9 sched: Move update_load_*() methods from sched.h to fair.c
These inlines are only used by kernel/sched/fair.c so they do
not need to be present in the main kernel/sched/sched.h file.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-3-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:14:51 +02:00
Paul Gortmaker
45ceebf776 sched: Factor out load calculation code from sched/core.c --> sched/proc.c
This large chunk of load calculation code can be easily divorced
from the main core.c scheduler file, with only a couple
prototypes and externs added to a kernel/sched header.

Some recent commits expanded the code and the documentation of
it, making it large enough to warrant separation.  For example,
see:

  556061b, "sched/nohz: Fix rq->cpu_load[] calculations"
  5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more"
  5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again"

More importantly, it helps reduce the size of the main
sched/core.c by yet another significant amount (~600 lines).

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:14:50 +02:00
Benjamin Herrenschmidt
5fe0c1f2f0 irqdomain: Allow quiet failure mode
Some interrupt controllers refuse to map interrupts marked as
"protected" by firwmare. Since we try to map everyting in the
device-tree on some platforms, we end up with a lot of nasty
WARN's in the boot log for what is a normal situation on those
machines.

This defines a specific return code (-EPERM) from the host map()
callback which cause irqdomain to fail silently.

MPIC is updated to return this when hitting a protected source
printing only a single line message for diagnostic purposes.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06 11:37:43 +10:00
Linus Torvalds
534c97b095 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be able
     to utilize a maximum amount of CPU power.  A periodic timer tick at
     HZ=1000 can cause a constant overhead of up to 1.0%.  This feature
     removes that overhead - and speeds up the system by 0.5%-1.0% on
     typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible.  The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this feature
     helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option.  This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on a
     CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything.  CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature.  The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch is
     small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window.  The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so I
     marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used.  That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick.  In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics.  This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like you to make input on
     its design/implementation, if you dislike some aspect we missed.
     Without flaming us to crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU.  We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
2013-05-05 13:23:27 -07:00
Linus Torvalds
64049d1973 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes plus a small hw-enablement patch for Intel IB model 58
  uncore events"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
  perf/x86/intel/lbr: Fix LBR filter
  perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
  perf: Fix vmalloc ring buffer pages handling
  perf/x86/intel: Fix unintended variable name reuse
  perf/x86/intel: Add support for IvyBridge model 58 Uncore
  perf/x86/intel: Fix typo in perf_event_intel_uncore.c
  x86: Eliminate irq_mis_count counted in arch_irq_stat
2013-05-05 11:37:16 -07:00
Linus Torvalds
f8ce1faf55 We get rid of the general module prefix confusion with a binary config option,
fix a remove/insert race which Never Happens, and (my favorite) handle the
 case when we have too many modules for a single commandline.  Seriously,
 the kernel is full, please go away!
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRgbGEAAoJENkgDmzRrbjx2+QP/jXs93K/sXw3rL0vBklwCFv6
 IPZmqYZiGjrzqlB4coWkgYRwW1oOsREfAjF5MmfPdykS3fO5kXfdxN4FBdfKp+IZ
 RdsycdGDuSxWomgYsivrrxLBDxDAX1VuBOjr6mu5Uuk/pCjFa61cfJDiErsu0jKz
 2EMTc98A+E71XamJdvbtal5MUIu9yeluJWG2ux2+VbCul4MSpMc//0n2nrws/RCB
 AoC96AT/Xf4U10a8zT8RfCJ29M5Vvx/KfTIcFiZvtCQxEaHNNmj831gDNiw/3jFI
 ndRph+VLHBsMoBMxfzNRrM+evqkq8+AGEGRj3ycQy5Pa6DunPyzMafWOVGBGnmaS
 tl9hATGx1438048i5tUn8ieAYG1YL1HM83hQovpCThfUKQMiq186iDt1SYYmlq3g
 0thj3znQqZDYhboPtgWzOMUdqOG/iBIKjhGQjjHZs+MInFgxL2hmax0gBNkvEtQb
 oLyfGbF6UjS7I/Md/HohnUQ4xr9kYa3MQeqPjKbRwgHRkdXhzTEZtI+MYDJBxOnW
 QGVQ97aJ2WA7vC7sz/1VhTcZqmU5zfrSc8lF+Ea+H8dQGHHbz8HxKQacEvKcMrXl
 OJyEkRUWDA0MTjeIHzn2fff9Q6/qqA1QejRiFofGJrpxopcJS84/7yA0repxvuMG
 yaMPsLq53UW37/AXYsho
 =MPiD
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull mudule updates from Rusty Russell:
 "We get rid of the general module prefix confusion with a binary config
  option, fix a remove/insert race which Never Happens, and (my
  favorite) handle the case when we have too many modules for a single
  commandline.  Seriously, the kernel is full, please go away!"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
  X.509: Support parse long form of length octets in Authority Key Identifier
  module: don't unlink the module until we've removed all exposure.
  kernel: kallsyms: memory override issue, need check destination buffer length
  MODSIGN: do not send garbage to stderr when enabling modules signature
  modpost: handle huge numbers of modules.
  modpost: add -T option to read module names from file/stdin.
  modpost: minor cleanup.
  genksyms: pass symbol-prefix instead of arch
  module: fix symbol versioning with symbol prefixes
  CONFIG_SYMBOL_PREFIX: cleanup.
2013-05-05 10:58:06 -07:00
Thomas Gleixner
fbd44a607a tick: Use zalloc_cpumask_var for allocating offstack cpumasks
commit b352bc1cbc (tick: Convert broadcast cpu bitmaps to
cpumask_var_t) broke CONFIG_CPUMASK_OFFSTACK in a very subtle way.

Instead of allocating the cpumasks with zalloc_cpumask_var it uses
alloc_cpumask_var, so we can get random data there, which of course
confuses the logic completely and causes random failures.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305032015060.2990@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-05 11:12:19 +02:00
Al Viro
7ee2b9e564 rcutrace: single_open() leaks
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-05 00:16:35 -04:00
Linus Torvalds
bd932ae1bd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second round of VFS updates from Al Viro:
 "Assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()"
  hostfs: use kmalloc instead of kzalloc
  hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h>
  hostfs: remove "will unlock" comment
  vfs: use list_move instead of list_del/list_add
  proc_devtree: Replace include linux/module.h with linux/export.h
  create_mnt_ns: unidiomatic use of list_add()
  fs: remove dentry_lru_prune()
  Removed unused typedef to avoid "unused local typedef" warnings.
  kill fs/read_write.h
  fs: Fix hang with BSD accounting on frozen filesystem
  sun3_scsi: add ->show_info()
  nubus: Kill nubus_proc_detach_device()
  more mode_t whack-a-mole...
  do_coredump(): don't wait for thaw if coredump has already been interrupted
  do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")
2013-05-04 13:29:38 -07:00
Jan Kara
5ae98f1589 fs: Fix hang with BSD accounting on frozen filesystem
When BSD process accounting is enabled and logs information to a
filesystem which gets frozen, system easily becomes unusable because
each attempt to account process information blocks. Thus e.g. every task
gets blocked in exit.

It seems better to drop accounting information (which can already happen
when filesystem is running out of space) instead of locking system up.
So we just skip the write if the filesystem is frozen.

Reported-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 14:57:58 -04:00
Frederic Weisbecker
265f22a975 sched: Keep at least 1 tick per second for active dynticks tasks
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.

In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.

This makes sure that we keep the progression of various scheduler
accounting and background maintainance even with a very low granularity.
Examples include cpu load, sched average, CFS entity vruntime,
avenrun and events such as load balancing, amongst other details
handled in sched_class::task_tick().

This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-04 08:32:02 +02:00
Frederic Weisbecker
73c3082877 rcu: Fix full dynticks' dependency on wide RCU nocb mode
Commit 0637e02939
("nohz: Select wide RCU nocb for full dynticks") intended
to force CONFIG_RCU_NOCB_CPU_ALL=y when full dynticks is
enabled.

However this option is part of a choice menu and Kconfig's
"select" instruction has no effect on such targets.

Fix this by using reverse dependencies on the targets we
don't want instead.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-04 08:30:34 +02:00
Steven Rostedt (Red Hat)
2228768885 ring-buffer: Select IRQ_WORK
As the wake up logic for waiters on the buffer has been moved
from the tracing code to the ring buffer, it requires also adding
IRQ_WORK as the wake up code is performed via irq_work.

This fixes compile breakage when a user of the ring buffer is selected
but tracing and irq_work are not.

Link http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost

Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-03 19:24:17 -04:00
Linus Torvalds
20a2078ce7 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "This is the main drm pull request for 3.10.

  Wierd bits:
   - OMAP drm changes required OMAP dss changes, in drivers/video, so I
     took them in here.
   - one more fbcon fix for font handover
   - VT switch avoidance in pm code
   - scatterlist helpers for gpu drivers - have acks from akpm

  Highlights:
   - qxl kms driver - driver for the spice qxl virtual GPU

  Nouveau:
   - fermi/kepler VRAM compression
   - GK110/nvf0 modesetting support.

  Tegra:
   - host1x core merged with 2D engine support

  i915:
   - vt switchless resume
   - more valleyview support
   - vblank fixes
   - modesetting pipe config rework

  radeon:
   - UVD engine support
   - SI chip tiling support
   - GPU registers initialisation from golden values.

  exynos:
   - device tree changes
   - fimc block support

  Otherwise:
   - bunches of fixes all over the place."

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
  qxl: update to new idr interfaces.
  drm/nouveau: fix build with nv50->nvc0
  drm/radeon: fix handling of v6 power tables
  drm/radeon: clarify family checks in pm table parsing
  drm/radeon: consolidate UVD clock programming
  drm/radeon: fix UPLL_REF_DIV_MASK definition
  radeon: add bo tracking debugfs
  drm/radeon: add new richland pci ids
  drm/radeon: add some new SI PCI ids
  drm/radeon: fix scratch reg handling for UVD fence
  drm/radeon: allocate SA bo in the requested domain
  drm/radeon: fix possible segfault when parsing pm tables
  drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
  OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
  OMAPDSS: VENC: Add error handling for venc_probe_pdata
  OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
  OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
  OMAPDSS: DSI: Add error handling for dsi_probe_pdata
  OMAPDSS: SDI: Add error handling for sdi_probe_pdata
  OMAPDSS: DPI: Add error handling for dpi_probe_pdata
  ...
2013-05-02 19:40:34 -07:00
Linus Torvalds
0279b3c0ad Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "This fixes the cputime scaling overflow problems for good without
  having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
  as well."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "math64: New div64_u64_rem helper"
  sched: Avoid prev->stime underflow
  sched: Do not account bogus utime
  sched: Avoid cputime scaling overflow
2013-05-02 14:56:31 -07:00
Frederic Weisbecker
c032862fba Merge commit '8700c95adb03' into timers/nohz
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.

Merge a common upstream merge point that has these
updates.

Conflicts:
	include/linux/perf_event.h
	kernel/rcutree.h
	kernel/rcutree_plugin.h

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-05-02 17:54:19 +02:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
David Howells
a8ca16ea7b proc: Supply a function to remove a proc entry by PDE
Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Grant Likely <grant.likely@linaro.or>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
Al Viro
8d8b97ba49 take cgroup_open() and cpuset_open() to fs/proc/base.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
David Howells
0bb80f2405 proc: Split the namespace stuff out into linux/proc_ns.h
Split the proc namespace stuff out into linux/proc_ns.h.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:39 -04:00
David Howells
271a15eabe proc: Supply PDE attribute setting accessor functions
Supply accessor functions to set attributes in proc_dir_entry structs.

The following are supplied: proc_set_size() and proc_set_user().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
cc: linuxppc-dev@lists.ozlabs.org
cc: linux-media@vger.kernel.org
cc: netdev@vger.kernel.org
cc: linux-wireless@vger.kernel.org
cc: linux-pci@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:18 -04:00
Linus Torvalds
73287a43cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of multiple
      MAC addresses can provide the forwarding database add and del
      calls by providing a default implementation and hooking that up if
      the driver doesn't have an explicit set of handlers.  From Vlad
      Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations where
      the PHY registers have to be written for it to be configured.

      Use it to support wake-on-lan in mv643xx_eth.

      From Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configurating various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
      from Dmitry Kravkov.

  12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
      Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket, or
      per-cpu packet receive queue, minimize drop by distributing the
      load to other cpus/fanouts.  From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to mdoule_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool, from
      Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change occurs
      in the middle of an address or device list dump, from Nicolas
      Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler, from
      Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling more
      scalable and less susceptible to overload and locking contention,
      from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead.  From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting where
      possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue.  From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting, from
      Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket dumping
      and the info provided by netlink socket dumping of AF_PACKET
      sockets.  From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
2013-05-01 14:08:52 -07:00
Linus Torvalds
08d7676083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
 "Mostly about syscall wrappers this time; there will be another pile
  with patches in the same general area from various people, but I'd
  rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  syscalls.h: slightly reduce the jungles of macros
  get rid of union semop in sys_semctl(2) arguments
  make do_mremap() static
  sparc: no need to sign-extend in sync_file_range() wrapper
  ppc compat wrappers for add_key(2) and request_key(2) are pointless
  x86: trim sys_ia32.h
  x86: sys32_kill and sys32_mprotect are pointless
  get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
  merge compat sys_ipc instances
  consolidate compat lookup_dcookie()
  convert vmsplice to COMPAT_SYSCALL_DEFINE
  switch getrusage() to COMPAT_SYSCALL_DEFINE
  switch epoll_pwait to COMPAT_SYSCALL_DEFINE
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
  switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
  make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
  make HAVE_SYSCALL_WRAPPERS unconditional
  consolidate cond_syscall and SYSCALL_ALIAS declarations
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
  get rid of duplicate logics in __SC_....[1-6] definitions
2013-05-01 07:21:43 -07:00
Jiri Olsa
5919b30933 perf: Fix vmalloc ring buffer pages handling
If we allocate perf ring buffer with the size of single (user)
page, we will get memory corruption when releasing itin
rb_free_work function (for CONFIG_PERF_USE_VMALLOC option).

For single page sized ring buffer the page_order is -1 (because
nr_pages is 0). This needs to be recognized in the rb_free_work
function to release proper amount of pages.

Adding data_page_nr function that returns number of allocated
data pages. Customizing the rest of the code to use it.

Reported-by: Jan Stancek <jstancek@redhat.com>
Original-patch-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-01 12:34:46 +02:00
Linus Torvalds
5f56886521 Merge branch 'akpm' (incoming from Andrew)
Merge third batch of fixes from Andrew Morton:
 "Most of the rest.  I still have two large patchsets against AIO and
  IPC, but they're a bit stuck behind other trees and I'm about to
  vanish for six days.

   - random fixlets
   - inotify
   - more of the MM queue
   - show_stack() cleanups
   - DMI update
   - kthread/workqueue things
   - compat cleanups
   - epoll udpates
   - binfmt updates
   - nilfs2
   - hfs
   - hfsplus
   - ptrace
   - kmod
   - coredump
   - kexec
   - rbtree
   - pids
   - pidns
   - pps
   - semaphore tweaks
   - some w1 patches
   - relay updates
   - core Kconfig changes
   - sysrq tweaks"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
  Documentation/sysrq: fix inconstistent help message of sysrq key
  ethernet/emac/sysrq: fix inconstistent help message of sysrq key
  sparc/sysrq: fix inconstistent help message of sysrq key
  powerpc/xmon/sysrq: fix inconstistent help message of sysrq key
  ARM/etm/sysrq: fix inconstistent help message of sysrq key
  power/sysrq: fix inconstistent help message of sysrq key
  kgdb/sysrq: fix inconstistent help message of sysrq key
  lib/decompress.c: fix initconst
  notifier-error-inject: fix module names in Kconfig
  kernel/sys.c: make prctl(PR_SET_MM) generally available
  UAPI: remove empty Kbuild files
  menuconfig: print more info for symbol without prompts
  init/Kconfig: re-order CONFIG_EXPERT options to fix menuconfig display
  kconfig menu: move Virtualization drivers near other virtualization options
  Kconfig: consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS
  relay: use macro PAGE_ALIGN instead of FIX_SIZE
  kernel/relay.c: move FIX_SIZE macro into relay.c
  kernel/relay.c: remove unused function argument actor
  drivers/w1/slaves/w1_ds2760.c: fix the error handling in w1_ds2760_add_slave()
  drivers/w1/slaves/w1_ds2781.c: fix the error handling in w1_ds2781_add_slave()
  ...
2013-04-30 17:37:43 -07:00
zhangwei(Jovi)
28ad585e35 power/sysrq: fix inconstistent help message of sysrq key
Currently help message of /proc/sysrq-trigger highlight its
upper-case characters, like below:

      SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
      memory-full-oom-kill(F) kill-all-tasks(I) ...

this would confuse user trigger sysrq by upper-case character, which is
inconsistent with the real lower-case character registed key.

This inconsistent help message will also lead more confused when
26 upper-case letters put into use in future.

This patch fix power off sysrq key: "poweroff(o)"

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:10 -07:00
zhangwei(Jovi)
f345650964 kgdb/sysrq: fix inconstistent help message of sysrq key
Currently help message of /proc/sysrq-trigger highlight its upper-case
characters, like below:

      SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
      memory-full-oom-kill(F) kill-all-tasks(I) ...

this would confuse user trigger sysrq by upper-case character, which is
inconsistent with the real lower-case character registed key.

This inconsistent help message will also lead more confused when
26 upper-case letters put into use in future.

This patch fix kgdb sysrq key: "debug(g)"

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:10 -07:00
Amnon Shiloh
52b3694157 kernel/sys.c: make prctl(PR_SET_MM) generally available
The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:

      start_code, end_code, start_data, end_data, start_brk, brk,
      start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file.  This includes:

1. Restoring processes from a checkpoint-file (by all potential
   user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
   and high-availablity).
4. Starting a process from an executable format that is not supported
   by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
   executable that, for confidentiality, licensing or other reasons,
   may not be written to the local file-systems.

The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled.  The patch removes those ifdefs.

Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
a05342cbd6 relay: use macro PAGE_ALIGN instead of FIX_SIZE
Macro FIX_SIZE is same as PAGE_ALIGN at present, so use PAGE_ALIGN
instead.

Thanks Andrew found this.

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
536b39ecf1 kernel/relay.c: move FIX_SIZE macro into relay.c
It's better to place FIX_SIZE macro in relay.c, instead of relay.h

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
8359f689e2 kernel/relay.c: remove unused function argument actor
Currently argument `actor' is never used in the relay reading path, so
remove it.

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
liguang
06a6ea3702 semaphore: use `bool' type for semaphore_waiter's up
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
liguang
c74f66ce10 semaphore: use unlikely() for down's timeout
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
Raphael S.Carvalho
5cc5445164 pid_namespace.c/.h: simplify defines
Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

[akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Raphael S. Carvalho
8db049b3d6 kernel/pid.c: improve flow of a loop inside alloc_pidmap.
find_next_offset() searches for an available "cleaned bit" in the
respective pid bitmap (page), so returns the offset if found, otherwise
it returns a value equals to BITS_PER_PAGE.

For example, suppose find_next_offset didn't find any available bit, so
there's no purpose to call mk_pid (Wasteful Cpu Cycles).

Therefore, I found it could be better to call mk_pid after the checking
(offset < BITS_PER_PAGE) returned sucessfully! Another point: If (offset
< BITS_PER_PAGE) results in a "failure", then mk_pid would be called
again afterwards.

[akpm@linux-foundation.org: simplify code]
Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Zhang Yanfei
31c3a3fe07 kexec: Use min() and min_t() to simplify logic
Simplify the logic of variable assignments.

[akpm@linux-foundation.org: replace min_t with min, remove unneeded casts]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Zhang Yanfei
310faaa9b2 kexec: fix wrong types of some local variables
The types of the following local variables:

- ubytes/mbytes in kimage_load_crash_segment()/kimage_load_normal_segment()

- r in vmcoreinfo_append_str()

are wrong, so fix them.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Oleg Nesterov
403bad72b6 coredump: only SIGKILL should interrupt the coredumping task
There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:

- do_coredump() clears TIF_SIGPENDING but of course this can't help
  if, say, SIGCHLD comes after that.

  In this case the coredump can fail unexpectedly. See for example
  wait_for_dump_helper()->signal_pending() check but there are other
  reasons.

- At the same time, dumping a huge core on the slow media can take a
  lot of time/resources and there is no way to kill the coredumping
  task reliably. In particular this is not oom_kill-friendly.

This patch tries to fix the 1st problem, and makes the preparation for the
next changes.

We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core.  prepare_signal() checks this flag and
nacks any signal except SIGKILL.

Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion.  See marc.info/?l=linux-kernel&m=120508897917439

Notes:
	- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
	  The patch assumes that dump_write/etc paths should never
	  call it, but we can change it as well.

	- There is another source of TIF_SIGPENDING, freezer. This
	  will be addressed separately.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
66e5b7e194 kmod: remove call_usermodehelper_fns()
This function suffers from not being able to determine if the cleanup is
called in case it returns -ENOMEM.  Nobody is using it anymore, so let's
remove it.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
f634460c90 kmod: split call to call_usermodehelper_fns()
Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns().  In case the latter returns -ENOMEM the
cleanup function may had not been called - in this case we would not free
argv and module_name.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
938e4b22e2 usermodehelper: export call_usermodehelper_exec() and call_usermodehelper_setup()
call_usermodehelper_setup() + call_usermodehelper_exec() need to be
called instead of call_usermodehelper_fns() when the cleanup function
needs to be called even when an ENOMEM error occurs.  In this case using
call_usermodehelper_fns() the user can't distinguish if the cleanup
function was called or not.

[akpm@linux-foundation.org: export call_usermodehelper_setup() to modules]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:05 -07:00
Andrey Vagin
84c751bd4a ptrace: add ability to retrieve signals without removing from a queue (v4)
This patch adds a new ptrace request PTRACE_PEEKSIGINFO.

This request is used to retrieve information about pending signals
starting with the specified sequence number.  Siginfo_t structures are
copied from the child into the buffer starting at "data".

The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
struct ptrace_peeksiginfo_args {
	u64 off;	/* from which siginfo to start */
	u32 flags;
	s32 nr;		/* how may siginfos to take */
};

"nr" has type "s32", because ptrace() returns "long", which has 32 bits on
i386 and a negative values is used for errors.

Currently here is only one flag PTRACE_PEEKSIGINFO_SHARED for dumping
signals from process-wide queue.  If this flag is not set, signals are
read from a per-thread queue.

The request PTRACE_PEEKSIGINFO returns a number of dumped signals.  If a
signal with the specified sequence number doesn't exist, ptrace returns
zero.  The request returns an error, if no signal has been dumped.

Errors:
EINVAL - one or more specified flags are not supported or nr is negative
EFAULT - buf or addr is outside your accessible address space.

A result siginfo contains a kernel part of si_code which usually striped,
but it's required for queuing the same siginfo back during restore of
pending signals.

This functionality is required for checkpointing pending signals.  Pedro
Alves suggested using it in "gdb" to peek at pending signals.  gdb already
uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
dequeued.  This functionality allows gdb to look at the pending signals
which were not reported yet.

The prototype of this code was developed by Oleg Nesterov.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pedro Alves <palves@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:05 -07:00
Stephen Rothwell
4a22f16636 kernel/timer.c: move some non timer related syscalls to kernel/sys.c
Andrew Morton noted:

	akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
	SYSCALL_DEFINE1(alarm, unsigned int, seconds)
	SYSCALL_DEFINE0(getpid)
	SYSCALL_DEFINE0(getppid)
	SYSCALL_DEFINE0(getuid)
	SYSCALL_DEFINE0(geteuid)
	SYSCALL_DEFINE0(getgid)
	SYSCALL_DEFINE0(getegid)
	SYSCALL_DEFINE0(gettid)
	SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
	COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

	Only one of those should be in kernel/timer.c.  Who wrote this thing?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Stephen Rothwell
1043f65a57 kernel/timer.c: convert compat_sys_sysinfo to COMPAT_SYSCALL_DEFINE
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Stephen Rothwell
1a0df59444 kernel/compat.c: make do_sysinfo() static
The only use outside of kernel/timer.c was in kernel/compat.c, so move
compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Andrew Morton
e1d12f3270 kernel/smp.c: cleanups
We sometimes use "struct call_single_data *data" and sometimes "struct
call_single_data *csd".  Use "csd" consistently.

We sometimes use "struct call_function_data *data" and sometimes "struct
call_function_data *cfd".  Use "cfd" consistently.

Also, avoid some 80-col layout tricks.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
liguang
3440a1ca99 kernel/smp.c: remove 'priv' of call_single_data
The 'priv' field is redundant; we can pass data via 'info'.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
liguang
1def1dc917 kernel/smp.c: use '|=' for csd_lock
csd_lock() uses assignment to data->flags rather than |=.  That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.

But it will become buggy if we later add another flag, so fix it now.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
3d1cb2059d workqueue: include workqueue info when printing debug dump of a worker task
One of the problems that arise when converting dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonimizes each worker making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.

This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description.  When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.

The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info().  print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields.  It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.

The description is currently limited to 24bytes including the
terminating \0.  worker->desc_valid and workder->desc[] are added and
the 64 bytes marker which was already incorrect before adding the new
fields is moved to the correct position.

Here's an example dump with writeback updated to set the bdi name as
worker desc.

 Hardware name: Bochs
 Modules linked in:
 Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
 Workqueue: writeback bdi_writeback_workfn (flush-8:0)
  ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
  ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
  ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
 Call Trace:
  [<ffffffff81c61845>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
 ...

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
cd42d559e4 kthread: implement probe_kthread_data()
One of the problems that arise when converting dedicated custom threadpool
to workqueue is that the shared worker pool used by workqueue anonimizes
each worker making it more difficult to identify what the worker was doing
on which target from the output of sysrq-t or debug dump from oops, BUG()
and friends.

For example, after writeback is converted to use workqueue instead of
priviate thread pool, there's no easy to tell which backing device a
writeback work item was working on at the time of task dump, which,
according to our writeback brethren, is important in tracking down issues
with a lot of mounted file systems on a lot of different devices.

This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.

An example WARN dump would look like the following.

 WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
 Modules linked in:
 CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
 Workqueue: writeback bdi_writeback_workfn (flush-8:16)
  ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
  ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
  ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
 Call Trace:
  [<ffffffff81c61855>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  ...

This patch:

Implement probe_kthread_data() which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread or its vfork_done is already cleared rendering
struct kthread inaccessible.  In the former case, probe_kthread_data() may
return any value.  In the latter, NULL.

This will be used to safely print debug information without affecting
synchronization in the normal paths.  Workqueue debug info printing on
dump_stack() and friends will make use of it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Vineet Gupta
681a90ffe8 arc, print-fatal-signals: reduce duplicated information
After the recent generic debug info on dump_stack() and friends, arc
is printing duplicate information on debug dumps.

 [ARCLinux]$ ./crash
 crash/50: potentially unexpected fatal signal 11.	<-- [1]
 /sbin/crash, TGID 50					<-- [2]
 Pid: 50, comm: crash Not tainted 3.9.0-rc4+ #132 	<-- [3]
 ...

Remove them.

[tj@kernel.org: updated patch desc]
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
a43cb95d54 dump_stack: unify debug information printed by show_regs()
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms.  This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.

show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.

* Archs which didn't print debug info now do.

  alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
  metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
  um, xtensa

* Already prints debug info.  Replaced with show_regs_print_info().
  The printed information is superset of what used to be there.

  arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86

* s390 is special in that it used to print arch-specific information
  along with generic debug info.  Heiko and Martin think that the
  arch-specific extra isn't worth keeping s390 specfic implementation.
  Converted to use the generic version.

Note that now all archs print the debug info before actual register
dumps.

An example BUG() dump follows.

 kernel BUG at /work/os/work/kernel/workqueue.c:4841!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
 task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
 RIP: 0010:[<ffffffff8234a07e>]  [<ffffffff8234a07e>] init_workqueues+0x4/0x6
 RSP: 0000:ffff88007c861ec8  EFLAGS: 00010246
 RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
 RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
 RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Stack:
  ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
  0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
  ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
 Call Trace:
  [<ffffffff81000312>] do_one_initcall+0x122/0x170
  [<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  [<ffffffff81c4776e>] kernel_init+0xe/0xf0
  [<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  ...

v2: Typo fix in x86-32.

v3: CPU number dropped from show_regs_print_info() as
    dump_stack_print_info() has been updated to print it.  s390
    specific implementation dropped as requested by s390 maintainers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>		[tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
98e5e1bf72 dump_stack: implement arch-specific hardware description in task dumps
x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.

* x86 show_regs() collects vendor, product and board strings and print
  them out with PID, comm and utsname.  Some of the information is
  printed again later in the same dump.

* warn_slowpath_common() explicitly accesses the DMI board and prints
  it out with "Hardware name:" label.  This applies to both x86 and
  ia64 but is irrelevant on all other archs.

* ia64 doesn't show DMI information on other non-WARN dumps.

This patch introduces arch-specific hardware description used by
dump_stack().  It can be set by calling dump_stack_set_arch_desc()
during boot and, if exists, printed out in a separate line with
"Hardware name:" label.

dmi_set_dump_stack_arch_desc() is added which sets arch-specific
description from DMI data.  It uses dmi_ids_string[] which is set from
dmi_present() used for DMI debug message.  It is superset of the
information x86 show_regs() is using.  The function is called from x86
and ia64 boot code right after dmi_scan_machine().

This makes the explicit DMI handling in warn_slowpath_common()
unnecessary.  Removed.

show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs().  The next patch will unify show_regs() handling and
remove the duplication.

An example WARN dump follows.

 WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
 Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
  ...

v2: Use the same string as the debug message from dmi_present() which
    also contains BIOS information.  Move hardware name into its own
    line as warn_slowpath_common() did.  This change was suggested by
    Bjorn Helgaas.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
196779b9b4 dump_stack: consolidate dump_stack() implementations and unify their behaviors
Both dump_stack() and show_stack() are currently implemented by each
architecture.  show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack().  On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.

The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().

There's no reason to require archs to implement two separate but mostly
identical functions.  It leads to unnecessary subtle information.

This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin.  Blackfin's
dump_stack() does something wonky that I don't understand.

Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information.  This is used
in blackfin.

This patch brings the following behavior changes.

* On some archs, an extra level in backtrace for show_stack() could be
  printed.  This is because the top frame was determined in
  dump_stack() on those archs while generic dump_stack() can't do that
  reliably.  It can be compensated by inlining dump_stack() but not
  sure whether that'd be necessary.

* Most archs didn't use to print debug info on dump_stack().  They do
  now.

An example WARN dump follows.

 WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
 Hardware name: empty
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
 Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a071>] init_workqueues+0x35/0x505
  ...

v2: CPU number added to the generic debug info as requested by s390
    folks and dropped the s390 specific dump_stack().  This loses %ksp
    from the debug message which the maintainers think isn't important
    enough to keep the s390-specific dump_stack() implementation.

    dump_stack_print_info() is moved to kernel/printk.c from
    lib/dump_stack.c.  Because linkage is per objecct file,
    dump_stack_print_info() living in the same lib file as generic
    dump_stack() means that archs which implement custom dump_stack()
    - at this point, only blackfin - can't use dump_stack_print_info()
    as that will bring in the generic version of dump_stack() too.  v1
    The v1 patch broke build on blackfin due to this issue.  The build
    breakage was reported by Fengguang Wu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	[s390 bits]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Lin Feng
7fba2c27a3 kernel/range.c: subtract_range: fix the broken phrase issued by printk
Also replace deprecated printk(KERN_ERR...) with pr_err() as suggested
by Yinghai, attaching the function name to provide plenty info.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:01 -07:00
Linus Torvalds
2e1deaad1e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem update from James Morris:
 "Just some minor updates across the subsystem"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  ima: eliminate passing d_name.name to process_measurement()
  TPM: Retry SaveState command in suspend path
  tpm/tpm_i2c_infineon: Add small comment about return value of __i2c_transfer
  tpm/tpm_i2c_infineon.c: Add OF attributes type and name to the of_device_id table entries
  tpm_i2c_stm_st33: Remove duplicate inclusion of header files
  tpm: Add support for new Infineon I2C TPM (SLB 9645 TT 1.2 I2C)
  char/tpm: Convert struct i2c_msg initialization to C99 format
  drivers/char/tpm/tpm_ppi: use strlcpy instead of strncpy
  tpm/tpm_i2c_stm_st33: formatting and white space changes
  Smack: include magic.h in smackfs.c
  selinux: make security_sb_clone_mnt_opts return an error on context mismatch
  seccomp: allow BPF_XOR based ALU instructions.
  Fix NULL pointer dereference in smack_inode_unlink() and smack_inode_rmdir()
  Smack: add support for modification of existing rules
  smack: SMACK_MAGIC to include/uapi/linux/magic.h
  Smack: add missing support for transmute bit in smack_str_from_perm()
  Smack: prevent revoke-subject from failing when unseen label is written to it
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss
2013-04-30 16:27:51 -07:00
Linus Torvalds
3ed1c478ef Power management and ACPI updates for 3.10-rc1
- ARM big.LITTLE cpufreq driver from Viresh Kumar.
 
 - exynos5440 cpufreq driver from Amit Daniel Kachhap.
 
 - cpufreq core cleanup and code consolidation from Viresh Kumar and
   Stratos Karafotis.
 
 - cpufreq scalability improvement from Nathan Zimmer.
 
 - AMD "frequency sensitivity feedback" powersave bias for the ondemand
   cpufreq governor from Jacob Shin.
 
 - cpuidle code consolidation and cleanups from Daniel Lezcano.
 
 - ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.
 
 - ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim,
   Lv Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.
 
 - ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
   Yasuaki Ishimatsu, and Rafael J. Wysocki.
 
 - Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements
   from Rafael J. Wysocki and Andy Shevchenko.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJRf8M8AAoJEKhOf7ml8uNsud4P/3cabXP5lDipzibRrpOiONse
 puuvIdhtNdMRMc3t1oSDjNH/w/JA51Gc+ICGFAORiyVmqxBc85mpT6J5ibqV7hNd
 pCqbKJceoB5PajHZSx22e4wG9O7YN1k3r80p38IfFzA+Ct0KNSuE0ixMEfHKYjiq
 p5pXswk6TY3gtBReH9agrafHqDtXw4IMTE0asMuJ+BorPW7vQeiNlrkuA+0qmDuu
 26O0Pm2TVkx1ryfTjdM9zSZ9X2G4JuM8rm1/VFZWQJTExwlv3bA2Za1nvQNJlJ99
 6JZ0JXfAehcEW2Ye0sqsZ8HSEabDVHM29QvvOszJ5RpBXERiOCHOkhvFleCoTpn0
 Xq0rtXPrLMH1G28Ej+cxmsAjfzOLV2Byg30CAoI/GCLuQ+xh+VMCpuNYQuld25CG
 9rtYd0fWESeYsAebhDcX0E3xyzJtbrHtOb9PyGwNkbAJ8YQfhVSMCOPi2SX2wa+Q
 qXLXi2VaHvjBSUKcAv5BmM+Ya57Be+88D0LxbgXbUeOnYefUK1ljldKDDshkMjgG
 P4LPdm4JpoB5ncXSOO1Dz9w9QnNcFexSUySd/TtKLNMha1vEHV8ISzNPYY+9IdXf
 XN0VZbFnUDzdj+Fwna7zyFb1cGihDYJKAtpXvRd8Y6RGUxKx9uGLAFJZw/xZB/cR
 KZKuML5O8MgJuef37F38
 =H/se
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael J Wysocki:

 - ARM big.LITTLE cpufreq driver from Viresh Kumar.

 - exynos5440 cpufreq driver from Amit Daniel Kachhap.

 - cpufreq core cleanup and code consolidation from Viresh Kumar and
   Stratos Karafotis.

 - cpufreq scalability improvement from Nathan Zimmer.

 - AMD "frequency sensitivity feedback" powersave bias for the ondemand
   cpufreq governor from Jacob Shin.

 - cpuidle code consolidation and cleanups from Daniel Lezcano.

 - ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.

 - ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim, Lv
   Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.

 - ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
   Yasuaki Ishimatsu, and Rafael J Wysocki.

 - Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements from
   Rafael J Wysocki and Andy Shevchenko.

* tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (192 commits)
  cpufreq: Revert incorrect commit 5800043
  cpufreq: MAINTAINERS: Add co-maintainer
  cpuidle: add maintainer entry
  ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
  ARM: s3c64xx: cpuidle: use init/exit common routine
  cpufreq: pxa2xx: initialize variables
  ACPI: video: correct acpi_video_bus_add error processing
  SH: cpuidle: use init/exit common routine
  ARM: S5pv210: compiling issue, ARM_S5PV210_CPUFREQ needs CONFIG_CPU_FREQ_TABLE=y
  ACPI: Fix wrong parameter passed to memblock_reserve
  cpuidle: fix comment format
  pnp: use %*phC to dump small buffers
  isapnp: remove debug leftovers
  ARM: imx: cpuidle: use init/exit common routine
  ARM: davinci: cpuidle: use init/exit common routine
  ARM: kirkwood: cpuidle: use init/exit common routine
  ARM: calxeda: cpuidle: use init/exit common routine
  ARM: tegra: cpuidle: use init/exit common routine for tegra3
  ARM: tegra: cpuidle: use init/exit common routine for tegra2
  ARM: OMAP4: cpuidle: use init/exit common routine
  ...
2013-04-30 15:21:02 -07:00
Eric Paris
b24a30a730 audit: fix event coverage of AUDIT_ANOM_LINK
The userspace audit tools didn't like the existing formatting of the
AUDIT_ANOM_LINK event. It needed to be expanded to emit an AUDIT_PATH
event as well, so this implements the change. The bulk of the patch is
moving code out of auditsc.c into audit.c and audit.h for general use.
It expands audit_log_name to include an optional "struct path" argument
for the simple case of just needing to report a pathname. This also
makes
audit_log_task_info available when syscall auditing is not enabled,
since
it is needed in either case for process details.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Steve Grubb <sgrubb@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
7173c54e3a audit: use spin_lock in audit_receive_msg to process tty logging
This function is called when we receive a netlink message from
userspace.  We don't need to worry about it coming from irq context or
irqs making it re-entrant.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Richard Guy Briggs
46e959ea29 audit: add an option to control logging of passwords with pam_tty_audit
Most commands are entered one line at a time and processed as complete lines
in non-canonical mode.  Commands that interactively require a password, enter
canonical mode to do this while shutting off echo.  This pair of features
(icanon and !echo) can be used to avoid logging passwords by audit while still
logging the rest of the command.

Adding a member (log_passwd) to the struct audit_tty_status passed in by
pam_tty_audit allows control of canonical mode without echo per task.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
bde02ca858 audit: use spin_lock_irqsave/restore in audit tty code
Some of the callers of the audit tty function use spin_lock_irqsave/restore.
We were using the forced always enable version, which seems really bad.
Since I don't know every one of these code paths well enough, it makes
sense to just switch everything to the safe version.  Maybe it's a
little overzealous, but it's a lot better than an unlucky deadlock when
we return to a caller with irq enabled and they expect it to be
disabled.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
4d3fb709b2 helper for some session id stuff 2013-04-30 15:31:28 -04:00
Eric Paris
b122c3767c audit: use a consistent audit helper to log lsm information
We have a number of places we were reimplementing the same code to write
out lsm labels.  Just do it one darn place.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
152f497b9b audit: push loginuid and sessionid processing down
Since we are always current, we can push a lot of this stuff to the
bottom and get rid of useless interfaces and arguments.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
dc9eb698f4 audit: stop pushing loginid, uid, sessionid as arguments
We always use current.  Stop pulling this when the skb comes in and
pushing it around as arguments.  Just get it at the end when you need
it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
1890090916 audit: remove the old depricated kernel interface
We used to have an inflexible mechanism to add audit rules to the
kernel.  It hasn't been used in a long time.  Get rid of that stuff.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
ab61d38ed8 audit: make validity checking generic
We have 2 interfaces to send audit rules.  Rather than check validity of
things in 2 places make a helper function.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Stanislaw Gruszka
68aa8efcd1 sched: Avoid prev->stime underflow
Dave Hansen reported strange utime/stime values on his system:
https://lkml.org/lkml/2013/4/4/435

This happens because prev->stime value is bigger than rtime
value. Root of the problem are non-monotonic rtime values (i.e.
current rtime is smaller than previous rtime) and that should be
debugged and fixed.

But since problem did not manifest itself before commit
62188451f0 "cputime: Avoid
multiplication overflow on utime scaling", it should be threated
as regression, which we can easily fixed on cputime_adjust()
function.

For now, let's apply this fix, but further work is needed to fix
root of the problem.

Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:05 +02:00
Stanislaw Gruszka
772c808a25 sched: Do not account bogus utime
Due to rounding in scale_stime(), for big numbers, scaled stime
values will grow in chunks. Since rtime grow in jiffies and we
calculate utime like below:

	prev->stime = max(prev->stime, stime);
	prev->utime = max(prev->utime, rtime - prev->stime);

we could erroneously account stime values as utime. To prevent
that only update prev->{u,s}time values when they are smaller
than current rtime.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:04 +02:00
Stanislaw Gruszka
55eaa7c1f5 sched: Avoid cputime scaling overflow
Here is patch, which adds Linus's cputime scaling algorithm to the
kernel.

This is a follow up (well, fix) to commit
d9a3c9823a ("sched: Lower chances
of cputime scaling overflow") which commit tried to avoid
multiplication overflow, but did not guarantee that the overflow
would not happen.

Linus crated a different algorithm, which completely avoids the
multiplication overflow by dropping precision when numbers are
big.

It was tested by me and it gives good relative error of
scaled numbers. Testing method is described here:
http://marc.info/?l=linux-kernel&m=136733059505406&w=2

Originally-From: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:04 +02:00
Linus Torvalds
e486b4c4ba Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull extable dmesg fixlet from Ingo Molnar:
 "Small tweak to reduce kmsg boot time spam"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  extable: Flip the sorting message
2013-04-30 08:32:16 -07:00
Linus Torvalds
ab86e974f0 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Ingo Molnar:
 "The main changes in this cycle's merge are:

   - Implement shadow timekeeper to shorten in kernel reader side
     blocking, by Thomas Gleixner.

   - Posix timers enhancements by Pavel Emelyanov:

   - allocate timer ID per process, so that exact timer ID allocations
     can be re-created be checkpoint/restore code.

   - debuggability and tooling (/proc/PID/timers, etc.) improvements.

   - suspend/resume enhancements by Feng Tang: on certain new Intel Atom
     processors (Penwell and Cloverview), there is a feature that the
     TSC won't stop in S3 state, so the TSC value won't be reset to 0
     after resume.  This can be taken advantage of by the generic via
     the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
     recover/approximate sleep time, the main (and precise) clocksource
     can be used.

   - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
     CPUs the file goes beyond 4MB of size and thus the current
     simplistic seqfile approach fails.  Convert /proc/timer_list to a
     proper seq_file with its own iterator.

   - Cleanups and refactorings of the core timekeeping code by John
     Stultz.

   - International Atomic Clock time is managed by the NTP code
     internally currently but not exposed externally.  Separate the TAI
     code out and add CLOCK_TAI support and TAI support to the hrtimer
     and posix-timer code, by John Stultz.

   - Add deep idle support enhacement to the broadcast clockevents core
     timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
     clockevents feature (which will be utilized by future clockevents
     driver updates), which allows the use of IRQ affinities to avoid
     spurious wakeups of idle CPUs - the right CPU with an expiring
     timer will be woken.

   - Add new ARM bcm281xx clocksource driver, by Christian Daudt

   - ... various other fixes and cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  clockevents: Set dummy handler on CPU_DEAD shutdown
  timekeeping: Update tk->cycle_last in resume
  posix-timers: Remove unused variable
  clockevents: Switch into oneshot mode even if broadcast registered late
  timer_list: Convert timer list to be a proper seq_file
  timer_list: Split timer_list_show_tickdevices
  posix-timers: Show sigevent info in proc file
  posix-timers: Introduce /proc/PID/timers file
  posix timers: Allocate timer id per process (v2)
  timekeeping: Make sure to notify hrtimers when TAI offset changes
  hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
  hrtimer: Add expiry time overflow check in hrtimer_interrupt
  timekeeping: Shorten seq_count region
  timekeeping: Implement a shadow timekeeper
  timekeeping: Delay update of clock->cycle_last
  timekeeping: Store cycle_last value in timekeeper struct as well
  ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
  timekeeping: Simplify tai updating from do_adjtimex
  timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
  timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
  ...
2013-04-30 08:15:40 -07:00
Linus Torvalds
8700c95adb Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug changes from Ingo Molnar:
 "This is a pretty large, multi-arch series unifying and generalizing
  the various disjunct pieces of idle routines that architectures have
  historically copied from each other and have grown in random, wildly
  inconsistent and sometimes buggy directions:

   101 files changed, 455 insertions(+), 1328 deletions(-)

  this went through a number of review and test iterations before it was
  committed, it was tested on various architectures, was exposed to
  linux-next for quite some time - nevertheless it might cause problems
  on architectures that don't read the mailing lists and don't regularly
  test linux-next.

  This cat herding excercise was motivated by the -rt kernel, and was
  brought to you by Thomas "the Whip" Gleixner."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  idle: Remove GENERIC_IDLE_LOOP config switch
  um: Use generic idle loop
  ia64: Make sure interrupts enabled when we "safe_halt()"
  sparc: Use generic idle loop
  idle: Remove unused ARCH_HAS_DEFAULT_IDLE
  bfin: Fix typo in arch_cpu_idle()
  xtensa: Use generic idle loop
  x86: Use generic idle loop
  unicore: Use generic idle loop
  tile: Use generic idle loop
  tile: Enter idle with preemption disabled
  sh: Use generic idle loop
  score: Use generic idle loop
  s390: Use generic idle loop
  powerpc: Use generic idle loop
  parisc: Use generic idle loop
  openrisc: Use generic idle loop
  mn10300: Use generic idle loop
  mips: Use generic idle loop
  microblaze: Use generic idle loop
  ...
2013-04-30 07:50:17 -07:00
Linus Torvalds
16fa94b532 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this development cycle were:

   - full dynticks preparatory work by Frederic Weisbecker

   - factor out the cpu time accounting code better, by Li Zefan

   - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

   - various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched: Fix init NOHZ_IDLE flag
  sched: Prevent to re-select dst-cpu in load_balance()
  sched: Rename load_balance_tmpmask to load_balance_mask
  sched: Move up affinity check to mitigate useless redoing overhead
  sched: Don't consider other cpus in our group in case of NEWLY_IDLE
  sched: Explicitly cpu_idle_type checking in rebalance_domains()
  sched: Change position of resched_cpu() in load_balance()
  sched: Fix wrong rq's runnable_avg update with rt tasks
  sched: Document task_struct::personality field
  sched/cpuacct/UML: Fix header file dependency bug on the UML build
  cgroup: Kill subsys.active flag
  sched/cpuacct: No need to check subsys active state
  sched/cpuacct: Initialize cpuacct subsystem earlier
  sched/cpuacct: Initialize root cpuacct earlier
  sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
  sched/cpuacct: Clean up cpuacct.h
  sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
  sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
  sched/cpuacct: Add cpuacct_acount_field()
  sched/cpuacct: Add cpuacct_init()
  ...
2013-04-30 07:43:28 -07:00
Linus Torvalds
e0972916e8 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Features:

   - Add "uretprobes" - an optimization to uprobes, like kretprobes are
     an optimization to kprobes.  "perf probe -x file sym%return" now
     works like kretprobes.  By Oleg Nesterov.

   - Introduce per core aggregation in 'perf stat', from Stephane
     Eranian.

   - Add memory profiling via PEBS, from Stephane Eranian.

   - Event group view for 'annotate' in --stdio, --tui and --gtk, from
     Namhyung Kim.

   - Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.

   - Add Ivy Bridge-EP uncore support, by Zheng Yan

   - IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.

   - Add perf test entries for checking breakpoint overflow signal
     handler issues, from Jiri Olsa.

   - Add perf test entry for for checking number of EXIT events, from
     Namhyung Kim.

   - Add perf test entries for checking --cpu in record and stat, from
     Jiri Olsa.

   - Introduce perf stat --repeat forever, from Frederik Deweerdt.

   - Add --no-demangle to report/top, from Namhyung Kim.

   - PowerPC fixes plus a couple of cleanups/optimizations in uprobes
     and trace_uprobes, by Oleg Nesterov.

  Various fixes and refactorings:

   - Fix dependency of the python binding wrt libtraceevent, from
     Naohiro Aota.

   - Simplify some perf_evlist methods and to allow 'stat' to share code
     with 'record' and 'trace', by Arnaldo Carvalho de Melo.

   - Remove dead code in related to libtraceevent integration, from
     Namhyung Kim.

   - Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
     sched lat' back working, by Arnaldo Carvalho de Melo

   - We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
     de Melo.

   - Kill a bunch of die() calls, from Namhyung Kim.

   - Fix build on non-glibc systems due to libio.h absence, from Cody P
     Schafer.

   - Remove some perf_session and tracing dead code, from David Ahern.

   - Honor parallel jobs, fix from Borislav Petkov

   - Introduce tools/lib/lk library, initially just removing duplication
     among tools/perf and tools/vm.  from Borislav Petkov

  ... and many more I missed to list, see the shortlog and git log for
  more details."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
  perf/x86/intel/P4: Robistify P4 PMU types
  perf/x86/amd: Fix AMD NB and L2I "uncore" support
  perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
  perf/x86: Check all MSRs before passing hw check
  perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
  perf/x86/intel: Add Ivy Bridge-EP uncore support
  perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
  perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
  uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
  uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
  uprobes/tracing: Change create_trace_uprobe() to support uretprobes
  uprobes/tracing: Make seq_printf() code uretprobe-friendly
  uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
  uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
  uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
  uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
  uprobes/tracing: Generalize struct uprobe_trace_entry_head
  uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
  uprobes/tracing: Kill the pointless seq_print_ip_sym() call
  uprobes/tracing: Kill the pointless task_pt_regs() calls
  ...
2013-04-30 07:41:01 -07:00
Linus Torvalds
1f889ec62c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle are mostly related to preparatory work
  for the full-dynticks work:

   - Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ take
     advantage of numbered callbacks, do callback accelerations based on
     numbered callbacks.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/960

   - RCU documentation updates.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/570

   - Miscellaneous fixes.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/594"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  rcu: Make rcu_accelerate_cbs() note need for future grace periods
  rcu: Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()
  rcu: Rename n_nocb_gp_requests to need_future_gp
  rcu: Push lock release to rcu_start_gp()'s callers
  rcu: Repurpose no-CBs event tracing to future-GP events
  rcu: Rearrange locking in rcu_start_gp()
  rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks
  rcu: Accelerate RCU callbacks at grace-period end
  rcu: Export RCU_FAST_NO_HZ parameters to sysfs
  rcu: Distinguish "rcuo" kthreads by RCU flavor
  rcu: Add event tracing for no-CBs CPUs' grace periods
  rcu: Add event tracing for no-CBs CPUs' callback registration
  rcu: Introduce proper blocking to no-CBs kthreads GP waits
  rcu: Provide compile-time control for no-CBs CPUs
  rcu: Tone down debugging during boot-up and shutdown.
  rcu: Add softirq-stall indications to stall-warning messages
  rcu: Documentation update
  rcu: Make bugginess of code sample more evident
  rcu: Fix hlist_bl_set_first_rcu() annotation
  rcu: Delete unused rcu_node "wakemask" field
  ...
2013-04-30 07:39:01 -07:00
Steven Rostedt
6c24499f40 tracing: Fix small merge bug
During the 3.10 merge, a conflict happened and the resolution was
almost, but not quite, correct. An if statement was reversed.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Duh. That was just silly of me  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 07:23:08 -07:00
Dmitry Monakhov
b8d4a5bf6a relay: move remove_buf_file inside relay_close_buf
Currently remove_buf_file callback is called from from kobject
release method. This result in follow issue:
# blktrace -d /dev/sda1 -d /dev/sda -o test

blktrace_setup()
 dir = create_dir()
 rchan = relay_open(dir,...)
 ->create_buf_file_callback
    buf_file  = debugfs_create_file(dir, )

Userspace will open buf_file.
Later we make a decision to stop tracing
blktrace_down()
  relay_close(rhcan)  /* just decrement kobj reference  */
                      /* since it is not zero then callback not called */
  debugfs_remove(dir) /* FAIL due to non empty dir   */

Later user space will close the file and file will be deleted,
but directory still exist.
user_space_close()
 ->file_release
   ->release_buf_file_callback
     ->debugfs_remove(buf_file
## TESTCASE:
# blktrace -d /dev/sda1 -d /dev/sda -o test
# After that blktrace infrastructure will remain broken in
# an unusable state so: blktrace -d /dev/sda1 will not work.

In fact this is general issue, blktrace is just one of examples.
We can not reliably remove parent dir until all users close the
buf_file.

Solution: We don't have to wait that long. File should be deleted inside
relay_close_buf().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-30 13:00:02 +02:00
Linus Torvalds
56847d857c Merge branch 'akpm' (incoming from Andrew)
Merge second batch of fixes from Andrew Morton:

 - various misc bits

 - some printk updates

 - a new "SRAM" driver.

 - MAINTAINERS updates

 - the backlight driver queue

 - checkpatch updates

 - a few init/ changes

 - a huge number of drivers/rtc changes

 - fatfs updates

 - some lib/idr.c work

 - some renaming of the random driver interfaces

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits)
  net: rename random32 to prandom
  net/core: remove duplicate statements by do-while loop
  net/core: rename random32() to prandom_u32()
  net/netfilter: rename random32() to prandom_u32()
  net/sched: rename random32() to prandom_u32()
  net/sunrpc: rename random32() to prandom_u32()
  scsi: rename random32() to prandom_u32()
  lguest: rename random32() to prandom_u32()
  uwb: rename random32() to prandom_u32()
  video/uvesafb: rename random32() to prandom_u32()
  mmc: rename random32() to prandom_u32()
  drbd: rename random32() to prandom_u32()
  kernel/: rename random32() to prandom_u32()
  mm/: rename random32() to prandom_u32()
  lib/: rename random32() to prandom_u32()
  x86: rename random32() to prandom_u32()
  x86: pageattr-test: remove srandom32 call
  uuid: use prandom_bytes()
  raid6test: use prandom_bytes()
  sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
  ...
2013-04-29 19:47:50 -07:00
Linus Torvalds
191a712090 Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Fixes and a lot of cleanups.  Locking cleanup is finally complete.
   cgroup_mutex is no longer exposed to individual controlelrs which
   used to cause nasty deadlock issues.  Li fixed and cleaned up quite a
   bit including long standing ones like racy cgroup_path().

 - device cgroup now supports proper hierarchy thanks to Aristeu.

 - perf_event cgroup now supports proper hierarchy.

 - A new mount option "__DEVEL__sane_behavior" is added.  As indicated
   by the name, this option is to be used for development only at this
   point and generates a warning message when used.  Unfortunately,
   cgroup interface currently has too many brekages and inconsistencies
   to implement a consistent and unified hierarchy on top.  The new flag
   is used to collect the behavior changes which are necessary to
   implement consistent unified hierarchy.  It's likely that this flag
   won't be used verbatim when it becomes ready but will be enabled
   implicitly along with unified hierarchy.

   The option currently disables some of broken behaviors in cgroup core
   and also .use_hierarchy switch in memcg (will be routed through -mm),
   which can be used to make very unusual hierarchy where nesting is
   partially honored.  It will also be used to implement hierarchy
   support for blk-throttle which would be impossible otherwise without
   introducing a full separate set of control knobs.

   This is essentially versioning of interface which isn't very nice but
   at this point I can't see any other options which would allow keeping
   the interface the same while moving towards hierarchy behavior which
   is at least somewhat sane.  The planned unified hierarchy is likely
   to require some level of adaptation from userland anyway, so I think
   it'd be best to take the chance and update the interface such that
   it's supportable in the long term.

   Maintaining the existing interface does complicate cgroup core but
   shouldn't put too much strain on individual controllers and I think
   it'd be manageable for the foreseeable future.  Maybe we'll be able
   to drop it in a decade.

Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
  cpuset: fix compile warning when CONFIG_SMP=n
  cpuset: fix cpu hotplug vs rebuild_sched_domains() race
  cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
  cgroup: restore the call to eventfd->poll()
  cgroup: fix use-after-free when umounting cgroupfs
  cgroup: fix broken file xattrs
  devcg: remove parent_cgroup.
  memcg: force use_hierarchy if sane_behavior
  cgroup: remove cgrp->top_cgroup
  cgroup: introduce sane_behavior mount option
  move cgroupfs_root to include/linux/cgroup.h
  cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
  cgroup: make cgroup_path() not print double slashes
  Revert "cgroup: remove bind() method from cgroup_subsys."
  perf: make perf_event cgroup hierarchical
  cgroup: implement cgroup_is_descendant()
  cgroup: make sure parent won't be destroyed before its children
  cgroup: remove bind() method from cgroup_subsys.
  devcg: remove broken_hierarchy tag
  cgroup: remove cgroup_lock_is_held()
  ...
2013-04-29 19:14:20 -07:00
Linus Torvalds
46d9be3e5e Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "A lot of activities on workqueue side this time.  The changes achieve
  the followings.

   - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
     updated to be able to interface with multiple backend worker pools.
     This involved a lot of churning but the end result seems actually
     neater as unbound workqueues are now a lot closer to per-cpu ones.

   - The ability to interface with multiple backend worker pools are
     used to implement unbound workqueues with custom attributes.
     Currently the supported attributes are the nice level and CPU
     affinity.  It may be expanded to include cgroup association in
     future.  The attributes can be specified either by calling
     apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
     the workqueue in question is exported through sysfs.

     The backend worker pools are keyed by the actual attributes and
     shared by any workqueues which share the same attributes.  When
     attributes of a workqueue are changed, the workqueue binds to the
     worker pool with the specified attributes while leaving the work
     items which are already executing in its previous worker pools
     alone.

     This allows converting custom worker pool implementations which
     want worker attribute tuning to use workqueues.  The writeback pool
     is already converted in block tree and there are a couple others
     are likely to follow including btrfs io workers.

   - WQ_UNBOUND's ability to bind to multiple worker pools is also used
     to make it NUMA-aware.  Because there's no association between work
     item issuer and the specific worker assigned to execute it, before
     this change, using unbound workqueue led to unnecessary cross-node
     bouncing and it couldn't be helped by autonuma as it requires tasks
     to have implicit node affinity and workers are assigned randomly.

     After these changes, an unbound workqueue now binds to multiple
     NUMA-affine worker pools so that queued work items are executed in
     the same node.  This is turned on by default but can be disabled
     system-wide or for individual workqueues.

     Crypto was requesting NUMA affinity as encrypting data across
     different nodes can contribute noticeable overhead and doing it
     per-cpu was too limiting for certain cases and IO throughput could
     be bottlenecked by one CPU being fully occupied while others have
     idle cycles.

  While the new features required a lot of changes including
  restructuring locking, it didn't complicate the execution paths much.
  The unbound workqueue handling is now closer to per-cpu ones and the
  new features are implemented by simply associating a workqueue with
  different sets of backend worker pools without changing queue,
  execution or flush paths.

  As such, even though the amount of change is very high, I feel
  relatively safe in that it isn't likely to cause subtle issues with
  basic correctness of work item execution and handling.  If something
  is wrong, it's likely to show up as being associated with worker pools
  with the wrong attributes or OOPS while workqueue attributes are being
  changed or during CPU hotplug.

  While this creates more backend worker pools, it doesn't add too many
  more workers unless, of course, there are many workqueues with unique
  combinations of attributes.  Assuming everything else is the same,
  NUMA awareness costs an extra worker pool per NUMA node with online
  CPUs.

  There are also a couple things which are being routed outside the
  workqueue tree.

   - block tree pulled in workqueue for-3.10 so that writeback worker
     pool can be converted to unbound workqueue with sysfs control
     exposed.  This simplifies the code, makes writeback workers
     NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

   - The conversion to workqueue means that there's no 1:1 association
     between a specific worker, which makes writeback folks unhappy as
     they want to be able to tell which filesystem caused a problem from
     backtrace on systems with many filesystems mounted.  This is
     resolved by allowing work items to set debug info string which is
     printed when the task is dumped.  As this change involves unifying
     implementations of dump_stack() and friends in arch codes, it's
     being routed through Andrew's -mm tree."

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
  workqueue: use kmem_cache_free() instead of kfree()
  workqueue: avoid false negative WARN_ON() in destroy_workqueue()
  workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  workqueue: implement NUMA affinity for unbound workqueues
  workqueue: introduce put_pwq_unlocked()
  workqueue: introduce numa_pwq_tbl_install()
  workqueue: use NUMA-aware allocation for pool_workqueues
  workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  workqueue: map an unbound workqueues to multiple per-node pool_workqueues
  workqueue: move hot fields of workqueue_struct to the end
  workqueue: make workqueue->name[] fixed len
  workqueue: add workqueue->unbound_attrs
  workqueue: determine NUMA node of workers accourding to the allowed cpumask
  workqueue: drop 'H' from kworker names of unbound worker pools
  workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
  workqueue: fix memory leak in apply_workqueue_attrs()
  workqueue: fix unbound workqueue attrs hashing / comparison
  workqueue: fix race condition in unbound workqueue free path
  workqueue: remove pwq_lock which is no longer used
  ...
2013-04-29 19:07:40 -07:00
Linus Torvalds
ce8aa48929 Merge branch 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull async update from Tejun Heo:
 "This contains three cleanup patches for async from Lai.  All three
  patches are essentially cosmetic."

* 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  async: rename and redefine async_func_ptr
  async: remove unused @node from struct async_domain
  async: simplify lowest_in_progress()
2013-04-29 19:06:59 -07:00
Akinobu Mita
6d65df3325 kernel/: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:42 -07:00
Nicolas Kaiser
0a285317da printk: fix failure to return error in devkmsg_poll()
Error value got overwritten instantly.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:14 -07:00
Thomas Gleixner
d0380e6c3c early_printk: consolidate random copies of identical code
The early console implementations are the same all over the place.  Move
the print function to kernel/printk and get rid of the copies.

[akpm@linux-foundation.org: arch/mips/kernel/early_printk.c needs kernel.h for va_list]
[paul.gortmaker@windriver.com: sh4: make the bios early console support depend on EARLY_PRINTK]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:13 -07:00
zhangwei(Jovi)
07c65f4d1a printk/tracing: rework console tracing
Commit 7ff9554bb5 ("printk: convert byte-buffer to variable-length
record buffer") removed start and end parameters from
call_console_drivers, but those parameters still exist in
include/trace/events/printk.h.

Without start and end parameters handling, printk tracing became more
simple as: trace_console(text, len);

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:13 -07:00
Linus Torvalds
73154383f0 Merge branch 'akpm' (incoming from Andrew)
Merge first batch of fixes from Andrew Morton:

 - A couple of kthread changes

 - A few minor audit patches

 - A number of fbdev patches.  Florian remains AWOL so I'm picking up
   some of these.

 - A few kbuild things

 - ocfs2 updates

 - Almost all of the MM queue

(And in the meantime, I already have the second big batch from Andrew
pending in my mailbox ;^)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits)
  memcg: take reference before releasing rcu_read_lock
  mem hotunplug: fix kfree() of bootmem memory
  mmKconfig: add an option to disable bounce
  mm, nobootmem: do memset() after memblock_reserve()
  mm, nobootmem: clean-up of free_low_memory_core_early()
  fs/buffer.c: remove unnecessary init operation after allocating buffer_head.
  numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU
  mm: fix memory_hotplug.c printk format warning
  mm: swap: mark swap pages writeback before queueing for direct IO
  swap: redirty page if page write fails on swap file
  mm, memcg: give exiting processes access to memory reserves
  thp: fix huge zero page logic for page with pfn == 0
  memcg: avoid accessing memcg after releasing reference
  fs: fix fsync() error reporting
  memblock: fix missing comment of memblock_insert_region()
  mm: Remove unused parameter of pages_correctly_reserved()
  firmware, memmap: fix firmware_map_entry leak
  mm/vmstat: add note on safety of drain_zonestat
  mm: thp: add split tail pages to shrink page list in page reclaim
  mm: allow for outstanding swap writeback accounting
  ...
2013-04-29 17:29:08 -07:00
Yasuaki Ishimatsu
ebff7d8f27 mem hotunplug: fix kfree() of bootmem memory
When hot removing memory presented at boot time, following messages are shown:

  kernel BUG at mm/slub.c:3409!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
  CPU 0
  Pid: 5091, comm: kworker/0:2 Tainted: G        W    3.9.0-rc6+ #15
  RIP: kfree+0x232/0x240
  Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
  Call Trace:
    __release_region+0xd4/0xe0
    __remove_pages+0x52/0x110
    arch_remove_memory+0x89/0xd0
    remove_memory+0xc4/0x100
    acpi_memory_device_remove+0x6d/0xb1
    acpi_device_remove+0x89/0xab
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_device_detach+0x6c/0x70
    acpi_ns_walk_namespace+0x11a/0x250
    acpi_walk_namespace+0xee/0x137
    acpi_bus_trim+0x33/0x7a
    acpi_bus_hot_remove_device+0xc4/0x1a1
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x1f7/0x590
    worker_thread+0x11a/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
  RIP  [<ffffffff811c41d2>] kfree+0x232/0x240
   RSP <ffff88084678d968>

The reason why the messages are shown is to release a resource
structure, allocated by bootmem, by kfree().  So when we release a
resource structure, we should check whether it is allocated by bootmem
or not.

But even if we know a resource structure is allocated by bootmem, we
cannot release it since SLxB cannot treat it.  So for reusing a resource
structure, this patch remembers it by using bootmem_resource as follows:

When releasing a resource structure by free_resource(), free_resource()
checks whether the resource structure is allocated by bootmem or not.
If it is allocated by bootmem, free_resource() adds it to
bootmem_resource.  If it is not allocated by bootmem, free_resource()
release it by kfree().

And when getting a new resource structure by get_resource(),
get_resource() checks whether bootmem_resource has released resource
structures or not.  If there is a released resource structure,
get_resource() returns it.  If there is not a releaed resource
structure, get_resource() returns new resource structure allocated by
kzalloc().

[akpm@linux-foundation.org: s/get_resource/alloc_resource/]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:40 -07:00
Toshi Kani
825f787bb4 resource: add release_mem_region_adjustable()
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource.  This interface adjusts the
matched memory resource accordingly even if the requested region does
not match exactly but still fits into.

This new interface is intended for memory hot-delete.  During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820.  Each memory resource entry usually covers
the whole contigous memory range.  Memory hot-delete request, on the
other hand, may target to a particular range of memory resource, and its
size can be much smaller than the whole contiguous memory.  Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.

This new interface is restrictive (i.e.  release under certain
conditions), which is consistent with other release interfaces,
__release_region() and __release_resource().  Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.

There is no change to the existing interfaces since their restriction is
valid for I/O resources.

[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Toshi Kani
ae8e3a915a resource: add __adjust_resource() for internal use
Add __adjust_resource(), which is called by adjust_resource() internally
after the resource_lock is held.  There is no interface change to
adjust_resource().  This change allows other functions to call
__adjust_resource() internally while the resource_lock is held.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Andrew Shewmaker
4eeab4f558 mm: replace hardcoded 3% with admin_reserve_pages knob
Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
memory reserve to something other than 3%, which may be multiple
gigabytes on large memory systems.  Only about 8MB is necessary to
enable recovery in the default mode, and only a few hundred MB are
required even when overcommit is disabled.

This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

Please see first patch in this series for full background, motivation,
testing, and full changelog.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_admin_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Andrew Shewmaker
c9b1d0981f mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.

Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages).  Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.

user_reserve_pages defaults to min(3% free pages, 128MB)

I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.

This only affects OVERCOMMIT_NEVER mode.

Background

1. user reserve

__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled.  This was done so that a
user could recover if they launched a memory hogging process.  Without the
reserve, a user would easily run into a message such as:

bash: fork: Cannot allocate memory

2. admin reserve

Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes.  This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.

Note that this reserve shrinks, and doesn't guarantee a useful reserve.

Motivation

The two hardcoded memory reserves should be updated to account for current
memory sizes.

Also, the admin reserve would be more useful if it didn't shrink too much.

When the current code was originally written, 1GB was considered
"enterprise".  Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.

I've found that reducing these reserves is especially beneficial for a
specific type of application load:

 * single application system
 * one or few processes (e.g. one per core)
 * allocating all available memory
 * not initializing every page immediately
 * long running

I've run scientific clusters with this sort of load.  A long running job
sometimes failed many hours (weeks of CPU time) into a calculation.  They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode.  These
clusters run diskless and have no swap.

However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.

The effect is less, but still significant when a user starts a job with
one process per core.  I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core.  And
it is similar for other parallel programming frameworks.

Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.

Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.

Risks

* "bash: fork: Cannot allocate memory"

  The downside of the first patch-- which creates a tunable user reserve
  that is only used in overcommit 'never' mode--is that an admin can set
  it so low that a user may not be able to kill their process, even if
  they already have a shell prompt.

  Of course, a user can get in the same predicament with the current 3%
  reserve--they just have to launch processes until 3% becomes negligible.

* root-cant-log-in problem

  The second patch, adding the tunable rootuser_reserve_pages, allows
  the admin to shoot themselves in the foot by setting it too small.  They
  can easily get the system into a state where root-can't-log-in.

  However, the new admin_reserve_kbytes will be safer than the current
  behavior since the hardcoded 3% of available memory reserve can shrink
  to something useless in the case where applications have grabbed all
  available memory.

Alternatives

 * Memory cgroups provide a more flexible way to limit application memory.

   Not everyone wants to set up cgroups or deal with their overhead.

 * We could create a fourth overcommit mode which provides smaller reserves.

   The size of useful reserves may be drastically different depending
   on the whether the system is embedded or enterprise.

 * Force users to initialize all of their memory or use calloc.

   Some users don't want/expect the system to overcommit when they malloc.
   Overcommit 'never' mode is for this scenario, and it should work well.

The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups.  The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure.  The code allows admins to tune
for embedded and enterprise usage.

FAQ

 * How is the root-cant-login problem addressed?
   What happens if admin_reserve_pages is set to 0?

   Root is free to shoot themselves in the foot by setting
   admin_reserve_kbytes too low.

   On x86_64, the minimum useful reserve is:
     8MB for overcommit 'guess'
   128MB for overcommit 'never'

   admin_reserve_pages defaults to min(3% free memory, 8MB)

   So, anyone switching to 'never' mode needs to adjust
   admin_reserve_pages.

 * How do you calculate a minimum useful reserve?

   A user or the admin needs enough memory to login and perform
   recovery operations, which includes, at a minimum:

   sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

   For overcommit 'guess', we can sum resident set sizes (RSS)
   because we only need enough memory to handle what the recovery
   programs will typically use. On x86_64 this is about 8MB.

   For overcommit 'never', we can take the max of their virtual sizes (VSZ)
   and add the sum of their RSS. We use VSZ instead of RSS because mode
   forces us to ensure we can fulfill all of the requested memory allocations--
   even if the programs only use a fraction of what they ask for.
   On x86_64 this is about 128MB.

   When swap is enabled, reserves are useful even when they are as
   small as 10MB, regardless of overcommit mode.

   When both swap and overcommit are disabled, then the admin should
   tune the reserves higher to be absolutley safe. Over 230MB each
   was safest in my testing.

 * What happens if user_reserve_pages is set to 0?

   Note, this only affects overcomitt 'never' mode.

   Then a user will be able to allocate all available memory minus
   admin_reserve_kbytes.

   However, they will easily see a message such as:

   "bash: fork: Cannot allocate memory"

   And they won't be able to recover/kill their application.
   The admin should be able to recover the system if
   admin_reserve_kbytes is set appropriately.

 * What's the difference between overcommit 'guess' and 'never'?

   "Guess" allows an allocation if there are enough free + reclaimable
   pages. It has a hardcoded 3% of free pages reserved for root.

   "Never" allows an allocation if there is enough swap + a configurable
   percentage (default is 50) of physical RAM. It has a hardcoded 3% of
   free pages reserved for root, like "Guess" mode. It also has a
   hardcoded 3% of the current process size reserved for additional
   applications.

 * Why is overcommit 'guess' not suitable even when an app eventually
   writes to every page? It takes free pages, file pages, available
   swap pages, reclaimable slab pages into consideration. In other words,
   these are all pages available, then why isn't overcommit suitable?

   Because it only looks at the present state of the system. It
   does not take into account the memory that other applications have
   malloced, but haven't initialized yet. It overcommits the system.

Test Summary

There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.

Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.

Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.

Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.

With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:

1. maximize user-allocatable memory, running close to the edge of
recoverability

2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system

Test Description

Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.

Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo

In overcommit 'never' mode, memory_ratio=100

Test Results

3.9.0-rc1-mm1

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5432/5432       no     yes             yes
guess        yes    4      5444/5444       1      yes             yes
guess        no     1      5302/5449       no     yes             yes
guess        no     4      -               crash  no              no

never        yes    1      5460/5460       1      yes             yes
never        yes    4      5460/5460       1      yes             yes
never        no     1      5218/5432       no     no              yes
never        no     4      5203/5448       no     no              yes

3.9.0-rc1-mm1-tunablereserves

User and Admin Recovery show their respective reserves, if applicable.

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5419/5419       no     - yes           8MB yes
guess        yes    4      5436/5436       1      - yes           8MB yes
guess        no     1      5440/5440       *      - yes           8MB yes
guess        no     4      -               crash  - no            8MB no

* process would successfully mlock, then the oom killer would pick it

never        yes    1      5446/5446       no     10MB yes        20MB yes
never        yes    4      5456/5456       no     10MB yes        20MB yes
never        no     1      5387/5429       no     128MB no        8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely

never        no     1      5359/5448       no     10MB no         10MB barely

never        no     1      5323/5428       no     0MB no          10MB barely
never        no     1      5332/5428       no     0MB no          50MB yes
never        no     1      5293/5429       no     0MB no          90MB yes

never        no     1      5001/5427       no     230MB yes       338MB yes
never        no     4*     4998/5424       no     230MB yes       338MB yes

* more memtesters were launched, able to allocate approximately another 100MB

Future Work

 - Test larger memory systems.

 - Test an embedded image.

 - Test other architectures.

 - Time malloc microbenchmarks.

 - Would it be useful to be able to set overcommit policy for
   each memory cgroup?

 - Some lines are slightly above 80 chars.
   Perhaps define a macro to convert between pages and kb?
   Other places in the kernel do this.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Andrew Morton
d8f10cb3d3 kernel/cpuset.c: use register_hotmemory_notifier()
Use the new interface, remove one ifdef.  No code size changes.

We could/should have been using __meminit/__meminitdata here but there's
now no point in doing that because all this code is elided at compile time.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Atsushi Kumagai
13ba3fcbbe kexec, vmalloc: export additional vmalloc layer information
Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get
the start address of vmalloc region (vmalloc_start).  The address which
contains vmalloc_start value is represented as below:

  vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)

However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
aren't exported as VMCOREINFO.

So this patch exports them externally with small cleanup.

[akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Joonsoo Kim
f1c4069e1d mm, vmalloc: export vmap_area_list, instead of vmlist
Although our intention is to unexport internal structure entirely, but
there is one exception for kexec.  kexec dumps address of vmlist and
makedumpfile uses this information.

We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile.  For this purpose, we
export vmap_area_list, instead of vmlist.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Josh Triplett
146732ce10 fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n
drop_caches.c provides code only invokable via sysctl, so don't compile it
in when CONFIG_SYSCTL=n.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Michal Hocko
6d2488f64a cgroup: remove css_get_next
Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Jiang Liu
e07cee23e6 mm,kexec: use common help functions to free reserved pages
Use common help functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:31 -07:00
Chen Gang
12b2f117f3 kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees()
audit_trim_trees() calls get_tree().  If a failure occurs we must call
put_tree().

[akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Chen Gang
373e0f3408 kernel/auditfilter.c: tree and watch will memory leak when failure occurs
In audit_data_to_entry() when a failure occurs we must check and free
the tree and watch to avoid a memory leak.

  test:
    plan:
      test command:
        "auditctl -a exit,always -w /etc -F auid=-1"
        (on fedora17, need modify auditctl to let "-w /etc" has effect)
      running:
        under fedora17 x86_64, 2 CPUs 3.20GHz, 2.5GB RAM.
        let 15 auditctl processes continue running at the same time.
      monitor command:
        watch -d -n 1 "cat /proc/meminfo | awk '{print \$2}' \
          | head -n 4 | xargs \
          | awk '{print \"used \",\$1 - \$2 - \$3 - \$4}'"

    result:
      for original version:
        will use up all memory, within 3 hours.
        kill all auditctl, the memory still does not free.
      for new version (apply this patch):
        after 14 hours later, not find issues.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
dde5b7d6e7 audit: remove unnecessary #if CONFIG_AUDIT
The files which include kernel/audit.h are complied only when
CONFIG_AUDIT is set.

Just like audit_pid, there is no need to surround audit_ever_enabled
with CONFIG_AUDIT.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
374c586d95 audit: remove duplicate export of audit_enabled
audit_enabled has already been exported in include/linux/audit.h.  and
kernel/audit.h includes include/linux/audit.h, no need to export
aduit_enabled again in kernel/audit.h

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
13f51e1c3f audit: don't check if kauditd is valid every time
We only need to check if kauditd is valid after we start it, if kauditd
is invalid, we will set kauditd_task to NULL.  So next time, we will
start kauditd again.

It means if kauditd_task is not NULL,it must be valid.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Rakib Mullick
3f68613f39 kernel/auditsc.c: use kzalloc instead of kmalloc+memset
In audit_alloc_context() use kzalloc instead of kmalloc+memset.  Also
rename audit_zero_context() to audit_set_context(), to represent it's
inner workings properly.

[akpm@linux-foundation.org: remove audit_set_context() altogether - fold it into its caller]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Oleg Nesterov
b5c5442bb6 kthread: kill task_get_live_kthread()
task_get_live_kthread() looks confusing and unneeded.  It does
get_task_struct() but only kthread_stop() needs this, it can be called
even if the calller doesn't have a reference when we know that this
kthread can't exit until we do kthread_stop().

kthread_park() and kthread_unpark() do not need get_task_struct(), the
callers already have the reference.  And it can not help if we can race
with the exiting kthread anyway, kthread_park() can hang forever in this
case.

Change kthread_park() and kthread_unpark() to use to_live_kthread(),
change kthread_stop() to do get_task_struct() by hand and remove
task_get_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Oleg Nesterov
4ecdafc808 kthread: introduce to_live_kthread()
"k->vfork_done != NULL" with a barrier() after to_kthread(k) in
task_get_live_kthread(k) looks unclear, and sub-optimal because we load
->vfork_done twice.

All we need is to ensure that we do not return to_kthread(NULL).  Add a
new trivial helper which loads/checks ->vfork_done once, this also looks
more understandable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Linus Torvalds
9e8529afc4 Tracing updates for Linux 3.10
Along with the usual minor fixes and clean ups there are a few major
 changes with this pull request.
 
 1) Multiple buffers for the ftrace facility
 
 This feature has been requested by many people over the last few years.
 I even heard that Google was about to implement it themselves. I finally
 had time and cleaned up the code such that you can now create multiple
 instances of the ftrace buffer and have different events go to different
 buffers. This way, a low frequency event will not be lost in the noise
 of a high frequency event.
 
 Note, currently only events can go to different buffers, the tracers
 (ie. function, function_graph and the latency tracers) still can only
 be written to the main buffer.
 
 2) The function tracer triggers have now been extended.
 
 The function tracer had two triggers. One to enable tracing when a
 function is hit, and one to disable tracing. Now you can record a
 stack trace on a single (or many) function(s), take a snapshot of the
 buffer (copy it to the snapshot buffer), and you can enable or disable
 an event to be traced when a function is hit.
 
 3) A perf clock has been added.
 
 A "perf" clock can be chosen to be used when tracing. This will cause
 ftrace to use the same clock as perf uses, and hopefully this will make
 it easier to interleave the perf and ftrace data for analysis.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRfnTPAAoJEOdOSU1xswtMqYYH/1WIdrwXmxHflErnYkCIr3sU
 QtYae2K5A1HcgiqOvRJrdWMOt016iMx5CaQQyBFM1vvMiPY0sTWRmwNxDfZzz9LN
 10jRvWEzZSLtzl+a9mkFWLEpr5nR/QODOxkWFCnRWscp46sp04LSTxGDYsOnPQZB
 sam/AQ1h4xA+DqDBChm9BDEUEPorGleTlN54LBaCGgSFGvrbF+eAg2s4vHNAQAvQ
 8d5xjSE9zC7J+FqbVxvJTbKI3+EqKL6hMsJKsKfi0SI+FuxBaFMSltXck5zKyTI4
 HpNJzXCmw+v90Tju7oMkPHh6RTbESPCHoGU+wqE52fM6m7oScVeuI/kfc6USwU4=
 =W1n+
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Along with the usual minor fixes and clean ups there are a few major
  changes with this pull request.

   1) Multiple buffers for the ftrace facility

  This feature has been requested by many people over the last few
  years.  I even heard that Google was about to implement it themselves.
  I finally had time and cleaned up the code such that you can now
  create multiple instances of the ftrace buffer and have different
  events go to different buffers.  This way, a low frequency event will
  not be lost in the noise of a high frequency event.

  Note, currently only events can go to different buffers, the tracers
  (ie function, function_graph and the latency tracers) still can only
  be written to the main buffer.

   2) The function tracer triggers have now been extended.

  The function tracer had two triggers.  One to enable tracing when a
  function is hit, and one to disable tracing.  Now you can record a
  stack trace on a single (or many) function(s), take a snapshot of the
  buffer (copy it to the snapshot buffer), and you can enable or disable
  an event to be traced when a function is hit.

   3) A perf clock has been added.

  A "perf" clock can be chosen to be used when tracing.  This will cause
  ftrace to use the same clock as perf uses, and hopefully this will
  make it easier to interleave the perf and ftrace data for analysis."

* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
  tracepoints: Prevent null probe from being added
  tracing: Compare to 1 instead of zero for is_signed_type()
  tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
  ftrace: Get rid of ftrace_profile_bits
  tracing: Check return value of tracing_init_dentry()
  tracing: Get rid of unneeded key calculation in ftrace_hash_move()
  tracing: Reset ftrace_graph_filter_enabled if count is zero
  tracing: Fix off-by-one on allocating stat->pages
  kernel: tracing: Use strlcpy instead of strncpy
  tracing: Update debugfs README file
  tracing: Fix ftrace_dump()
  tracing: Rename trace_event_mutex to trace_event_sem
  tracing: Fix comment about prefix in arch_syscall_match_sym_name()
  tracing: Convert trace_destroy_fields() to static
  tracing: Move find_event_field() into trace_events.c
  tracing: Use TRACE_MAX_PRINT instead of constant
  tracing: Use pr_warn_once instead of open coded implementation
  ring-buffer: Add ring buffer startup selftest
  tracing: Bring Documentation/trace/ftrace.txt up to date
  tracing: Add "perf" trace_clock
  ...

Conflicts:
	kernel/trace/ftrace.c
	kernel/trace/trace.c
2013-04-29 13:55:38 -07:00
Al Viro
8e0bcc7222 fix a leak in /proc/schedstats
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-29 15:41:45 -04:00
Linus Torvalds
916bb6d76d Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking changes from Ingo Molnar:
 "The most noticeable change are mutex speedups from Waiman Long, for
  higher loads.  These scalability changes should be most noticeable on
  larger server systems.

  There are also cleanups, fixes and debuggability improvements."

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Consolidate bug messages into a single print_lockdep_off() function
  lockdep: Print out additional debugging advice when we hit lockdep BUGs
  mutex: Back out architecture specific check for negative mutex count
  mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
  mutex: Make more scalable by doing less atomic operations
  mutex: Move mutex spinning code from sched/core.c back to mutex.c
  locking/rtmutex/tester: Set correct permissions on sysfs files
  lockdep: Remove unnecessary 'hlock_next' variable
2013-04-29 08:21:37 -07:00
Li Zhong
6296ace467 nohz: Protect smp_processor_id() in tick_nohz_task_switch()
I saw following error when testing the latest nohz code on
Power:

[   85.295384] BUG: using smp_processor_id() in preemptible [00000000] code: rsyslogd/3493
[   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
[   85.295402] Call Trace:
[   85.295408] [c0000001fababab0] [c000000000012dc4] .show_stack+0x110/0x25c (unreliable)
[   85.295420] [c0000001fababba0] [c0000000007c4b54] .dump_stack+0x20/0x30
[   85.295430] [c0000001fababc10] [c00000000044eb74] .debug_smp_processor_id+0xf4/0x124
[   85.295438] [c0000001fababca0] [c0000000000d7594] .tick_nohz_task_switch+0x1c/0xb8
[   85.295447] [c0000001fababd20] [c0000000000b9748] .finish_task_switch+0x13c/0x160
[   85.295455] [c0000001fababdb0] [c0000000000bbe50] .schedule_tail+0x50/0x124
[   85.295463] [c0000001fababe30] [c000000000009dc8] .ret_from_fork+0x4/0x54

The code below moves the test into local_irq_save/restore
section to avoid the above complaint.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1367119558.6391.34.camel@ThinkPad-T5421.cn.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-29 13:17:33 +02:00
Li Zefan
2a0010af17 cpuset: fix compile warning when CONFIG_SMP=n
Reported by Fengguang's kbuild test robot:

kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used

Introduced by commit e0e80a02e5
("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
which removed generate_sched_domains() from cpuset_hotplug_workfn().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 19:55:04 -07:00
Rafael J. Wysocki
355c63e5ac Merge branch 'pm-assorted'
* pm-assorted:
  PM / OPP: add documentation to RCU head in struct opp
  PM / sleep: invalidate TEST_CPUS and TEST_CORE support for freeze state
  PM / sleep: add TEST_PLATFORM support for freeze state
2013-04-28 01:54:29 +02:00
Linus Torvalds
e09d13c4c8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "This fix adds missing RCU read protection"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  events: Protect access via task_subsys_state_check()
2013-04-27 10:08:09 -07:00
Li Zefan
5b16c2a493 cpuset: fix cpu hotplug vs rebuild_sched_domains() race
rebuild_sched_domains() might pass doms with offlined cpu to
partition_sched_domains(), which results in an oops:

general protection fault: 0000 [#1] SMP
...
RIP: 0010:[<ffffffff81077a1e>]  [<ffffffff81077a1e>] get_group+0x6e/0x90
...
Call Trace:
 [<ffffffff8107f07c>] build_sched_domains+0x70c/0xcb0
 [<ffffffff8107f2a7>] ? build_sched_domains+0x937/0xcb0
 [<ffffffff81173f64>] ? kfree+0xe4/0x1b0
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff8107f905>] partition_sched_domains+0x2e5/0x470
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff810c9007>] ? generate_sched_domains+0xc7/0x530
 [<ffffffff810c94a8>] rebuild_sched_domains_locked+0x38/0x70
 [<ffffffff810cb4a4>] cpuset_write_resmask+0x1a4/0x500
 [<ffffffff810c8700>] ? cpuset_mount+0xe0/0xe0
 [<ffffffff810c7f50>] ? cpuset_read_u64+0x100/0x100
 [<ffffffff810be890>] ? cgroup_iter_next+0x90/0x90
 [<ffffffff810cb300>] ? cpuset_css_offline+0x70/0x70
 [<ffffffff810c1a73>] cgroup_file_write+0x133/0x2e0
 [<ffffffff8118995b>] vfs_write+0xcb/0x130
 [<ffffffff8118a174>] sys_write+0x64/0xa0

Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Li Zhong
e0e80a02e5 cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()).

This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code looks a little simpler.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Ingo Molnar
a1412ec539 Merge branch 'timers/nohz-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull more full-dynticks updates from Frederic Weisbecker:

 * Get rid of the passive dependency on VIRT_CPU_ACCOUNTING_GEN (finally!)
 * Preparation patch to remove the dependency on CONFIG_64BITS

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-27 09:16:06 +02:00
Li Zefan
7ef70e4873 cgroup: restore the call to eventfd->poll()
I mistakenly removed the call to eventfd->poll() while I was actually
intending to remove the return value...

Calling evenfd->poll() will hook cgroup_event_wake() to the poll
waitqueue, which will be called to unregister eventfd when rmdir a
cgroup or close eventfd.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:03 -07:00
Li Zefan
cc20e01cd6 cgroup: fix use-after-free when umounting cgroupfs
Try:
  # mount -t cgroup xxx /cgroup
  # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

And you might see this:

ida_remove called for id=1 which is not allocated.

It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
and free cgrp->root before ida_simple_removed() is called. What's
worse is we're accessing cgrp->root while it has been freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:02 -07:00
Frederic Weisbecker
c58b0df12a nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
Turn the full dynticks passive dependency on VIRT_CPU_ACCOUNTING_GEN
to an active one.

The full dynticks Kconfig is currently hidden behind the full dynticks
cputime accounting, which is an awkward and counter-intuitive layout:
the user first has to select the dynticks cputime accounting in order
to make the full dynticks feature to be visible.

We definetly want it the other way around. The usual way to perform
this kind of active dependency is use "select" on the depended target.
Now we can't use the Kconfig "select" instruction when the target is
a "choice".

So this patch inspires on how the RCU subsystem Kconfig interact
with its dependencies on SMP and PREEMPT: we make sure that cputime
accounting can't propose another option than VIRT_CPU_ACCOUNTING_GEN
when NO_HZ_FULL is selected by using the right "depends on" instruction
for each cputime accounting choices.

v2: Keep full dynticks cputime accounting available even without
full dynticks, as per Paul McKenney's suggestion.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-26 18:56:59 +02:00
Vincent Guittot
25f55d9d01 sched: Fix init NOHZ_IDLE flag
On my SMP platform which is made of 5 cores in 2 clusters, I
have the nr_busy_cpu field of sched_group_power struct that is
not null when the platform is fully idle - which makes the
scheduler unhappy.

The root cause is:

During the boot sequence, some CPUs reach the idle loop and set
their NOHZ_IDLE flag while waiting for others CPUs to boot. But
the nr_busy_cpus field is initialized later with the assumption
that all CPUs are in the busy state whereas some CPUs have
already set their NOHZ_IDLE flag.

More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created in order to ensure that NOHZ_IDLE and
nr_busy_cpus are aligned.

This condition can be ensured by adding a synchronize_rcu()
between the destruction of old sched_domains and the creation of
new ones so the NOHZ_IDLE flag will not be updated with old
sched_domain once it has been initialized. But this solution
introduces a additionnal latency in the rebuild sequence that is
called during cpu hotplug.

As suggested by Frederic Weisbecker, another solution is to have
the same rcu lifecycle for both NOHZ_IDLE and sched_domain
struct. A new nohz_idle field is added to sched_domain so both
status and sched_domain will share the same RCU lifecycle and
will be always synchronized. In addition, there is no more need
to protect nohz_idle against concurrent access as it is only
modified by 2 exclusive functions called by local cpu.

This solution has been prefered to the creation of a new struct
with an extra pointer indirection for sched_domain.

The synchronization is done at the cost of :

 - An additional indirection and a rcu_dereference for accessing nohz_idle.
 - We use only the nohz_idle field of the top sched_domain.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Cc: pjt@google.com
Cc: rostedt@goodmis.org
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
[ Fixed !NO_HZ build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 12:13:44 +02:00
Ingo Molnar
47aa8b6cbc nohz: Reduce overhead under high-freq idling patterns
One testbox of mine (Intel Nehalem, 16-way) uses MWAIT for its idle routine,
which apparently can break out of its idle loop rather frequently, with
high frequency.

In that case NO_HZ_FULL=y kernels show high ksoftirqd overhead and constant
context switching, because tick_nohz_stop_sched_tick() will, if
delta_jiffies == 0, mis-identify this as a timer event - activating the
TIMER_SOFTIRQ, which wakes up ksoftirqd.

Fix this by treating delta_jiffies == 0 the same way we treat other short
wakeups, delta_jiffies == 1.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 10:13:42 +02:00
Dave Jones
2c52283662 lockdep: Consolidate bug messages into a single print_lockdep_off() function
Also add some missing printk levels.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130425174002.GA26769@redhat.com
[ Tweaked the messages a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:37:22 +02:00
Dave Jones
199e371f59 lockdep: Print out additional debugging advice when we hit lockdep BUGs
We occasionally get reports of these BUGs being hit, and the
stack trace doesn't necessarily always tell us what we need to
know about why we are hitting those limits.

If users start attaching /proc/lock_stats to reports we may have
more of a clue what's going on.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:36:33 +02:00
Thomas Gleixner
6f7a05d701 clockevents: Set dummy handler on CPU_DEAD shutdown
Vitaliy reported that a per cpu HPET timer interrupt crashes the
system during hibernation. What happens is that the per cpu HPET timer
gets shut down when the nonboot cpus are stopped. When the nonboot
cpus are onlined again the HPET code sets up the MSI interrupt which
fires before the clock event device is registered. The event handler
is still set to hrtimer_interrupt, which then crashes the machine due
to highres mode not being active.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333

There is no real good way to avoid that in the HPET code. The HPET
code alrady has a mechanism to detect spurious interrupts when event
handler == NULL for a similar reason.

We can handle that in the clockevent/tick layer and replace the
previous functional handler with a dummy handler like we do in
tick_setup_new_device().

The original clockevents code did this in clockevents_exchange_device(),
but that got removed by commit 7c1e76897 (clockevents: prevent
clockevent event_handler ending up handler_noop) which forgot to fix
it up in tick_shutdown(). Same issue with the broadcast device.

Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Cc: 700333@bugs.debian.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-25 13:57:04 +02:00
Thomas Gleixner
6402c7dc2a Merge branch 'linus' into timers/core
Reason: Get upstream fixes before adding conflicting code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 20:33:54 +02:00
Frederic Weisbecker
65e709dc0c nohz: Remove full dynticks' superfluous dependency on RCU tree
Remove the dependency on (TREE_RCU || TREE_PREEMPT_RCU). The full
dynticks option already depends on SMP which implies
(whatever flavour of) RCU tree config anyway.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 16:17:36 +02:00
Joonsoo Kim
e02e60c109 sched: Prevent to re-select dst-cpu in load_balance()
Commit 88b8dac0 makes load_balance() consider other cpus in its
group. But, in that, there is no code for preventing to
re-select dst-cpu. So, same dst-cpu can be selected over and
over.

This patch add functionality to load_balance() in order to
exclude cpu which is selected once. We prevent to re-select
dst_cpu via env's cpus, so now, env's cpus is a candidate not
only for src_cpus, but also dst_cpus.

With this patch, we can remove lb_iterations and
max_lb_iterations, because we decide whether we can go ahead or
not via env's cpus.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:46 +02:00
Joonsoo Kim
e6252c3ef4 sched: Rename load_balance_tmpmask to load_balance_mask
This name doesn't represent specific meaning.
So rename it to imply it's purpose.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:45 +02:00
Joonsoo Kim
d31980846f sched: Move up affinity check to mitigate useless redoing overhead
Currently, LBF_ALL_PINNED is cleared after affinity check is
passed. So, if task migration is skipped by small load value or
small imbalance value in move_tasks(), we don't clear
LBF_ALL_PINNED. At last, we trigger 'redo' in load_balance().

Imbalance value is often so small that any tasks cannot be moved
to other cpus and, of course, this situation may be continued
after we change the target cpu. So this patch move up affinity
check code and clear LBF_ALL_PINNED before evaluating load value
in order to mitigate useless redoing overhead.

In addition, re-order some comments correctly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
cfc0311804 sched: Don't consider other cpus in our group in case of NEWLY_IDLE
Commit 88b8dac0 makes load_balance() consider other cpus in its
group, regardless of idle type. When we do NEWLY_IDLE balancing,
we should not consider it, because a motivation of NEWLY_IDLE
balancing is to turn this cpu to non idle state if needed. This
is not the case of other cpus. So, change code not to consider
other cpus for NEWLY_IDLE balancing.

With this patch, assign 'if (pulled_task) this_rq->idle_stamp =
0' in idle_balance() is corrected, because NEWLY_IDLE balancing
doesn't consider other cpus. Assigning to 'this_rq->idle_stamp'
is now valid.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
de5eb2dd7f sched: Explicitly cpu_idle_type checking in rebalance_domains()
After commit 88b8dac0, dst-cpu can be changed in load_balance(),
then we can't know cpu_idle_type of dst-cpu when load_balance()
return positive. So, add explicit cpu_idle_type checking.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
Joonsoo Kim
f1cd085810 sched: Change position of resched_cpu() in load_balance()
cur_ld_moved is reset if env.flags hit LBF_NEED_BREAK.
So, there is possibility that we miss doing resched_cpu().
Correct it as changing position of resched_cpu()
before checking LBF_NEED_BREAK.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
David S. Miller
6e0895c2ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	drivers/net/ethernet/intel/igb/igb_main.c
	drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
	include/net/scm.h
	net/batman-adv/routing.c
	net/ipv4/tcp_input.c

The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.

The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.

An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.

Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.

Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.

Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 20:32:51 -04:00
Frederic Weisbecker
cb41a29076 nohz: Add basic tracing
It's not obvious to find out why the full dynticks subsystem
doesn't always stop the tick: whether this is due to kthreads,
posix timers, perf events, etc...

These new tracepoints are here to help the user diagnose
the failures and test this feature.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 23:03:09 +02:00
Frederic Weisbecker
0637e02939 nohz: Select wide RCU nocb for full dynticks
It makes testing and implementation much easier as we
know in advance that all CPUs are RCU nocbs.

Also this prepares to remove the dynamic check for
nohz_full= boot mask to be a subset of rcu_nocbs=

Eventually this should also help removing the requirement
for the boot CPU to be outside the full dynticks range.

Suggested-by: Christoph Lameter <cl@linux.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 23:02:26 +02:00
Frederic Weisbecker
67826eae8c nohz: Disable the tick when irq resume in full dynticks CPU
Eventually try to disable tick on irq exit, now that the
fundamental infrastructure is in place.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:09 +02:00
Frederic Weisbecker
99e5ada940 nohz: Re-evaluate the tick for the new task after a context switch
When a task is scheduled in, it may have some properties
of its own that could make the CPU reconsider the need for
the tick: posix cpu timers, perf events, ...

So notify the full dynticks subsystem when a task gets
scheduled in and re-check the tick dependency at this
stage. This is done through a self IPI to avoid messing
up with any current lock scenario.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:07 +02:00
Frederic Weisbecker
5811d9963e nohz: Prepare to stop the tick on irq exit
Interrupt exit is a natural place to stop the tick: it happens
after all events happening before and during the irq which
are liable to update the dependency on the tick occured. Also
it makes sure that any check on tick dependency is well ordered
against dynticks kick IPIs.

Bring in the infrastructure that performs the tick dependency
checks on irq exit and shut it down if these checks show that we
can do it safely.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:05 +02:00
Frederic Weisbecker
9014c45d9e nohz: Implement full dynticks kick
Implement the full dynticks kick that is performed from
IPIs sent by various subsystems (scheduler, posix timers, ...)
when they want to notify about a new event that may
reconsider the dependency on the tick.

Most of the time, such an event end up restarting the tick.

(Part of the design with subsystems providing *_can_stop_tick()
helpers suggested by Peter Zijlstra a while ago).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:28:04 +02:00
Thomas Gleixner
77c675ba18 timekeeping: Update tk->cycle_last in resume
commit 7ec98e15aa (timekeeping: Delay update of clock->cycle_last)
forgot to update tk->cycle_last in the resume path. This results in a
stale value versus clock->cycle_last and prevents resume in the worst
case.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Linux-pm mailing list <linux-pm@lists.linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304211648150.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:17:51 +02:00
Frederic Weisbecker
ff442c51f6 nohz: Re-evaluate the tick from the scheduler IPI
The scheduler IPI is used by the scheduler to kick
full dynticks CPUs asynchronously when more than one
task are running or when a new timer list timer is
enqueued. This way the destination CPU can decide
to restart the tick to handle this new situation.

Now let's call that kick in the scheduler IPI.

(Reusing the scheduler IPI rather than implementing
a new IPI was suggested by Peter Zijlstra a while ago)

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:16:04 +02:00
Frederic Weisbecker
ce831b38ca sched: New helper to prevent from stopping the tick in full dynticks
Provide a new helper to be called from the full dynticks engine
before stopping the tick in order to make sure we don't stop
it when there is more than one task running on the CPU.

This way we make sure that the tick stays alive to maintain
fairness.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:08:04 +02:00
Frederic Weisbecker
9f3660c2c1 sched: Kick full dynticks CPU that have more than one task enqueued.
Kick the tick on full dynticks CPUs when they get more
than one task running on their queue. This makes sure that
local fairness is maintained by the tick on the destination.

This is done regardless of these tasks' class. We should
be able to be more clever in the future depending on these. eg:
a CPU that runs a SCHED_FIFO task doesn't need to maintain
fairness against local pending tasks of the fair class.

But keep things simple for now.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:06:54 +02:00
Frederic Weisbecker
026249ef10 perf: New helper to prevent full dynticks CPUs from stopping tick
Provide a new helper that help full dynticks CPUs to prevent
from stopping their tick in case there are events in the local
rotation list.

This way we make sure that perf_event_task_tick() is serviced
on demand.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 19:59:39 +02:00
Frederic Weisbecker
12351ef8f9 perf: Kick full dynticks CPU if events rotation is needed
Kick the current CPU's tick by sending it a self IPI when
an event is queued on the rotation list and it is the first
element inserted. This makes sure that perf_event_task_tick()
works on full dynticks CPUs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 19:59:37 +02:00
Frederic Weisbecker
6ac29178b4 posix_timers: Fix pre-condition to stop the tick on full dynticks
The test that checks if a CPU can stop its tick from posix CPU
timers angle was mistakenly inverted.

What we want is to prevent the tick from being stopped as long
as the current CPU's task runs a posix CPU timer.

Fix this.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 19:59:25 +02:00
Rusty Russell
f83b293366 kernel/hz.bc: ignore.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-22 07:09:06 -07:00
Linus Torvalds
3125929454 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix offcore_rsp valid mask for SNB/IVB
  perf: Treat attr.config as u64 in perf_swevent_init()
2013-04-21 10:25:42 -07:00
Vincent Guittot
642dbc39ab sched: Fix wrong rq's runnable_avg update with rt tasks
The current update of the rq's load can be erroneous when RT
tasks are involved.

The update of the load of a rq that becomes idle, is done only
if the avg_idle is less than sysctl_sched_migration_cost. If RT
tasks and short idle duration alternate, the runnable_avg will
not be updated correctly and the time will be accounted as idle
time when a CFS task wakes up.

A new idle_enter function is called when the next task is the
idle function so the elapsed time will be accounted as run time
in the load of the rq, whatever the average idle time is. The
function update_rq_runnable_avg is removed from idle_balance.

When a RT task is scheduled on an idle CPU, the update of the
rq's load is not done when the rq exit idle state because CFS's
functions are not called. Then, the idle_balance, which is
called just before entering the idle function, updates the rq's
load and makes the assumption that the elapsed time since the
last update, was only running time.

As a consequence, the rq's load of a CPU that only runs a
periodic RT task, is close to LOAD_AVG_MAX whatever the running
duration of the RT task is.

A new idle_exit function is called when the prev task is the
idle function so the elapsed time will be accounted as idle time
in the rq's load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: pjt@google.com
Cc: fweisbec@gmail.com
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 11:22:52 +02:00
Paul E. McKenney
c79aa0d965 events: Protect access via task_subsys_state_check()
The following RCU splat indicates lack of RCU protection:

[  953.267649] ===============================
[  953.267652] [ INFO: suspicious RCU usage. ]
[  953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
[  953.267661] -------------------------------
[  953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
[  953.267669]
[  953.267669] other info that might help us debug this:
[  953.267669]
[  953.267675]
[  953.267675] rcu_scheduler_active = 1, debug_locks = 0
[  953.267680] 1 lock held by glxgears/1289:
[  953.267683]  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0
[  953.267700]
[  953.267700] stack backtrace:
[  953.267704] Call Trace:
[  953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
[  953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
[  953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
[  953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
[  953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
[  953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
...

This commit therefore adds the required RCU read-side critical
section to perf_event_comm().

Reported-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Gustavo Luiz Duarte <gusld@br.ibm.com>
2013-04-21 11:21:39 +02:00
Ingo Molnar
a166fcf04d Merge branch 'timers/nohz-posix-timers-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull posix cpu timers handling on full dynticks from Frederic Weisbecker.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 11:05:47 +02:00
Ingo Molnar
73e21ce28d Merge branch 'perf/urgent' into perf/core
Conflicts:
	arch/x86/kernel/cpu/perf_event_intel.c

Merge in the latest fixes before applying new patches, resolve the conflict.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 10:57:33 +02:00
Linus Torvalds
830ac8524f Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kdump fixes from Peter Anvin:
 "The kexec/kdump people have found several problems with the support
  for loading over 4 GiB that was introduced in this merge cycle.  This
  is partly due to a number of design problems inherent in the way the
  various pieces of kdump fit together (it is pretty horrifically manual
  in many places.)

  After a *lot* of iterations this is the patchset that was agreed upon,
  but of course it is now very late in the cycle.  However, because it
  changes both the syntax and semantics of the crashkernel option, it
  would be desirable to avoid a stable release with the broken
  interfaces."

I'm not happy with the timing, since originally the plan was to release
the final 3.9 tomorrow.  But apparently I'm doing an -rc8 instead...

* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kexec: use Crash kernel for Crash kernel low
  x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
  x86, kdump: Retore crashkernel= to allocate under 896M
  x86, kdump: Set crashkernel_low automatically
2013-04-20 18:40:36 -07:00
Sahara
4c69e6ea41 tracepoints: Prevent null probe from being added
Somehow tracepoint_entry_add_probe() function allows a null probe function.
And, this may lead to unexpected results since the number of probe
functions in an entry can be counted by checking whether a probe is null
or not in the for-loop.
This patch prevents a null probe from being added.
In tracepoint_entry_remove_probe() function, checking probe parameter
within the for-loop is moved out for code efficiency, leaving the null probe
feature which removes all probe functions in the entry.

Link: http://lkml.kernel.org/r/1365991995-19445-1-git-send-email-kpark3469@gmail.com

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-19 19:59:49 -04:00
Frederic Weisbecker
555347f6c0 posix_timers: New API to prevent from stopping the tick when timers are running
Bring a new helper that the full dynticks infrastructure can
call in order to know if it can safely stop the tick from
the posix cpu timers subsystem point of view.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 16:31:40 +02:00
Frederic Weisbecker
a85721601a posix_timers: Kick full dynticks CPUs when a posix cpu timer is armed
Kick the full dynticks CPUs when a posix cpu timer is enqueued by
way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer().
This also include rescheduled firing timers.

This way they can re-evaluate the state of (and possibly restart) their
tick against the new expiry modification.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 16:30:13 +02:00
Frederic Weisbecker
f98823ac75 nohz: New option to default all CPUs in full dynticks range
Provide a new kernel config that defaults all CPUs to be part
of the full dynticks range, except the boot one for timekeeping.

This default setting is overriden by the nohz_full= boot option
if passed by the user.

This is helpful for those who don't need a finegrained range
of full dynticks CPU and also for automated testing.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:54:16 +02:00
Frederic Weisbecker
d1e43fa5f8 nohz: Ensure full dynticks CPUs are RCU nocbs
We need full dynticks CPU to also be RCU nocb so
that we don't have to keep the tick to handle RCU
callbacks.

Make sure the range passed to nohz_full= boot
parameter is a subset of rcu_nocbs=

The CPUs that fail to meet this requirement will be
excluded from the nohz_full range. This is checked
early in boot time, before any CPU has the opportunity
to stop its tick.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:54:04 +02:00
Frederic Weisbecker
0453b435df nohz: Force boot CPU outside full dynticks range
The timekeeping job must be able to run early on boot
because there may be some pre-SMP (and thus pre-initcalls )
components that rely on it. The IO-APIC is one such users
as it tests the timer health by watching jiffies progression.

Given that it happens before we know the initial online
set, we can't rely on it to select a timekeeper. We need
one before SMP time otherwise we simply crash on boot.

To fix this and keep things simple for now, force the boot CPU
outside of the full dynticks range in any case and do this early
on kernel parameter parsing time.

We might want a trickier solution later, expecially for aSMP
architectures that need to assign housekeeping tasks to arbitrary
low power CPUs.

But it's still first pass KISS time for now.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:53:14 +02:00
Waiman Long
cc189d2513 mutex: Back out architecture specific check for negative mutex count
Linus suggested that probably all the supported architectures can
allow a negative mutex count without incorrect behavior, so we can
then back out the architecture specific change and allow the
mutex count to go to any negative number. That should further
reduce contention for non-x86 architecture.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
2bd2c92cf0 mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option
turned on) allow multiple tasks to spin on a single mutex
concurrently. A potential problem with the current approach is
that when the mutex becomes available, all the spinning tasks
will try to acquire the mutex more or less simultaneously. As a
result, there will be a lot of cacheline bouncing especially on
systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting
the mutex spinners into a queue so that only the first one in
the queue will try to acquire the mutex. This will reduce
contention and allow all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation.
This patch will add a new field into the mutex data structure
for holding the MCS lock. This expands the mutex size by 8 bytes
for 64-bit system and 4 bytes for 32-bit system. This overhead
will be avoid if the MUTEX_SPIN_ON_OWNER option is turned off.

The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the fserver
workloads to 1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
|  Configuration  | Mean JPM  | Mean JPM  |  Mean JPM   | % Change |
|                 | w/o patch | patch 1   | patches 1&2 |  1->1&2  |
+-----------------+------------------------------------------------+
|                 |              User Range 1100 - 2000            |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
| 4 nodes, HT off |  393503   |  381558   |   394650    |   +3.4%  |
| 2 nodes, HT off |  334957   |  325240   |   338853    |   +4.2%  |
| 1 node , HT off |  198141   |  197972   |   198075    |   +0.1%  |
+-----------------+------------------------------------------------+
|                 |              User Range 200 - 1000             |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  282325   |  312870   |   332185    |   +6.2%  |
| 4 nodes, HT off |  390698   |  378279   |   393419    |   +4.0%  |
| 2 nodes, HT off |  336986   |  326543   |   340260    |   +4.2%  |
| 1 node , HT off |  197588   |  197622   |   197582    |    0.0%  |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable
change in performance.  Rather, there is a slight drop in
performance. This mutex spinning patch more than recovers the
lost performance and show a significant increase of +30% at high
user load with the full 8 nodes. Similar improvements were also
seen in a 3.8 kernel.

The table below shows the %time spent by different kernel
functions as reported by perf when running the fserver workload
at 1500 users with all 8 nodes.

+-----------------------+-----------+---------+-------------+
|        Function       |  % time   | % time  |   % time    |
|                       | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
| __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
| mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
| mspin_lock            |    N/A    |   N/A   |    9.90%    |
| __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
| _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as
shown by the __mutex_lock_slowpath figure) which help a little
bit. We saw only a few percents improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner
figure is now split out into an additional mspin_lock figure.
The time increases from 3.42% to 11.23%. It shows a great
reduction in contention among the spinners leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there
are on average 4+ spinners waiting in the spin_lock loop for
each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.

The table below shows the performance change of both patches 1 &
2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |      0.0%     |     -0.8%      |     +0.6%       |
| five_sec     |     -0.3%     |     +0.8%      |     +0.8%       |
| high_systime |     +0.4%     |     +2.4%      |     +2.1%       |
| new_fserver  |     +0.1%     |    +14.1%      |    +34.2%       |
| shared       |     -0.5%     |     -0.3%      |     -0.4%       |
| short        |     -1.7%     |     -9.8%      |     -8.3%       |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in
performance probably due to the spinner locking and queuing
overhead.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
0dc8c730c9 mutex: Make more scalable by doing less atomic operations
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a
total of three atomic read-modify-write instructions will be
issued in rapid succession. This can cause a lot of cache
bouncing when many tasks are trying to acquire the mutex at the
same time.

This patch will reduce the number of atomic_xchg instructions
used by checking the counter value first before issuing the
instruction. The atomic_read() function is just a simple memory
read. The atomic_xchg() function, on the other hand, can be up
to 2 order of magnitude or even more in cost when compared with
atomic_read(). By using atomic_read() to check the value first
before calling atomic_xchg(), we can avoid a lot of unnecessary
cache coherency traffic. The only downside with this change is
that a task on the slow path will have a tiny bit less chance of
getting the mutex when competing with another task in the fast
path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed
before calling atomic_cmpxchg().

The mutex locking and unlocking code for the x86 architecture
can allow any negative number to be used in the mutex count to
indicate that some tasks are waiting for the mutex. I am not so
sure if that is the case for the other architectures. So the
default is to avoid atomic_xchg() if the count has already been
set to -1. For x86, the check is modified to include all
negative numbers to cover a larger case.

The following table shows the jobs per minutes (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the
high_systime workloads to 1/2/4/8 nodes with hyperthreading on
and off.

+-----------------+-----------+------------+----------+
|  Configuration  | Mean JPM  |  Mean JPM  | % Change |
|		  | w/o patch | with patch |	      |
+-----------------+-----------------------------------+
|		  |      User Range 1100 - 2000	      |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |    36980   |   148590  | +301.8%  |
| 8 nodes, HT off |    42799   |   145011  | +238.8%  |
| 4 nodes, HT on  |    61318   |   118445  |  +51.1%  |
| 4 nodes, HT off |   158481   |   158592  |   +0.1%  |
| 2 nodes, HT on  |   180602   |   173967  |   -3.7%  |
| 2 nodes, HT off |   198409   |   198073  |   -0.2%  |
| 1 node , HT on  |   149042   |   147671  |   -0.9%  |
| 1 node , HT off |   126036   |   126533  |   +0.4%  |
+-----------------+-----------------------------------+
|		  |       User Range 200 - 1000	      |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   41525    |   122349  | +194.6%  |
| 8 nodes, HT off |   49866    |   124032  | +148.7%  |
| 4 nodes, HT on  |   66409    |   106984  |  +61.1%  |
| 4 nodes, HT off |  119880    |   130508  |   +8.9%  |
| 2 nodes, HT on  |  138003    |   133948  |   -2.9%  |
| 2 nodes, HT off |  132792    |   131997  |   -0.6%  |
| 1 node , HT on  |  116593    |   115859  |   -0.6%  |
| 1 node , HT off |  104499    |   104597  |   +0.1%  |
+-----------------+------------+-----------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to
random nature of the subtests executed. So a difference of less
than +-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant
drop-off at high node count.  The patch has practically no
impact on 1 and 2 nodes system.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function
by the high_systime workload at 1500 users for 2/4/8-node
configurations with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%	  |      1.02%	     |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below show the improvements in other AIM7 workloads
(at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton: Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:35 +02:00
Waiman Long
41fcb9f230 mutex: Move mutex spinning code from sched/core.c back to mutex.c
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
move the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c which is where they should belong.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:34 +02:00
Li Zefan
712317ad97 cgroup: fix broken file xattrs
We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is object instance of cftype.

For example each cgroup has a tasks file, and each tasks file is
associated with a uniq cfent, but all those files share the same
struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

  # mount -t cgroup -o xattr /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/test
  # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
  # rmdir /sys/fs/cgroup/test
  # umount /sys/fs/cgroup
  oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-18 23:11:40 -07:00
Linus Torvalds
6835039d7e Merge branch 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux
Pull user-namespace fixes from Andy Lutomirski.

* 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  userns: Changing any namespace id mappings should require privileges
  userns: Check uid_map's opener's fsuid, not the current fsuid
  userns: Don't let unprivileged users trick privileged users into setting the id_map
2013-04-18 18:09:12 -07:00
Frederic Weisbecker
76c24fb054 nohz: New APIs to re-evaluate the tick on full dynticks CPUs
Provide two new helpers in order to notify the full dynticks CPUs about
some internal system changes against which they may reconsider the state
of their tick. Some practical examples include: posix cpu timers, perf tick
and sched clock tick.

For now the notifying handler, implemented through IPIs, is a stub
that will be implemented when we get the tick stop/restart infrastructure
in.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-18 18:53:34 +02:00
Linus Torvalds
0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Masami Hiramatsu
5c51543b0a kprobes: Fix a double lock bug of kprobe_mutex
Fix a double locking bug caused when debug.kprobe-optimization=0.
While the proc_kprobes_optimization_handler locks kprobe_mutex,
wait_for_kprobe_optimizer locks it again and that causes a double lock.
To fix the bug, this introduces different mutex for protecting
sysctl parameter and locks it in proc_kprobes_optimization_handler.
Of course, since we need to lock kprobe_mutex when touching kprobes
resources, that is done in *optimize_all_kprobes().

This bug was introduced by commit ad72b3bea7 ("kprobes: fix
wait_for_kprobe_optimizer()")

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 08:58:38 -07:00
Thomas Gleixner
d2054b2c11 posix-timers: Remove unused variable
Remove the unused variable *node introduced by commit 5ed67f05 (posix
timers: Allocate timer id per process)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
2013-04-18 12:51:19 +02:00
Emese Revfy
b9e146d8eb kernel/signal.c: stop info leak via the tkill and the tgkill syscalls
This fixes a kernel memory contents leak via the tkill and tgkill syscalls
for compat processes.

This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field
when handling signals delivered from tkill.

The place of the infoleak:

int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
{
        ...
        put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
        ...
}

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Reviewed-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:45 -07:00
Yinghai Lu
157752d84f kexec: use Crash kernel for Crash kernel low
We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem
instead.

So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem.

Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:34 -07:00
Yinghai Lu
adbc742bf7 x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_hign=X crashkernel_low=Y. As that could be extensible.

-v2: according to Vivek, change delimiter to ;
-v3: let hign and low only handle simple form and it conforms to
	description in kernel-parameters.txt
     still keep crashkernel=X override any crashkernel=X,high
        crashkernel=Y,low
-v4: update get_last_crashkernel returning and add more strict
     checking in parse_crashkernel_simple() found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
     also separate parse_suffix from parse_simper according to vivek.
	so we can avoid @pos in that path.
-v6: Tight the checking about crashkernel=X,highblahblah,high
     found by HTYAYAMA.

Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Yinghai Lu
55a20ee780 x86, kdump: Retore crashkernel= to allocate under 896M
Vivek found old kexec-tools does not work new kernel anymore.

So change back crashkernel= back to old behavoir, and add crashkernel_high=
to let user decide if buffer could be above 4G, and also new kexec-tools will
be needed.

-v2: let crashkernel=X override crashkernel_high=
    update description about _high will be ignored by crashkernel=X
-v3: update description about kernel-parameters.txt according to Vivek.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Stephen Boyd
c038c1c441 clockevents: Switch into oneshot mode even if broadcast registered late
tick_oneshot_notify() is used to notify a particular CPU to try
to switch into oneshot mode after a oneshot capable tick device
is registered and tick_clock_notify() is used to notify all CPUs
to try to switch into oneshot mode after a high res clocksource
is registered. There is one caveat; if the tick devices suffer
from FEAT_C3_STOP we don't try to switch into oneshot mode unless
we have a oneshot capable broadcast device already registered.

If the broadcast device is registered after the tick devices that
have FEAT_C3_STOP we'll never try to switch into oneshot mode
again, causing us to be stuck in periodic mode forever. Avoid
this scenario by calling tick_clock_notify() after we register
the broadcast device so that we try to switch into oneshot mode
on all CPUs one more time.

[ tglx: Adopted to timers/core and added a comment ]

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1366219566-29783-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 21:30:56 +02:00
Nathan Zimmer
b3956a896e timer_list: Convert timer list to be a proper seq_file
When running with 4096 cores attemping to read /proc/timer_list will fail
with an ENOMEM condition.  On a sufficantly large systems the total amount
of data is more then 4mb, so it won't fit into a single buffer.  The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.

Convert /proc/timer_list to a proper seq_file with its own iterator.  This
is a little more complex given that we have to make two passes with two
separate headers.

sysrq_timer_list_show also needed to be updated to reflect the fact that
now timer_list_show only does one cpu at at time.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-3-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:02 +02:00
Nathan Zimmer
60cf7ea849 timer_list: Split timer_list_show_tickdevices
Split timer_list_show_tickdevices() into the header printout and pull
the rest up to timer_list_show. This is a preparatory patch for
converting timer_list to a proper seqfile with its own iterator

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-2-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:02 +02:00
Pavel Emelyanov
5ed67f05f6 posix timers: Allocate timer id per process (v2)
Currently kernel generates IDs for posix timers in a global manner --
there's a kernel-wide IDR tree from which IDs are created. This makes
it impossible to recreate a timer with a desired ID (in particular
this is done by the CRIU checkpoint-restore project) -- since these
IDs are global it may happen, that at the time we recreate a timer, the
ID we want for it is already busy by some other timer.

In order to address this, replace the IDR tree with a global hash
table for timers and makes timer IDs unique per signal_struct (to
which timers are linked anyway). With this, two timers belonging to
different processes may have equal IDs and we can recreate either of
them with the ID we want.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513D9FF5.9010004@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:01 +02:00
Thomas Gleixner
d190e8195b idle: Remove GENERIC_IDLE_LOOP config switch
All archs are converted over. Remove the config switch and the
fallback code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 10:39:38 +02:00
Rusty Russell
944a1fa012 module: don't unlink the module until we've removed all exposure.
Otherwise we get a race between unload and reload of the same module:
the new module doesn't see the old one in the list, but then fails because
it can't register over the still-extant entries in sysfs:

 [  103.981925] ------------[ cut here ]------------
 [  103.986902] WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xab/0xd0()
 [  103.993606] Hardware name: CrownBay Platform
 [  103.998075] sysfs: cannot create duplicate filename '/module/pch_gbe'
 [  104.004784] Modules linked in: pch_gbe(+) [last unloaded: pch_gbe]
 [  104.011362] Pid: 3021, comm: modprobe Tainted: G        W    3.9.0-rc5+ #5
 [  104.018662] Call Trace:
 [  104.021286]  [<c103599d>] warn_slowpath_common+0x6d/0xa0
 [  104.026933]  [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
 [  104.031986]  [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
 [  104.037000]  [<c1035a4e>] warn_slowpath_fmt+0x2e/0x30
 [  104.042188]  [<c1168c8b>] sysfs_add_one+0xab/0xd0
 [  104.046982]  [<c1168dbe>] create_dir+0x5e/0xa0
 [  104.051633]  [<c1168e78>] sysfs_create_dir+0x78/0xd0
 [  104.056774]  [<c1262bc3>] kobject_add_internal+0x83/0x1f0
 [  104.062351]  [<c126daf6>] ? kvasprintf+0x46/0x60
 [  104.067231]  [<c1262ebd>] kobject_add_varg+0x2d/0x50
 [  104.072450]  [<c1262f07>] kobject_init_and_add+0x27/0x30
 [  104.078075]  [<c1089240>] mod_sysfs_setup+0x80/0x540
 [  104.083207]  [<c1260851>] ? module_bug_finalize+0x51/0xc0
 [  104.088720]  [<c108ab29>] load_module+0x1429/0x18b0

We can teardown sysfs first, then to be sure, put the state in
MODULE_STATE_UNFORMED so it's ignored while we deconstruct it.

Reported-by: Veaceslav Falico <vfalico@redhat.com>
Tested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-17 13:23:02 +09:30
Eric Paris
62062cf8a3 audit: allow checking the type of audit message in the user filter
When userspace sends messages to the audit system it includes a type.
We want to be able to filter messages based on that type without have to
do the all or nothing option currently available on the
AUDIT_FILTER_TYPE filter list.  Instead we should be able to use the
AUDIT_FILTER_USER filter list and just use the message type as one part
of the matching decision.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-16 17:28:49 -04:00
Eric Paris
34c474de7b audit: fix build break when AUDIT_DEBUG == 2
Looks like this one has been around since 5195d8e21:

	kernel/auditsc.c: In function ‘audit_free_names’:
	kernel/auditsc.c:998: error: ‘i’ undeclared (first use in this function)

...and this warning:

	kernel/auditsc.c: In function ‘audit_putname’:
	kernel/auditsc.c:2045: warning: ‘i’ may be used uninitialized in this function

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-16 10:17:02 -04:00
Ingo Molnar
b5210b2a34 Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core
Pull uprobes updates from Oleg Nesterov:

 - "uretprobes" - an optimization to uprobes, like kretprobes are an optimization
   to kprobes. "perf probe -x file sym%return" now works like kretprobes.

 - PowerPC fixes plus a couple of cleanups/optimizations in uprobes and trace_uprobes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-16 11:04:10 +02:00
Paul E. McKenney
65d798f0f9 rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods
Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
not necessarily turn the scheduler-clock tick back on.  This state of
affairs could result in RCU waiting on an adaptive-ticks CPU running
for an extended period in kernel mode.  Such a CPU will never run the
RCU state machine, and could therefore indefinitely extend the RCU state
machine, sooner or later resulting in an OOM condition.

This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
causes RCU's force-quiescent-state processing to check for this condition
and to send an IPI to CPUs that remain in that state for too long.
"Too long" currently means about three jiffies by default, which is
quite some time for a CPU to remain in the kernel without blocking.
The rcu_tree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
sysfs variables may be used to tune "too long" if needed.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:18:36 +02:00
Frederic Weisbecker
fae30dd669 nohz: Improve a bit the full dynticks Kconfig documentation
Remove the "single task" statement from CONFIG_NO_HZ_FULL
title. The constraint can be invalidated when tasks from
other sched classes than SCHED_FAIR are running. Moreover
it's possible that hrtick join the party in the future.

Also add a line about the dependency on SMP.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:18:04 +02:00
Frederic Weisbecker
5b533f4ff5 nohz: Align periodic tick Kconfig with other choices' naming convention
Rename CONFIG_PERIODIC_HZ to CONFIG_HZ_PERIODIC in
order to stay consistent with other tick implementation
entries:

	CONFIG_HZ_PERIODIC
	CONFIG_NO_HZ_IDLE
	CONFIG_NO_HZ_FULL

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:11:26 +02:00
Frederic Weisbecker
c5bfece2d6 nohz: Switch from "extended nohz" to "full nohz" based naming
"Extended nohz" was used as a naming base for the full dynticks
API and Kconfig symbols. It reflects the fact the system tries
to stop the tick in more places than just idle.

But that "extended" name is a bit opaque and vague. Rename it to
"full" makes it clearer what the system tries to do under this
config: try to shutdown the tick anytime it can. The various
constraints that prevent that to happen shouldn't be considered
as fundamental properties of this feature but rather technical
issues that may be solved in the future.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 19:58:17 +02:00
Frederic Weisbecker
0644ca5c77 nohz: Fix old dynticks idle Kconfig backward compatibility
In order to enforce backward compatibility with older
config files, we want the new dynticks-idle Kconfig entry
to default its value to the one of the old CONFIG_NO_HZ symbol
if present.

Namely we want:

	config NO_HZ # old obsolete dynticks idle symbol
		bool

	config NO_HZ_IDLE # new dynticks idle symbol
		default NO_HZ

However Kconfig prevents this to work if the old symbol
is not visible. And this is currently the case because
NO_HZ lacks a title in order to show it in make oldconfig
and alike.

To fix this, bring a minimal title and help text to the
obsolete Kconfig entry that explains its purpose. This
makes the "defaulting" to work.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 19:39:40 +02:00
Oleg Nesterov
515619f209 uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change uprobe_perf_print()
to return if hlist_empty(call->perf_events).

Note: this is not uprobe-specific, we can change other users too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-15 17:39:52 +02:00
Linus Torvalds
bb33db7a07 Merge branches 'timers-urgent-for-linus', 'irq-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull {timer,irq,core} fixes from Thomas Gleixner:

 - timer: bug fix for a cpu hotplug race.

 - irq: single bugfix for a wrong return value, which prevents the
   calling function to invoke the software fallback.

 - core: bugfix which plugs two race confitions which can cause hotplug
   per cpu threads to end up on the wrong cpu.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Don't reinitialize a cpu_base lock on CPU_UP

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: gic: fix irq_trigger return

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-15 07:03:01 -07:00
Borislav Petkov
bec1b9e763 extable: Flip the sorting message
Now that we do sort the __extable at build time, we actually are
interested only in the case where we still do need to sort it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/1366023109-12098-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 13:25:16 +02:00
Tommi Rantala
8176cced70 perf: Treat attr.config as u64 in perf_swevent_init()
Trinity discovered that we fail to check all 64 bits of
attr.config passed by user space, resulting to out-of-bounds
access of the perf_swevent_enabled array in
sw_perf_event_destroy().

Introduced in commit b0a873ebb ("perf: Register PMU
implementations").

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 11:42:12 +02:00
Li Zefan
05fb22ec54 cgroup: remove cgrp->top_cgroup
It's not used, and it can be retrieved via cgrp->root->top_cgroup.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-14 23:26:10 -07:00
Chen Gang
e3f26752f0 kernel: kallsyms: memory override issue, need check destination buffer length
We don't export any symbols > 128 characters, but if we did then
  kallsyms_expand_symbol() would overflow the buffer handed to it.
  So we need check destination buffer length when copying.

  the related test:
    if we define an EXPORT function which name more than 128.
    will panic when call kallsyms_lookup_name by init_kprobes on booting.
    after check the length (provide this patch), it is ok.

  Implementaion:
    add additional destination buffer length parameter (maxlen)
    if uncompressed string is too long (>= maxlen), it will be truncated.
    not check the parameters whether valid, since it is a static function.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-15 15:17:26 +09:30
Tejun Heo
873fe09ea5 cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.

As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.

Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.

This patch introduces the mount option and changes the following
behaviors in cgroup core.

* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.

* Remount is disallowed.

The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.

v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
2013-04-14 20:15:26 -07:00
Tejun Heo
25a7e6848d move cgroupfs_root to include/linux/cgroup.h
While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kern/cgroup.c makes somethings pretty silly.  This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.

We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers.  Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-14 20:15:25 -07:00
Tejun Heo
9343862945 cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
There's no reason to be using bitops, which tends to be more
cumbersome, to handle root flags.  Convert them to masks.  Also, as
they'll be moved to include/linux/cgroup.h and it's generally a good
idea, add CGRP_ prefix.

Note that flags are assigned from (1 << 1).  The first bit will be
used by a flag which will be added soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-14 20:15:25 -07:00
Andy Lutomirski
41c21e351e userns: Changing any namespace id mappings should require privileges
Changing uid/gid/projid mappings doesn't change your id within the
namespace; it reconfigures the namespace.  Unprivileged programs should
*not* be able to write these files.  (We're also checking the privileges
on the wrong task.)

Given the write-once nature of these files and the other security
checks, this is likely impossible to usefully exploit.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:32 -07:00
Andy Lutomirski
e3211c120a userns: Check uid_map's opener's fsuid, not the current fsuid
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:31 -07:00
Eric W. Biederman
6708075f10 userns: Don't let unprivileged users trick privileged users into setting the id_map
When we require privilege for setting /proc/<pid>/uid_map or
/proc/<pid>/gid_map no longer allow an unprivileged user to
open the file and pass it to a privileged program to write
to the file.

Instead when privilege is required require both the opener and the
writer to have the necessary capabilities.

I have tested this code and verified that setting /proc/<pid>/uid_map
fails when an unprivileged user opens the file and a privielged user
attempts to set the mapping, that unprivileged users can still map
their own id, and that a privileged users can still setup an arbitrary
mapping.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:14 -07:00
Linus Torvalds
af788e35bf Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixlets"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Fix accounting on multi-threaded processes
  sched/debug: Fix sd->*_idx limit range avoiding overflow
  sched_clock: Prevent 64bit inatomicity on 32bit systems
  sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s
2013-04-14 11:12:17 -07:00
Linus Torvalds
ae9f4939ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix error return code
  ftrace: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, always make sure it's NUL terminated
  perf: Fix ring_buffer perf_output_space() boundary calculation
  perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer()
2013-04-14 11:10:44 -07:00
Linus Torvalds
3c91930f0c Namhyung Kim found and fixed a bug that can crash the kernel by simply
doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid
 
 Luckily, this can only be done by root, but still is a nasty bug.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRaK2+AAoJEOdOSU1xswtMw48IAJPcSNMl1+epx5cPw8pwf+y6
 YYvs/Ud3BMPBL+mpNPGNFWY+dWJsAtCtAgkLi0WgdL+b9iPNZrmQqqcP5xWV4uKV
 vRX2SPCQcyEn5keNnFdN3fN1R0+Gj4V8kLvxPqugzNrO9EHejx+TJFWjrONzkcSy
 g90lY45jfGWW0OS4GuSwHFhKDgcx8/kgb4Whv+xrKzTuX2QkU1BhG9WPsjiHWiL5
 WRYjC4LWafrWaPd4cIkzMqj1eU/hL8BkiLLQHM1Tw8yD7t8OPzgmuJMZEh6Cx1iW
 /Xrm5QkNEcqQ/vSAC6aWUi22VEgRYDLg8WjngwuMgY1Qa3LE2ex8cUDyk7lJbas=
 =SFA8
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "Namhyung Kim found and fixed a bug that can crash the kernel by simply
  doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid

  Luckily, this can only be done by root, but still is a nasty bug."

* tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section
  tracing: Fix possible NULL pointer dereferences
2013-04-14 10:50:55 -07:00
Tejun Heo
da1f296fd2 cgroup: make cgroup_path() not print double slashes
While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
cgroup_path() vs rename() race") introduced a bug where the path of a
non-root cgroup would have two slahses at the beginning, which is
caused by treating the root cgroup which has the name '/' like
non-root cgroups.

 $ grep systemd /proc/self/cgroup
 1:name=systemd://user/root/1

Fix it by special casing root cgroup case and not looping over it in
the normal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
2013-04-14 10:47:02 -07:00
Linus Torvalds
935d8aabd4 Add file_ns_capable() helper function for open-time capability checking
Nothing is using it yet, but this will allow us to delay the open-time
checks to use time, without breaking the normal UNIX permission
semantics where permissions are determined by the opener (and the file
descriptor can then be passed to a different process, or the process can
drop capabilities).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-14 10:06:31 -07:00
Oleg Nesterov
32520b2c69 uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobe_perf_print() passes addr=ip to perf_trace_buf_submit() for
no reason. This sets perf_sample_data->addr for PERF_SAMPLE_ADDR,
we already have perf_sample_data->ip initialized if PERF_SAMPLE_IP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-13 15:32:04 +02:00
Oleg Nesterov
4ee5a52ed6 uprobes/tracing: Change create_trace_uprobe() to support uretprobes
Finally change create_trace_uprobe() to check if argv[0][0] == 'r'
and pass the correct "is_ret" to alloc_trace_uprobe().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
3ede82dd3e uprobes/tracing: Make seq_printf() code uretprobe-friendly
Change probes_seq_show() and print_uprobe_event() to check
is_ret_probe() and print the correct data.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
4d1298e212 uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
Change uprobe_event_define_fields(), and __set_print_fmt() to check
is_ret_probe() and use the appropriate format/fields.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
393a736c28 uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
Change uprobe_trace_print() and uprobe_perf_print() to check
is_ret_probe() and fill ring_buffer_event accordingly.

Also change uprobe_trace_func() and uprobe_perf_func() to not
_print() if is_ret_probe() is true. Note that we keep ->handler()
nontrivial even for uretprobe, we need this for filtering and for
other potential extensions.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
c1ae5c75e1 uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
Create the new functions we need to support uretprobes, and change
alloc_trace_uprobe() to initialize consumer.ret_handler if the new
"is_ret" argument is true. Curently this argument is always false,
so the new code is never called and is_ret_probe(tu) is false too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:02 +02:00