The 'noprefix' option was introduced for backwards-compatibility of
cpuset, but actually it can be used when mounting other subsystems.
This results in possibility of name collision, and now the collision can
really happen, because we have 'stat' file in both memory and cpuacct
subsystem:
# mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt
Cgroup will happily mount the 2 subsystems, but only 'stat' file of memory
subsys can be seen.
We don't want users to use nopreifx, and also want to avoid name
collision, so we change to allow noprefix only if mounting just the cpuset
subsystem.
[akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Statistics for softirq doesn't exist.
It will be helpful like statistics for interrupts.
This patch introduces counting the number of softirq,
which will be exported in /proc/softirqs.
When softirq handler consumes much CPU time,
/proc/stat is like the following.
$ while :; do cat /proc/stat | head -n1 ; sleep 10 ; done
cpu 88 0 408 739665 583 28 2 0 0
cpu 450 0 1090 740970 594 28 1294 0 0
^^^^
softirq
In such a situation,
/proc/softirqs shows us which softirq handler is invoked.
We can see the increase rate of softirqs.
<before>
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI 0 0 0 0
TIMER 462850 462805 462782 462718
NET_TX 0 0 0 365
NET_RX 2472 2 2 40
BLOCK 0 0 381 1164
TASKLET 0 0 0 224
SCHED 462654 462689 462698 462427
RCU 3046 2423 3367 3173
<after>
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI 0 0 0 0
TIMER 463361 465077 465056 464991
NET_TX 53 0 1 365
NET_RX 3757 2 2 40
BLOCK 0 0 398 1170
TASKLET 0 0 0 224
SCHED 463074 464318 464612 463330
RCU 3505 2948 3947 3673
When CPU TIME of softirq is high,
the rates of increase is the following.
TIMER : 220/sec : CPU1-3
NET_TX : 5/sec : CPU0
NET_RX : 120/sec : CPU0
SCHED : 40-200/sec : all CPU
RCU : 45-58/sec : all CPU
The rates of increase in an idle mode is the following.
TIMER : 250/sec
SCHED : 250/sec
RCU : 2/sec
It seems many softirqs for receiving packets and rcu are invoked. This
gives us help for checking system.
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: start using hrtimers
hrtimer: export ktime_add_safe
UBIFS: do not forget to register BDI device
UBIFS: allow sync option in rootflags
UBIFS: remove dead code
UBIFS: use anonymous device
UBIFS: return proper error code if the compr is not present
UBIFS: return error if link and unlink race
UBIFS: reset no_space flag after inode deletion
* akpm: (182 commits)
fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
fbdev: *bfin*: fix __dev{init,exit} markings
fbdev: *bfin*: drop unnecessary calls to memset
fbdev: bfin-t350mcqb-fb: drop unused local variables
fbdev: blackfin has __raw I/O accessors, so use them in fb.h
fbdev: s1d13xxxfb: add accelerated bitblt functions
tcx: use standard fields for framebuffer physical address and length
fbdev: add support for handoff from firmware to hw framebuffers
intelfb: fix a bug when changing video timing
fbdev: use framebuffer_release() for freeing fb_info structures
radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
s3c-fb: CPUFREQ frequency scaling support
s3c-fb: fix resource releasing on error during probing
carminefb: fix possible access beyond end of carmine_modedb[]
acornfb: remove fb_mmap function
mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
mb862xxfb: restrict compliation of platform driver to PPC
Samsung SoC Framebuffer driver: add Alpha Channel support
atmel-lcdc: fix pixclock upper bound detection
offb: use framebuffer_alloc() to allocate fb_info struct
...
Manually fix up conflicts due to kmemcheck in mm/slab.c
Round the slow work queue's cull and OOM timeouts to whole second boundary
with round_jiffies(). The slow work queue uses a pair of timers to cull
idle threads and, after OOM, to delay new thread creation.
This patch also extracts the mod_timer() logic for the cull timer into a
separate helper function.
By rounding non-time-critical timers such as these to whole seconds, they
will be batched up to fire at the same time rather than being spread out.
This allows the CPU wake up less, which saves power.
Signed-off-by: Chris Peterson <cpeterso@cpeterso.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move supplementary groups implementation to kernel/groups.c .
kernel/sys.c already accumulated quite a few random stuff.
Do strictly copy/paste + add required headers to compile. Compile-tested
on many configs and archs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the following scenario appears to be possible in theory:
* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* Certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.
Although it is now hard to trigger during hibernation due to the memory
shrinking carried out by the hibernation code, it is theoretically
possible to trigger during suspend after the memory shrinking has been
removed from that code path. Moreover, since memory allocations are
going to be used for the hibernation memory shrinking, it will be even
more likely to happen during hibernation.
To prevent it from happening, introduce the oom_killer_disabled switch
that will cause __alloc_pages_internal() to fail in the situations in
which the OOM killer would have been called and make the freezer set
this switch after tasks have been successfully frozen.
[akpm@linux-foundation.org: be nicer to the namespace]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in
fast paths know for a fact their node is valid. To avoid a comparison and
branch, this patch adds alloc_pages_exact_node() that only checks the nid
with VM_BUG_ON(). Callers that know their node is valid are then
converted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the code of
memory policy. Because the memory policy is applied in the process's
context originally. After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.
But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the bug that the kernel didn't spread page cache/slab object evenly
over all the allowed nodes when spread flags were set by updating tasks'
page/slab spread flags in time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel still allocates the page caches on old node after modifying its
cpuset's mems when 'memory_spread_page' was set, or it didn't spread the
page cache evenly over all the nodes that faulting task is allowed to usr
after memory_spread_page was set. it is caused by the old mem_allowed and
flags of the task, the current kernel doesn't updates them unless some
function invokes cpuset_update_task_memory_state(), it is too late
sometimes.We must update the mem_allowed and the flags of the tasks in
time.
Slab has the same problem.
The following patches fix this bug by updating tasks' mem_allowed and
spread flag after its cpuset's mems or spread flag is changed.
This patch:
Extract a function from cpuset_update_task_memory_state(). It will be
used later for update tasks' page/slab spread flags after its cpuset's
flag is set
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
debugfs: use specified mode to possibly mark files read/write only
debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
xen: remove driver_data direct access of struct device from more drivers
usb: gadget: at91_udc: remove driver_data direct access of struct device
uml: remove driver_data direct access of struct device
block/ps3: remove driver_data direct access of struct device
s390: remove driver_data direct access of struct device
parport: remove driver_data direct access of struct device
parisc: remove driver_data direct access of struct device
of_serial: remove driver_data direct access of struct device
mips: remove driver_data direct access of struct device
ipmi: remove driver_data direct access of struct device
infiniband: ehca: remove driver_data direct access of struct device
ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
hvcs: remove driver_data direct access of struct device
xen block: remove driver_data direct access of struct device
thermal: remove driver_data direct access of struct device
scsi: remove driver_data direct access of struct device
pcmcia: remove driver_data direct access of struct device
PCIE: remove driver_data direct access of struct device
...
Manually fix up trivial conflicts due to different direct driver_data
direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
Several WARN_ON() messages omit the '\n' at the end of the string, which
is a simple (and understandable) error. The next line printed after
that warning line is usually the current module list, and that printk
does not have a log-level marker - resulting in one long mixed-up line.
Adding this loglevel marker will now avoid this unreadable mess.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a KERN_DEFAULT loglevel marker, for when you cannot decide
which loglevel you want, and just want to keep an existing printk
with the default loglevel.
The difference between having KERN_DEFAULT and having no log-level
marker at all is two-fold:
- having the log-level marker will now force a new-line if the
previous printout had not added one (perhaps because it forgot,
but perhaps because it expected a continuation)
- having a log-level marker is required if you are printing out a
message that otherwise itself could perhaps otherwise be mistaken
for a log-level.
Signed-of-by: Linus Torvalds <torvalds@linux-foundation.org>
It used to be that we would only look at the log-level in a printk()
after explicit newlines, which can cause annoying problems when the
previous printk() did not end with a '\n'. In that case, the log-level
marker would be just printed out in the middle of the line, and be
seen as just noise rather than change the logging level.
This changes things to always look at the log-level in the first
bytes of the printout. If a log level marker is found, it is always
used as the log-level. Additionally, if no newline existed, one is
added (unless the log-level is the explicit KERN_CONT marker, to
explicitly show that it's a continuation of a previous line).
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many developers use "/debug/" or "/debugfs/" or "/sys/kernel/debug/"
directory name to mount debugfs filesystem for ftrace according to
./Documentation/tracers/ftrace.txt file.
And, three directory names(ex:/debug/, /debugfs/, /sys/kernel/debug/) is
existed in kernel source like ftrace, DRM, Wireless, Documentation,
Network[sky2]files to mount debugfs filesystem.
debugfs means debug filesystem for debugging easy to use by greg kroah
hartman. "/sys/kernel/debug/" name is suitable as directory name
of debugfs filesystem.
- debugfs related reference: http://lwn.net/Articles/334546/
Fix inconsistency of directory name to mount debugfs filesystem.
* From Steven Rostedt
- find_debugfs() and tracing_files() in this patch.
Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com>
Acked-by : Inaky Perez-Gonzalez <inaky@linux.intel.com>
Reviewed-by : Steven Rostedt <rostedt@goodmis.org>
Reviewed-by : James Smart <james.smart@emulex.com>
CC: Jiri Kosina <trivial@kernel.org>
CC: David Airlie <airlied@linux.ie>
CC: Peter Osterlund <petero2@telia.com>
CC: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
During bootup performance tracing we see repeated occurrences of
/sys/kernel/uid/* events for the same uid, leading to a,
in this case, rather pointless userspace processing for the
same uid over and over.
This is usually caused by tools which change their uid to "nobody",
to run without privileges to read data supplied by untrusted users.
This change delays the execution of the (already existing) scheduled
work, to cleanup the uid after one second, so the allocated and announced
uid can possibly be re-used by another process.
This is the current behavior, where almost every invocation of a
binary, which changes the uid, creates two events:
$ read START < /sys/kernel/uevent_seqnum; \
for i in `seq 100`; do su --shell=/bin/true bin; done; \
read END < /sys/kernel/uevent_seqnum; \
echo $(($END - $START))
178
With the delayed cleanup, we get only two events, and userspace finishes
a bit faster too:
$ read START < /sys/kernel/uevent_seqnum; \
for i in `seq 100`; do su --shell=/bin/true bin; done; \
read END < /sys/kernel/uevent_seqnum; \
echo $(($END - $START))
1
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: Logic to move non pinned timers
timers: /proc/sys sysctl hook to enable timer migration
timers: Identifying the existing pinned timers
timers: Framework for identifying pinned timers
timers: allow deferrable timers for intervals tv2-tv5 to be deferred
Fix up conflicts in kernel/sched.c and kernel/timer.c manually
* 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevent: export register_device and delta2ns
clockevents: tick_broadcast_device can become static
* 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: prevent selection of low resolution clocksourse also for nohz=on
clocksource: sanity check sysfs clocksource changes
This false positive is due to field padding in struct sigqueue. When
this dynamically allocated structure is copied to the stack (in arch-
specific delivery code), kmemcheck sees a read from the padding, which
is, naturally, uninitialized.
Hide the false positive using the __GFP_NOTRACK_FALSE_POSITIVE flag.
Also made the rlimit override code a bit clearer by introducing a new
variable.
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This gets rid of a heap of false-positive warnings from the tracer
code due to the use of bitfields.
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
With kmemcheck enabled, the slab allocator needs to do this:
1. Tell kmemcheck to allocate the shadow memory which stores the status of
each byte in the allocation proper, e.g. whether it is initialized or
uninitialized.
2. Tell kmemcheck which parts of memory that should be marked uninitialized.
There are actually a few more states, such as "not yet allocated" and
"recently freed".
If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.
If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK flag. This does not prevent the page
faults from occuring, however, but marks the object in question as being
initialized so that no warnings will ever be produced for this object.
In addition to (and in contrast to) __GFP_NOTRACK, the
__GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
not be tracked _because_ it would produce a false positive. Their values
are identical, but need not be so in the future (for example, we could now
enable/disable false positives with a config option).
Parts of this patch were contributed by Pekka Enberg but merged for
atomicity.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (53 commits)
.gitignore: ignore *.lzma files
kbuild: add generic --set-str option to scripts/config
kbuild: simplify argument loop in scripts/config
kbuild: handle non-existing options in scripts/config
kallsyms: generalize text region handling
kallsyms: support kernel symbols in Blackfin on-chip memory
documentation: make version fix
kbuild: fix a compile warning
gitignore: Add GNU GLOBAL files to top .gitignore
kbuild: fix delay in setlocalversion on readonly source
README: fix misleading pointer to the defconf directory
vmlinux.lds.h update
kernel-doc: cleanup perl script
Improve vmlinux.lds.h support for arch specific linker scripts
kbuild: fix headers_exports with boolean expression
kbuild/headers_check: refine extern check
kbuild: fix "Argument list too long" error for "make headers_check",
ignore *.patch files
Remove bashisms from scripts
menu: fix embedded menu presentation
...
General description: kmemcheck is a patch to the linux kernel that
detects use of uninitialized memory. It does this by trapping every
read and write to memory that was allocated dynamically (e.g. using
kmalloc()). If a memory address is read that has not previously been
written to, a message is printed to the kernel log.
Thanks to Andi Kleen for the set_memory_4k() solution.
Andrew Morton suggested documenting the shadow member of struct page.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
[export kmemcheck_mark_initialized]
[build fix for setup_max_cpus]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
commit 3f68535ada (clocksource: sanity check sysfs clocksource
changes) prevents selection of non high resolution capable
clocksources when high resolution mode is active, but did not take
into account that the same rules apply for highres=off nohz=on.
Check the tick device mode instead of hrtimer_hres_active() to verify
whether the system needs to be protected from a switch to jiffies or
other non highres capable clock sources.
Reported-by: Luming Yu <luming.yu@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Rationale: kmemcheck needs to be able to schedule a tasklet without
touching any dynamically allocated memory _at_ _all_ (since that would
lead to a recursive page fault). This tasklet is used for writing the
error reports to the kernel log.
The new scheduling function avoids touching any other tasklets by
inserting the new tasklist as the head of the "tasklet_hi" list instead
of on the tail.
Also don't wake up the softirq thread lest the scheduler access some
tracked memory and we go down with a recursive page fault.
In this case, we'd better just wait for the maximum time of 1/HZ for the
message to appear.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Start documenting HAVE_PERF_COUNTERS requirements
perf_counter: Add forward/backward attribute ABI compatibility
perf record: Explicity program a default counter
perf_counter: Remove PERF_TYPE_RAW special casing
perf_counter: PERF_TYPE_HW_CACHE is a hardware counter too
powerpc, perf_counter: Fix performance counter event types
perf_counter/x86: Add a quirk for Atom processors
perf_counter tools: Remove one L1-data alias
The *_nvs_* routines in swsusp.c make use of the io*map()
functions, which are only provided for HAS_IOMEM, thus
breaking compilation if HAS_IOMEM is not set. Fix this
by moving the *_nvs_* routines into hibernate_nvs.c, which
is only compiled if HAS_IOMEM is set.
[rjw: Change the name of the new file to hibernate_nvs.c, add the
license line to the header comment.]
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Change the name of kernel/power/disk.c to kernel/power/hibernate.c
in analogy with the file names introduced by the changes that
separated the suspend to RAM and standby funtionality from the
common PM functions.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Move the suspend to RAM and standby code from kernel/power/main.c
to two separate files, kernel/power/suspend.c containing the basic
functions and kernel/power/suspend_test.c containing the automatic
suspend test facility based on the RTC clock alarm.
There are no changes in functionality related to these modifications.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
A future patch is going to modify the memory shrinking code so that
it will make memory allocations to free memory instead of using an
artificial memory shrinking mechanism for that. For this purpose it
is convenient to move swsusp_shrink_memory() from
kernel/power/swsusp.c to kernel/power/snapshot.c, because the new
memory-shrinking code is going to use things that are local to
kernel/power/snapshot.c .
[rev. 2: Make some functions static and remove their headers from
kernel/power/power.h]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Remove the shrinking of memory from the suspend-to-RAM code, where
it is not really necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
This patch (as1241) renames a bunch of functions in the PM core.
Rather than go through a boring list of name changes, suffice it to
say that in the end we have a bunch of pairs of functions:
device_resume_noirq dpm_resume_noirq
device_resume dpm_resume
device_complete dpm_complete
device_suspend_noirq dpm_suspend_noirq
device_suspend dpm_suspend
device_prepare dpm_prepare
in which device_X does the X operation on a single device and dpm_X
invokes device_X for all devices in the dpm_list.
In addition, the old dpm_power_up and device_resume_noirq have been
combined into a single function (dpm_resume_noirq).
Lastly, dpm_suspend_start and dpm_resume_end are the renamed versions
of the former top-level device_suspend and device_resume routines.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Rename the functions performing "_noirq" dev_pm_ops
operations from device_power_down() and device_power_up()
to device_suspend_noirq() and device_resume_noirq().
The new function names are chosen to show that the functions
are responsible for calling the _noirq() versions to finalize
the suspend/resume operation. The current function names do
not perform power down/up anymore so the names may be misleading.
Global function renames:
- device_power_down() -> device_suspend_noirq()
- device_power_up() -> device_resume_noirq()
Static function renames:
- suspend_device_noirq() -> __device_suspend_noirq()
- resume_device_noirq() -> __device_resume_noirq()
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* 'topic/slab/earlyboot-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: setup cpu caches later on when interrupts are enabled
slab,slub: don't enable interrupts during early boot
slab: fix gfp flag in setup_cpu_cache()
x86: make zap_low_mapping could be used early
irq: slab alloc for default irq_affinity
memcg: fix page_cgroup fatal error in FLATMEM
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (31 commits)
lguest: add support for indirect ring entries
lguest: suppress notifications in example Launcher
lguest: try to batch interrupts on network receive
lguest: avoid sending interrupts to Guest when no activity occurs.
lguest: implement deferred interrupts in example Launcher
lguest: remove obsolete LHREQ_BREAK call
lguest: have example Launcher service all devices in separate threads
lguest: use eventfds for device notification
eventfd: export eventfd_signal and eventfd_fget for lguest
lguest: allow any process to send interrupts
lguest: PAE fixes
lguest: PAE support
lguest: Add support for kvm_hypercall4()
lguest: replace hypercall name LHCALL_SET_PMD with LHCALL_SET_PGD
lguest: use native_set_* macros, which properly handle 64-bit entries when PAE is activated
lguest: map switcher with executable page table entries
lguest: fix writev returning short on console output
lguest: clean up length-used value in example launcher
lguest: Segment selectors are 16-bit long. Fix lg_cpu.ss1 definition.
lguest: beyond ARRAY_SIZE of cpu->arch.gdt
...
lguest needs kick_process: wake_up_process() does nothing if a process
is running, which isn't sufficient (we need it in the kernel).
And lguest support is usually modular.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Provide for means of extending the perf_counter_attr in a 'natural' way.
We allow growing the structure by appending fields at the end by specifying
the full structure size inside it.
When a new kernel sees a smaller (old) structure, it will 0 pad the tail.
When an old kernel sees a larger (new) structure, it will verify the tail
consists of 0s, otherwise fail.
If we fail due to a size-mismatch, we return -E2BIG and write the kernel's
native attribe size back into the provided structure.
Furthermore, add some attribute verification, so that we'll fail counter
creation when unknown bits are present (PERF_SAMPLE, PERF_FORMAT, or in
the __reserved fields).
(This ABI detail is introduced while keeping the existing syscall ABI.)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The PERF_TYPE_RAW special case seems superfluous these days. Remove
it and add it to the switch() stmt like the others.
[ Impact: cleanup ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's theoretically possible that there are exception table entries
which point into the (freed) init text of modules. These could cause
future problems if other modules get loaded into that memory and cause
an exception as we'd see the wrong fixup. The only case I know of is
kvm-intel.ko (when CONFIG_CC_OPTIMIZE_FOR_SIZE=n).
Amerigo fixed this long-standing FIXME in the x86 version, but this
patch is more general.
This implements trim_init_extable(); most archs are simple since they
use the standard lib/extable.c sort code. Alpha and IA64 use relative
addresses in their fixups, so thier trimming is a slight variation.
Sparc32 is unique; it doesn't seem to define ARCH_HAS_SORT_EXTABLE,
yet it defines its own sort_extable() which overrides the one in lib.
It doesn't sort, so we have to mark deleted entries instead of
actually trimming them.
Inspired-by: Amerigo Wang <amwang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-alpha@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-ia64@vger.kernel.org