Commit Graph

3268 Commits

Author SHA1 Message Date
Tejun Heo
ce3141a277 percpu: drop pcpu_chunk->page[]
percpu core doesn't need to tack all the allocated pages.  It needs to
know whether certain pages are populated and a way to reverse map
address to page when freeing.  This patch drops pcpu_chunk->page[] and
use populated bitmap and vmalloc_to_page() lookup instead.  Using
vmalloc_to_page() exclusively is also possible but complicates first
chunk handling, inflates cache footprint and prevents non-standard
memory allocation for percpu memory.

pcpu_chunk->page[] was used to track each page's allocation and
allowed asymmetric population which happens during failure path;
however, with single bitmap for all units, this is no longer possible.
Bite the bullet and rewrite (de)populate functions so that things are
done in clearly separated steps such that asymmetric population
doesn't happen.  This makes the (de)population process much more
modular and will also ease implementing non-standard memory usage in
the future (e.g. large pages).

This makes @get_page_fn parameter to pcpu_setup_first_chunk()
unnecessary.  The parameter is dropped and all first chunk helpers are
updated accordingly.  Please note that despite the volume most changes
to first chunk helpers are symbol renames for variables which don't
need to be referenced outside of the helper anymore.

This change reduces memory usage and cache footprint of pcpu_chunk.
Now only #unit_pages bits are necessary per chunk.

[ Impact: reduced memory usage and cache footprint for bookkeeping ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-07-04 08:11:00 +09:00
Tejun Heo
c8a51be4ca percpu: reorder a few functions in mm/percpu.c
(de)populate functions are about to be reimplemented to drop
pcpu_chunk->page array.  Move a few functions so that the rewrite
patch doesn't have code movement making it more difficult to read.

[ Impact: code movement ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Tejun Heo
38a6be5254 percpu: simplify pcpu_setup_first_chunk()
Now that all first chunk allocator helpers allocate and map the first
chunk themselves, there's no need to have optional default alloc/map
in pcpu_setup_first_chunk().  Drop @populate_pte_fn and only leave
@dyn_size optional and make all other params mandatory.

This makes it much easier to follow what pcpu_setup_first_chunk() is
doing and what actual differences tweaking each parameter results in.

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Tejun Heo
8c4bfc6e88 x86,percpu: generalize lpage first chunk allocator
Generalize and move x86 setup_pcpu_lpage() into
pcpu_lpage_first_chunk().  setup_pcpu_lpage() now is a simple wrapper
around the generalized version.  Other than taking size parameters and
using arch supplied callbacks to allocate/free/map memory,
pcpu_lpage_first_chunk() is identical to the original implementation.

This simplifies arch code and will help converting more archs to
dynamic percpu allocator.

While at it, factor out pcpu_calc_fc_sizes() which is common to
pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Tejun Heo
8f05a6a65d percpu: make 4k first chunk allocator map memory
At first, percpu first chunk was always setup page-by-page by the
generic code.  To add other allocators, different parts of the generic
initialization was made optional.  Now we have three allocators -
embed, remap and 4k.  embed and remap fully handle allocation and
mapping of the first chunk while 4k still depends on generic code for
those.  This makes the generic alloc/map paths specifci to 4k and
makes the code unnecessary complicated with optional generic
behaviors.

This patch makes the 4k allocator to allocate and map memory directly
instead of depending on the generic code.  The only outside visible
change is that now dynamic area in the first chunk is allocated
up-front instead of on-demand.  This doesn't make any meaningful
difference as the area is minimal (usually less than a page, just
enough to fill the alignment) on 4k allocator.  Plus, dynamic area in
the first chunk usually gets fully used anyway.

This will allow simplification of pcpu_setpu_first_chunk() and removal
of chunk->page array.

[ Impact: no outside visible change other than up-front allocation of dyn area ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Tejun Heo
d4b95f8039 x86,percpu: generalize 4k first chunk allocator
Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
setup_pcpu_4k() now is a simple wrapper around the generalized
version.  Other than taking size parameters and using arch supplied
callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
to the original implementation.

This simplifies arch code and will help converting more archs to
dynamic percpu allocator.

While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
consistency.

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Tejun Heo
788e5abc54 percpu: drop @unit_size from embed first chunk allocator
The only extra feature @unit_size provides is making dead space at the
end of the first chunk which doesn't have any valid usecase.  Drop the
parameter.  This will increase consistency with generalized 4k
allocator.

James Bottomley spotted missing conversion for the default
setup_per_cpu_areas() which caused build breakage on all arcsh which
use it.

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:58 +09:00
Tejun Heo
79ba6ac825 x86: make pcpu_chunk_addr_search() matching stricter
The @addr passed into pcpu_chunk_addr_search() is unit0 based address
and thus should be matched inside unit0 area.  Currently, when it uses
chunk size when determining whether the address falls in the first
chunk.  Addresses in unitN where N>0 shouldn't be passed in anyway, so
this doesn't cause any malfunction but fix it for consistency.

[ Impact: mostly cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:58 +09:00
Tejun Heo
c43768cbb7 Merge branch 'master' into for-next
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes.  As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.

Conflicts:
	arch/alpha/include/asm/percpu.h
	arch/mn10300/kernel/vmlinux.lds.S
	include/linux/percpu-defs.h
2009-07-04 07:13:18 +09:00
Linus Torvalds
5a475ce469 Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: LCDC dcache flush for deferred io
  sh: Fix compiler error and include the definition of IS_ERR_VALUE
  sh: re-add LCDC fbdev support to the Migo-R defconfig
  sh: fix se7724 ceu names
  sh: ms7724se: Enable sh_eth in defconfig.
  arch/sh/boards/mach-se/7206/io.c: Remove unnecessary semicolons
  sh: ms7724se: Add sh_eth support
  nommu: provide follow_pfn().
  sh: Kill off unused DEBUG_BOOTMEM symbol.
  perf_counter tools: add cpu_relax()/rmb() definitions for sh.
  sh64: Hook up page fault events for software perf counters.
  sh: Hook up page fault events for software perf counters.
  sh: make set_perf_counter_pending() static inline.
  clocksource: sh_tmu: Make undefined TCOR behaviour less undefined.
2009-07-01 11:46:30 -07:00
Ingo Molnar
57d81f6f39 kmemleak: Fix scheduling-while-atomic bug
One of the kmemleak changes caused the following
scheduling-while-holding-the-tasklist-lock regression on x86:

BUG: sleeping function called from invalid context at mm/kmemleak.c:795
in_atomic(): 1, irqs_disabled(): 0, pid: 1737, name: kmemleak
2 locks held by kmemleak/1737:
 #0:  (scan_mutex){......}, at: [<c10c4376>] kmemleak_scan_thread+0x45/0x86
 #1:  (tasklist_lock){......}, at: [<c10c3bb4>] kmemleak_scan+0x1a9/0x39c
Pid: 1737, comm: kmemleak Not tainted 2.6.31-rc1-tip #59266
Call Trace:
 [<c105ac0f>] ? __debug_show_held_locks+0x1e/0x20
 [<c102e490>] __might_sleep+0x10a/0x111
 [<c10c38d5>] scan_yield+0x17/0x3b
 [<c10c3970>] scan_block+0x39/0xd4
 [<c10c3bc6>] kmemleak_scan+0x1bb/0x39c
 [<c10c4331>] ? kmemleak_scan_thread+0x0/0x86
 [<c10c437b>] kmemleak_scan_thread+0x4a/0x86
 [<c104d73e>] kthread+0x6e/0x73
 [<c104d6d0>] ? kthread+0x0/0x73
 [<c100959f>] kernel_thread_helper+0x7/0x10
kmemleak: 834 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

The bit causing it is highly dubious:

static void scan_yield(void)
{
        might_sleep();

        if (time_is_before_eq_jiffies(next_scan_yield)) {
                schedule();
                next_scan_yield = jiffies + jiffies_scan_yield;
        }
}

It called deep inside the codepath and in a conditional way,
and that is what crapped up when one of the new scan_block()
uses grew a tasklist_lock dependency.

This minimal patch removes that yielding stuff and adds the
proper cond_resched().

The background scanning thread could probably also be reniced
to +10.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-01 10:26:23 -07:00
Linus Torvalds
e83c2b0ff3 Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6
* 'kmemleak' of git://linux-arm.org/linux-2.6:
  kmemleak: Inform kmemleak about pid_hash
  kmemleak: Do not warn if an unknown object is freed
  kmemleak: Do not report new leaked objects if the scanning was stopped
  kmemleak: Slightly change the policy on newly allocated objects
  kmemleak: Do not trigger a scan when reading the debug/kmemleak file
  kmemleak: Simplify the reports logged by the scanning thread
  kmemleak: Enable task stacks scanning by default
  kmemleak: Allow the early log buffer to be configurable.
2009-06-30 19:04:53 -07:00
Yinghai Lu
66918dcdf9 x86: only clear node_states for 64bit
Nathan reported that

| commit 73d60b7f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date:   Tue Jun 16 15:33:00 2009 -0700
|
|    page-allocator: clear N_HIGH_MEMORY map before we set it again
|
|    SRAT tables may contains nodes of very small size.  The arch code may
|    decide to not activate such a node.  However, currently the early boot
|    code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
|    active although these nodes have no present pages.
|
|    For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too

unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
an i386 kvm guest, meaning that cpuset.mems can not be used.

Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn

Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:01 -07:00
Richard Kennedy
d7831a0bdf mm: prevent balance_dirty_pages() from doing too much work
balance_dirty_pages can overreact and move all of the dirty pages to
writeback unnecessarily.

balance_dirty_pages makes its decision to throttle based on the number of
dirty plus writeback pages that are over the calculated limit,so it will
continue to move pages even when there are plenty of pages in writeback
and less than the threshold still dirty.

This allows it to overshoot its limits and move all the dirty pages to
writeback while waiting for the drives to catch up and empty the writeback
list.

A simple fio test easily demonstrates this problem.

fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10

This is the simplest fix I could find, but I'm not entirely sure that it
alone will be enough for all cases.  But it certainly is an improvement on
my desktop machine writing to 2 disks.

Do we need something more for machines with large arrays where
bdi_threshold * number_of_drives is greater than the dirty_ratio ?

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:01 -07:00
Thomas Gleixner
c49568235d dmapools: protect page_list walk in show_pools()
show_pools() walks the page_list of a pool w/o protection against the list
modifications in alloc/free.  Take pool->lock to avoid stomping into
nirvana.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:00 -07:00
Catalin Marinas
b6e687221e kmemleak: Do not warn if an unknown object is freed
vmap'ed memory blocks are not tracked by kmemleak (yet) but they may be
released with vfree() which is tracked. The corresponding kmemleak
warning is only enabled in debug mode. Future patch will add support for
ioremap and vmap.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-29 17:14:14 +01:00
Catalin Marinas
17bb9e0d90 kmemleak: Do not report new leaked objects if the scanning was stopped
If the scanning was stopped with a signal, it is possible that some
objects are left with a white colour (potential leaks) and reported. Add
a check to avoid reporting such objects.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-29 17:14:13 +01:00
Linus Torvalds
8326e284f8 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, delay: tsc based udelay should have rdtsc_barrier
  x86, setup: correct include file in <asm/boot.h>
  x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
  x86, mce: percpu mcheck_timer should be pinned
  x86: Add sysctl to allow panic on IOCK NMI error
  x86: Fix uv bau sending buffer initialization
  x86, mce: Fix mce resume on 32bit
  x86: Move init_gbpages() to setup_arch()
  x86: ensure percpu lpage doesn't consume too much vmalloc space
  x86: implement percpu_alloc kernel parameter
  x86: fix pageattr handling for lpage percpu allocator and re-enable it
  x86: reorganize cpa_process_alias()
  x86: prepare setup_pcpu_lpage() for pageattr fix
  x86: rename remap percpu first chunk allocator to lpage
  x86: fix duplicate free in setup_pcpu_remap() failure path
  percpu: fix too lazy vunmap cache flushing
  x86: Set cpu_llc_id on AMD CPUs
2009-06-28 11:05:28 -07:00
Catalin Marinas
acf4968ec9 kmemleak: Slightly change the policy on newly allocated objects
Newly allocated objects are more likely to be reported as false
positives. Kmemleak ignores the reporting of objects younger than 5
seconds. However, this age was calculated after the memory scanning
completed which usually takes longer than 5 seconds. This patch
make the minimum object age calculation in relation to the start of the
memory scanning.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26 17:38:29 +01:00
Catalin Marinas
4698c1f2bb kmemleak: Do not trigger a scan when reading the debug/kmemleak file
Since there is a kernel thread for automatically scanning the memory, it
makes sense for the debug/kmemleak file to only show its findings. This
patch also adds support for "echo scan > debug/kmemleak" to trigger an
intermediate memory scan and eliminates the kmemleak_mutex (scan_mutex
covers all the cases now).

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26 17:38:27 +01:00
Catalin Marinas
bab4a34afc kmemleak: Simplify the reports logged by the scanning thread
Because of false positives, the memory scanning thread may print too
much information. This patch changes the scanning thread to only print
the number of newly suspected leaks. Further information can be read
from the /sys/kernel/debug/kmemleak file.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26 17:38:26 +01:00
Catalin Marinas
e0a2a1601b kmemleak: Enable task stacks scanning by default
This is to reduce the number of false positives reported.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26 17:38:25 +01:00
Paul Mundt
dfc2f91ac2 nommu: provide follow_pfn().
With the introduction of follow_pfn() as an exported symbol, modules have
begun making use of it. Unfortunately this was not reflected on nommu at
the time, so the in-tree users have subsequently all blown up with link
errors there.

This provides a simple follow_pfn() that just returns addr >> PAGE_SHIFT,
which will do the right thing on nommu. There is no need to do range
checking within the vma, as the find_vma() case will already take care of
this.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-06-26 04:31:57 +09:00
Peter Zijlstra
9d73777e50 clarify get_user_pages() prototype
Currently the 4th parameter of get_user_pages() is called len, but its
in pages, not bytes. Rename the thing to nr_pages to avoid future
confusion.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-25 11:22:13 -07:00
Catalin Marinas
a9d9058aba kmemleak: Allow the early log buffer to be configurable.
(feature suggested by Sergey Senozhatsky)

Kmemleak needs to track all the memory allocations but some of these
happen before kmemleak is initialised. These are stored in an internal
buffer which may be exceeded in some kernel configurations. This patch
adds a configuration option with a default value of 400 and also removes
the stack dump when the early log buffer is exceeded.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by>
2009-06-25 10:16:13 +01:00
Linus Torvalds
c622304825 Merge branches 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/{vfs-2.6,audit-current}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  another race fix in jfs_check_acl()
  Get "no acls for this inode" right, fix shmem breakage
  inline functions left without protection of ifdef (acl)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL
2009-06-24 14:17:14 -07:00
Al Viro
72c04902d1 Get "no acls for this inode" right, fix shmem breakage
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 16:58:48 -04:00
Pekka Enberg
ba52270d18 SLUB: Don't pass __GFP_FAIL for the initial allocation
SLUB uses higher order allocations by default but falls back to small
orders under memory pressure. Make sure the GFP mask used in the initial
allocation doesn't include __GFP_NOFAIL.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-24 12:20:14 -07:00
Linus Torvalds
4923abf9f1 Don't warn about order-1 allocations with __GFP_NOFAIL
Traditionally, we never failed small orders (even regardless of any
__GFP_NOFAIL flags), and slab will allocate order-1 allocations even for
small allocations that could fit in a single page (in order to avoid
excessive fragmentation).

Maybe we should remove this warning entirely, but before making that
judgement, at least limit it to bigger allocations.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-24 12:16:49 -07:00
Al Viro
06b16e9f68 switch shmem to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:06 -04:00
Tejun Heo
245b2e70ea percpu: clean up percpu variable definitions
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique.  Update percpu
variable definitions accordingly.

* as,cfq: rename ioc_count uniquely

* cpufreq: rename cpu_dbs_info uniquely

* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it

* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
  rename it

* ipv4,6: rename cookie_scratch uniquely

* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
  pmc_irq_entry and nmi_entry to pmc_nmi_entry

* perf_counter: rename disable_count to perf_disable_count

* ftrace: rename test_event_disable to ftrace_test_event_disable

* kmemleak: rename test_pointer to kmemleak_test_pointer

* mce: rename next_interval to mce_next_interval

[ Impact: percpu usage cleanups, no duplicate static percpu var names ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
2009-06-24 15:13:48 +09:00
Tejun Heo
204fba4aa3 percpu: cleanup percpu array definitions
Currently, the following three different ways to define percpu arrays
are in use.

1. DEFINE_PER_CPU(elem_type[array_len], array_name);
2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
3. DEFINE_PER_CPU(elem_type, array_name)[array_len];

Unify to #1 which correctly separates the roles of the two parameters
and thus allows more flexibility in the way percpu variables are
defined.

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
2009-06-24 15:13:45 +09:00
Tejun Heo
e74e396204 percpu: use dynamic percpu allocator as the default percpu allocator
This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
dynamic percpu allocator.  The first chunk is allocated using
embedding helper and 8k is reserved for modules.  This ensures that
the new allocator behaves almost identically to the original allocator
as long as static percpu variables are concerned, so it shouldn't
introduce much breakage.

s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing
range limit the addressing model imposes.  Unfortunately, this breaks
if the address is specified using a variable, so for now, the two
archs aren't converted.

The following architectures are affected by this change.

* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc (broken, under investigation)
* m32r
* powerpc(32)

As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs.  These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.

* powerpc(64)
* sparc(64)
* ia64
* alpha
* s390

Boot and batch alloc/free tests on x86_32 with debug code (x86_32
doesn't use default first chunk initialization).  Compile tested on
sparc(32), powerpc(32), arm and alpha.

Kyle McMartin reported that this change breaks parisc.  The problem is
still under investigation and he is okay with pushing this patch
forward and fixing parisc later.

[ Impact: use dynamic allocator for most archs w/o custom percpu setup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-24 15:13:35 +09:00
Dimitri Sivanich
364df0ebfb mm: fix handling of pagesets for downed cpus
After downing/upping a cpu, an attempt to set
/proc/sys/vm/percpu_pagelist_fraction results in an oops in
percpu_pagelist_fraction_sysctl_handler().

If a processor is downed then we need to set the pageset pointer back to
the boot pageset.

Updates of the high water marks should not access pagesets of unpopulated
zones (those pointer go to the boot pagesets which would be no longer
functional if their size would be increased beyond zero).

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23 12:50:05 -07:00
Hugh Dickins
a5c9b696ec mm: pass mm to grab_swap_token
If a kthread happens to use get_user_pages() on an mm (as KSM does),
there's a chance that it will end up trying to read in a swap page, then
oops in grab_swap_token() because the kthread has no mm: GUP passes down
the right mm, so grab_swap_token() ought to be using it.

We have not identified a stronger case than KSM's daemon (not yet in
mainline), but the issue must have come up before, since RHEL has included
a fix for this for years (though a different fix, they just back out of
grab_swap_token if current->mm is unset: which is what we first proposed,
but using the right mm here seems more correct).

Reported-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23 12:50:05 -07:00
Linus Torvalds
95b3692d9c Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6
* 'kmemleak' of git://linux-arm.org/linux-2.6:
  kmemleak: Do not force the slab debugging Kconfig options
  kmemleak: use pr_fmt
2009-06-23 11:25:04 -07:00
Hugh Dickins
d26ed650d9 mm: don't rely on flags coincidence
Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
and it's tempting to devise a set of Grand Unified Paging flags;
but not today.  So until then, let's rely upon the compiler to spot
the coincidence, "rather than have that subtle dependency and a
comment for it" - as you remarked in another context yesterday.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23 11:23:33 -07:00
Hugh Dickins
788c7df451 hugetlb: fault flags instead of write_access
handle_mm_fault() is now passing fault flags rather than write_access
down to hugetlb_fault(), so better recognize that in hugetlb_fault(),
and in hugetlb_no_page().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23 11:23:33 -07:00
KAMEZAWA Hiroyuki
cb4cbcf6b3 mm: fix incorrect page removal from LRU
The isolated page is "cursor_page" not "page".

This could cause LRU list corruption under memory pressure, caught by
CONFIG_DEBUG_LIST.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23 10:17:28 -07:00
Joe Perches
ae281064be kmemleak: use pr_fmt
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-23 14:40:26 +01:00
Tejun Heo
fa8a7094ba x86: implement percpu_alloc kernel parameter
According to Andi, it isn't clear whether lpage allocator is worth the
trouble as there are many processors where PMD TLB is far scarcer than
PTE TLB.  The advantage or disadvantage probably depends on the actual
size of percpu area and specific processor.  As performance
degradation due to TLB pressure tends to be highly workload specific
and subtle, it is difficult to decide which way to go without more
data.

This patch implements percpu_alloc kernel parameter to allow selecting
which first chunk allocator to use to ease debugging and testing.

While at it, make sure all the failure paths report why something
failed to help determining why certain allocator isn't working.  Also,
kill the "Great future plan" comment which had already been realized
quite some time ago.

[ Impact: allow explicit percpu first chunk allocator selection ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22 11:56:24 +09:00
Tejun Heo
85ae87c1ad percpu: fix too lazy vunmap cache flushing
In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as
the page is going to be returned to the page allocator.  Only TLB
flushing can be put off such that vmalloc code can handle it lazily.
Fix it.

[ Impact: fix subtle virtual cache flush bug ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22 11:56:23 +09:00
Linus Torvalds
d06063cc22 Move FAULT_FLAG_xyz into handle_mm_fault() callers
This allows the callers to now pass down the full set of FAULT_FLAG_xyz
flags to handle_mm_fault().  All callers have been (mechanically)
converted to the new calling convention, there's almost certainly room
for architectures to clean up their code and then add FAULT_FLAG_RETRY
when that support is added.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-21 13:08:22 -07:00
Linus Torvalds
30c9f3a9fa Remove internal use of 'write_access' in mm/memory.c
The fault handling routines really want more fine-grained flags than a
single "was it a write fault" boolean - the callers will want to set
flags like "you can return a retry error" etc.

And that's actually how the VM works internally, but right now the
top-level fault handling functions in mm/memory.c all pass just the
'write_access' boolean around.

This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
variable instead.  The 'write_access' calling convention still exists
for the exported 'handle_mm_fault()' function, but that is next.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-21 13:06:05 -07:00
Johannes Weiner
c277331d5f mm: page_alloc: clear PG_locked before checking flags on free
da456f1 "page allocator: do not disable interrupts in free_page_mlock()" moved
the PG_mlocked clearing after the flag sanity checking which makes mlocked
pages always trigger 'bad page'.  Fix this by clearing the bit up front.

Reported--and-debugged-by: Peter Chubb <peter.chubb@nicta.com.au>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-20 16:08:22 -07:00
Joe Perches
433f13a727 bootmem.c: avoid c90 declaration warning
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-19 16:46:04 -07:00
Benjamin Herrenschmidt
dcce284a25 mm: Extend gfp masking to the page allocator
The page allocator also needs the masking of gfp flags during boot,
so this moves it out of slab/slub and uses it with the page allocator
as well.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:12:57 -07:00
KAMEZAWA Hiroyuki
2ffebca6aa memcg: fix lru rotation in isolate_pages
Try to fix memcg's lru rotation sanity: make memcg use the same logic as
the global LRU does.

Now, at __isolate_lru_page() retruns -EBUSY, the page is rotated to the
tail of LRU in global LRU's isolate LRU pages.  But in memcg, it's not
handled.  This makes memcg do the same behavior as global LRU and rotate
LRU in the page is busy.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:48 -07:00
KAMEZAWA Hiroyuki
22a668d7c3 memcg: fix behavior under memory.limit equals to memsw.limit
A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the
user just want to limit the total size of applications, in other words,
not very interested in memory usage itself.  In this case, swap-out will
be done only by global-LRU.

But, under current implementation, memory.limit_in_bytes is checked at
first and try_to_free_page() may do swap-out.  But, that swap-out is
useless for memsw.limit_in_bytes and the thread may hit limit again.

This patch tries to fix the current behavior at memory.limit ==
memsw.limit case.  And documentation is updated to explain the behavior of
this special case.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:48 -07:00
KAMEZAWA Hiroyuki
8a9478ca7f memcg: fix swap accounting
This patch fixes mis-accounting of swap usage in memcg.

In the current implementation, memcg's swap account is uncharged only when
swap is completely freed.  But there are several cases where swap cannot
be freed cleanly.  For handling that, this patch changes that memcg
uncharges swap account when swap has no references other than cache.

By this, memcg's swap entry accounting can be fully synchronous with the
application's behavior.

This patch also changes memcg's hooks for swap-out.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:47 -07:00