percpu core doesn't need to tack all the allocated pages. It needs to
know whether certain pages are populated and a way to reverse map
address to page when freeing. This patch drops pcpu_chunk->page[] and
use populated bitmap and vmalloc_to_page() lookup instead. Using
vmalloc_to_page() exclusively is also possible but complicates first
chunk handling, inflates cache footprint and prevents non-standard
memory allocation for percpu memory.
pcpu_chunk->page[] was used to track each page's allocation and
allowed asymmetric population which happens during failure path;
however, with single bitmap for all units, this is no longer possible.
Bite the bullet and rewrite (de)populate functions so that things are
done in clearly separated steps such that asymmetric population
doesn't happen. This makes the (de)population process much more
modular and will also ease implementing non-standard memory usage in
the future (e.g. large pages).
This makes @get_page_fn parameter to pcpu_setup_first_chunk()
unnecessary. The parameter is dropped and all first chunk helpers are
updated accordingly. Please note that despite the volume most changes
to first chunk helpers are symbol renames for variables which don't
need to be referenced outside of the helper anymore.
This change reduces memory usage and cache footprint of pcpu_chunk.
Now only #unit_pages bits are necessary per chunk.
[ Impact: reduced memory usage and cache footprint for bookkeeping ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
(de)populate functions are about to be reimplemented to drop
pcpu_chunk->page array. Move a few functions so that the rewrite
patch doesn't have code movement making it more difficult to read.
[ Impact: code movement ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Now that all first chunk allocator helpers allocate and map the first
chunk themselves, there's no need to have optional default alloc/map
in pcpu_setup_first_chunk(). Drop @populate_pte_fn and only leave
@dyn_size optional and make all other params mandatory.
This makes it much easier to follow what pcpu_setup_first_chunk() is
doing and what actual differences tweaking each parameter results in.
[ Impact: drop unused code path ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Generalize and move x86 setup_pcpu_lpage() into
pcpu_lpage_first_chunk(). setup_pcpu_lpage() now is a simple wrapper
around the generalized version. Other than taking size parameters and
using arch supplied callbacks to allocate/free/map memory,
pcpu_lpage_first_chunk() is identical to the original implementation.
This simplifies arch code and will help converting more archs to
dynamic percpu allocator.
While at it, factor out pcpu_calc_fc_sizes() which is common to
pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
[ Impact: code reorganization and generalization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
At first, percpu first chunk was always setup page-by-page by the
generic code. To add other allocators, different parts of the generic
initialization was made optional. Now we have three allocators -
embed, remap and 4k. embed and remap fully handle allocation and
mapping of the first chunk while 4k still depends on generic code for
those. This makes the generic alloc/map paths specifci to 4k and
makes the code unnecessary complicated with optional generic
behaviors.
This patch makes the 4k allocator to allocate and map memory directly
instead of depending on the generic code. The only outside visible
change is that now dynamic area in the first chunk is allocated
up-front instead of on-demand. This doesn't make any meaningful
difference as the area is minimal (usually less than a page, just
enough to fill the alignment) on 4k allocator. Plus, dynamic area in
the first chunk usually gets fully used anyway.
This will allow simplification of pcpu_setpu_first_chunk() and removal
of chunk->page array.
[ Impact: no outside visible change other than up-front allocation of dyn area ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
setup_pcpu_4k() now is a simple wrapper around the generalized
version. Other than taking size parameters and using arch supplied
callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
to the original implementation.
This simplifies arch code and will help converting more archs to
dynamic percpu allocator.
While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
consistency.
[ Impact: code reorganization and generalization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
The only extra feature @unit_size provides is making dead space at the
end of the first chunk which doesn't have any valid usecase. Drop the
parameter. This will increase consistency with generalized 4k
allocator.
James Bottomley spotted missing conversion for the default
setup_per_cpu_areas() which caused build breakage on all arcsh which
use it.
[ Impact: drop unused code path ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ingo Molnar <mingo@elte.hu>
The @addr passed into pcpu_chunk_addr_search() is unit0 based address
and thus should be matched inside unit0 area. Currently, when it uses
chunk size when determining whether the address falls in the first
chunk. Addresses in unitN where N>0 shouldn't be passed in anyway, so
this doesn't cause any malfunction but fix it for consistency.
[ Impact: mostly cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
dynamic percpu allocator. The first chunk is allocated using
embedding helper and 8k is reserved for modules. This ensures that
the new allocator behaves almost identically to the original allocator
as long as static percpu variables are concerned, so it shouldn't
introduce much breakage.
s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing
range limit the addressing model imposes. Unfortunately, this breaks
if the address is specified using a variable, so for now, the two
archs aren't converted.
The following architectures are affected by this change.
* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc (broken, under investigation)
* m32r
* powerpc(32)
As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs. These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.
* powerpc(64)
* sparc(64)
* ia64
* alpha
* s390
Boot and batch alloc/free tests on x86_32 with debug code (x86_32
doesn't use default first chunk initialization). Compile tested on
sparc(32), powerpc(32), arm and alpha.
Kyle McMartin reported that this change breaks parisc. The problem is
still under investigation and he is okay with pushing this patch
forward and fixing parisc later.
[ Impact: use dynamic allocator for most archs w/o custom percpu setup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
According to Andi, it isn't clear whether lpage allocator is worth the
trouble as there are many processors where PMD TLB is far scarcer than
PTE TLB. The advantage or disadvantage probably depends on the actual
size of percpu area and specific processor. As performance
degradation due to TLB pressure tends to be highly workload specific
and subtle, it is difficult to decide which way to go without more
data.
This patch implements percpu_alloc kernel parameter to allow selecting
which first chunk allocator to use to ease debugging and testing.
While at it, make sure all the failure paths report why something
failed to help determining why certain allocator isn't working. Also,
kill the "Great future plan" comment which had already been realized
quite some time ago.
[ Impact: allow explicit percpu first chunk allocator selection ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as
the page is going to be returned to the page allocator. Only TLB
flushing can be put off such that vmalloc code can handle it lazily.
Fix it.
[ Impact: fix subtle virtual cache flush bug ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Impact: use page->index for addr to chunk mapping instead of dedicated rbtree
The rbtree is used to determine the chunk from the virtual address.
However, we can already determine the page struct from a virtual
address and there are several unused fields in page struct used by
vmalloc. Use the index field to store a pointer to the chunk. Then
there is no need anymore for an rbtree.
tj: * s/(set|get)_chunk/pcpu_\1_page_chunk/
* Drop inline from the above two functions and moved them upwards
so that they are with other simple helpers.
* Initial pages might not (actually most of the time don't) live
in the vmalloc area. With the previous patch to manually
reverse-map both first chunks, this is no longer an issue.
Removed pcpu_set_chunk() call on initial pages.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: rusty@rustcorp.com.au
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: ralf@linux-mips.org
Cc: davem@davemloft.net
Cc: cooloney@kernel.org
Cc: kyle@mcmartin.ca
Cc: matthew@wil.cx
Cc: grundler@parisc-linux.org
Cc: takata@linux-m32r.org
Cc: benh@kernel.crashing.org
Cc: rth@twiddle.net
Cc: ink@jurassic.park.msu.ru
Cc: heiko.carstens@de.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
LKML-Reference: <49D43D58.4050102@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: both first chunks don't use rbtree, no functional change
There can be two first chunks - reserved and dynamic with the former
one being optional. Dynamic first chunk was linked on reverse-mapping
rbtree while the reserved one was mapped manually using the start
address and reserved offset limit.
This patch makes both first chunks to be looked up manually without
using the rbtree. This is to help getting rid of the rbtree.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: rusty@rustcorp.com.au
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: ralf@linux-mips.org
Cc: davem@davemloft.net
Cc: cooloney@kernel.org
Cc: kyle@mcmartin.ca
Cc: matthew@wil.cx
Cc: grundler@parisc-linux.org
Cc: takata@linux-m32r.org
Cc: benh@kernel.crashing.org
Cc: rth@twiddle.net
Cc: ink@jurassic.park.msu.ru
Cc: heiko.carstens@de.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux.com>
LKML-Reference: <49D43CEA.3040609@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: code reorganization
Separate out embedding first chunk setup helper from x86 embedding
first chunk allocator and put it in mm/percpu.c. This will be used by
the default percpu first chunk allocator and possibly by other archs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, more flexibility for first chunk init
Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
This restriction stemmed from implementation detail and made things a
bit less intuitive. This patch allows @dyn_size to be specified
regardless of @unit_size and swaps the positions of @dyn_size and
@unit_size so that the parameter order makes more sense (static,
reserved and dyn sizes followed by enclosing unit_size).
While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: generic addr <-> pcpu ptr conversion macros
There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
__pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start
defined, they'll do the right thing regardless of actual layout.
Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
and allow archs to override it as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: fix deadlock and allow atomic free
Percpu allocation always uses GFP_KERNEL and whole alloc/free paths
were protected by single mutex. All percpu allocations have been from
GFP_KERNEL-safe context and the original allocator had this assumption
too. However, by protecting both alloc and free paths with the same
mutex, the new allocator creates free -> alloc -> GFP_KERNEL
dependency which the original allocator didn't have. This can lead to
deadlock if free is called from FS or IO paths. Also, in general,
allocators are expected to allow free to be called from atomic
context.
This patch implements finer grained locking to break the deadlock and
allow atomic free. For details, please read the "Synchronization
rules" comment.
While at it, also add CONTEXT: to function comments to describe which
context they expect to be called from and what they do to it.
This problem was reported by Thomas Gleixner and Peter Zijlstra.
http://thread.gmane.org/gmane.linux.kernel/802384
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Impact: code reorganization for later changes
Do fully free chunk reclamation using a work. This change is to
prepare for locking changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: code reorganization for later changes
Separate out chunk area map extension into a separate function -
pcpu_extend_area_map() - and call it directly from pcpu_alloc() such
that pcpu_alloc_area() is guaranteed to have enough area map slots on
invocation.
With this change, pcpu_alloc_area() does only area allocation and the
only failure mode is when the chunk doens't have enough room, so
there's no need to distinguish it from memory allocation failures.
Make it return -1 on such cases instead of hacky -ENOSPC.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: code reorganization for later changes
With static map handling moved to pcpu_split_block(), pcpu_realloc()
only clutters the code and it's also unsuitable for scheduled locking
changes. Implement and use pcpu_mem_alloc/free() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: add reserved allocation functionality and use it for module
percpu variables
This patch implements reserved allocation from the first chunk. When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator. This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.
If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation. If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation. The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.
If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow sharing page map, no functional difference yet
Make chunk->page access indirect by adding a pointer and renaming the
actual array to page_ar. This will be used by future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: argument semantic cleanup
In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
auto-sizing. It's okay for @unit_size as 0 doesn't make sense but 0
dynamic reserve size is valid. Alos, if arch @dyn_size is calculated
from other parameters, it might end up passing in 0 @dyn_size and
malfunction when the size is automatically adjusted.
This patch makes both @unit_size and @dyn_size ssize_t and use -1 for
auto sizing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: no functional change
When the first chunk is created, its initial area map is not allocated
because kmalloc isn't online yet. The map is allocated and
initialized on the first allocation request on the chunk. This works
fine but the scattering of initialization logic between the init
function and allocation path is a bit confusing.
This patch makes the first chunk initialize and use minimal statically
allocated map from pcpu_setpu_first_chunk(). The map resizing path
still needs to handle this specially but it's more straight-forward
and gives more latitude to the init path. This will ease future
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cosmetic, preparation for future changes
Make the following renames in pcpur_setup_first_chunk() in preparation
for future changes.
* s/free_size/dyn_size/
* s/static_vm/first_vm/
* s/static_chunk/schunk/
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: remove compile warning
Mark local variable map_end in pcpu_populate_chunk() with
uninitialized_var(). The variable is always used in tandem with
map_start and guaranteed to be initialized before use but gcc doesn't
understand that.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Most global variables in percpu allocator are initialized during boot
and read only from that point on. Add __read_mostly as per Rusty's
suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Impact: more latitude for first percpu chunk allocation
The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().
It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping. This patch makes
the following changes to allow this.
* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
to reserve in the first chunk for further dynamic allocation.
* Rename pcpu_setup_static() to pcpu_setup_first_chunk().
* Make pcpu_setup_first_chunk() much more flexible by fetching page
pointer by callback and adding optional @unit_size, @free_size and
@base_addr arguments which allow archs to selectively part of chunk
initialization to their likings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE
In dynamic percpu allocator, there is no reason the unit size should
be power of two. Remove the restriction.
As non-power-of-two unit size means that empty chunks fall into the
same slot index as lightly occupied chunks which is bad for reclaming.
Reserve an extra slot for empty chunks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: allow larger alignment for early vmalloc area allocation
Some early vmalloc users might want larger alignment, for example, for
custom large page mapping. Add @align to vm_area_register_early().
While at it, drop docbook comment on non-existent @size.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Impact: fix short allocation leading to memory corruption
While dropping rvalue wrapping macros around global parameters,
pcpu_chunk_struct_size was set incorrectly resulting in shorter page
pointer array. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Andrew was concerned about the unit of variables named or have suffix
size. Every usage in percpu allocator is in bytes but make it super
clear by adding comments.
While at it, make pcpu_depopulate_chunk() take int @off and @size like
everyone else.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Impact: new scalable dynamic percpu allocator which allows dynamic
percpu areas to be accessed the same way as static ones
Implement scalable dynamic percpu allocator which can be used for both
static and dynamic percpu areas. This will allow static and dynamic
areas to share faster direct access methods. This feature is optional
and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
arch. Please read comment on top of mm/percpu.c for details.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>