6225e93735
On x86_64 this cuts allocation overhead for page table pages down to a fraction (kernel compile / editing load. TSC based measurement of times spend in each function): no quicklist pte_alloc 1569048 4.3s(401ns/2.7us/179.7us) pmd_alloc 780988 2.1s(337ns/2.7us/86.1us) pud_alloc 780072 2.2s(424ns/2.8us/300.6us) pgd_alloc 260022 1s(920ns/4us/263.1us) quicklist: pte_alloc 452436 573.4ms(8ns/1.3us/121.1us) pmd_alloc 196204 174.5ms(7ns/889ns/46.1us) pud_alloc 195688 172.4ms(7ns/881ns/151.3us) pgd_alloc 65228 9.8ms(8ns/150ns/6.1us) pgd allocations are the most complex and there we see the most dramatic improvement (may be we can cut down the amount of pgds cached somewhat?). But even the pte allocations still see a doubling of performance. 1. Proven code from the IA64 arch. The method used here has been fine tuned for years and is NUMA aware. It is based on the knowledge that accesses to page table pages are sparse in nature. Taking a page off the freelists instead of allocating a zeroed pages allows a reduction of number of cachelines touched in addition to getting rid of the slab overhead. So performance improves. This is particularly useful if pgds contain standard mappings. We can save on the teardown and setup of such a page if we have some on the quicklists. This includes avoiding lists operations that are otherwise necessary on alloc and free to track pgds. 2. Light weight alternative to use slab to manage page size pages Slab overhead is significant and even page allocator use is pretty heavy weight. The use of a per cpu quicklist means that we touch only two cachelines for an allocation. There is no need to access the page_struct (unless arch code needs to fiddle around with it). So the fast past just means bringing in one cacheline at the beginning of the page. That same cacheline may then be used to store the page table entry. Or a second cacheline may be used if the page table entry is not in the first cacheline of the page. The current code will zero the page which means touching 32 cachelines (assuming 128 byte). We get down from 32 to 2 cachelines in the fast path. 3. x86_64 gets lightweight page table page management. This will allow x86_64 arch code to faster repopulate pgds and other page table entries. The list operations for pgds are reduced in the same way as for i386 to the point where a pgd is allocated from the page allocator and when it is freed back to the page allocator. A pgd can pass through the quicklists without having to be reinitialized. 64 Consolidation of code from multiple arches So far arches have their own implementation of quicklist management. This patch moves that feature into the core allowing an easier maintenance and consistent management of quicklists. Page table pages have the characteristics that they are typically zero or in a known state when they are freed. This is usually the exactly same state as needed after allocation. So it makes sense to build a list of freed page table pages and then consume the pages already in use first. Those pages have already been initialized correctly (thus no need to zero them) and are likely already cached in such a way that the MMU can use them most effectively. Page table pages are used in a sparse way so zeroing them on allocation is not too useful. Such an implementation already exits for ia64. Howver, that implementation did not support constructors and destructors as needed by i386 / x86_64. It also only supported a single quicklist. The implementation here has constructor and destructor support as well as the ability for an arch to specify how many quicklists are needed. Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more than one quicklist is necessary then we can define NR_QUICK for additional lists. F.e. i386 needs two and thus has config NR_QUICK int default 2 If an arch has requested quicklist support then pages can be allocated from the quicklist (or from the page allocator if the quicklist is empty) via: quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>) Page table pages can be freed using: quicklist_free(<quicklist-nr>, <destructor>, <page>) Pages must have a definite state after allocation and before they are freed. If no constructor is specified then pages will be zeroed on allocation and must be zeroed before they are freed. If a constructor is used then the constructor will establish a definite page state. F.e. the i386 and x86_64 pgd constructors establish certain mappings. Constructors and destructors can also be used to track the pages. i386 and x86_64 use a list of pgds in order to be able to dynamically update standard mappings. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
171 lines
5.1 KiB
Plaintext
171 lines
5.1 KiB
Plaintext
config SELECT_MEMORY_MODEL
|
|
def_bool y
|
|
depends on EXPERIMENTAL || ARCH_SELECT_MEMORY_MODEL
|
|
|
|
choice
|
|
prompt "Memory model"
|
|
depends on SELECT_MEMORY_MODEL
|
|
default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
|
|
default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
|
|
default FLATMEM_MANUAL
|
|
|
|
config FLATMEM_MANUAL
|
|
bool "Flat Memory"
|
|
depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || ARCH_FLATMEM_ENABLE
|
|
help
|
|
This option allows you to change some of the ways that
|
|
Linux manages its memory internally. Most users will
|
|
only have one option here: FLATMEM. This is normal
|
|
and a correct option.
|
|
|
|
Some users of more advanced features like NUMA and
|
|
memory hotplug may have different options here.
|
|
DISCONTIGMEM is an more mature, better tested system,
|
|
but is incompatible with memory hotplug and may suffer
|
|
decreased performance over SPARSEMEM. If unsure between
|
|
"Sparse Memory" and "Discontiguous Memory", choose
|
|
"Discontiguous Memory".
|
|
|
|
If unsure, choose this option (Flat Memory) over any other.
|
|
|
|
config DISCONTIGMEM_MANUAL
|
|
bool "Discontiguous Memory"
|
|
depends on ARCH_DISCONTIGMEM_ENABLE
|
|
help
|
|
This option provides enhanced support for discontiguous
|
|
memory systems, over FLATMEM. These systems have holes
|
|
in their physical address spaces, and this option provides
|
|
more efficient handling of these holes. However, the vast
|
|
majority of hardware has quite flat address spaces, and
|
|
can have degraded performance from extra overhead that
|
|
this option imposes.
|
|
|
|
Many NUMA configurations will have this as the only option.
|
|
|
|
If unsure, choose "Flat Memory" over this option.
|
|
|
|
config SPARSEMEM_MANUAL
|
|
bool "Sparse Memory"
|
|
depends on ARCH_SPARSEMEM_ENABLE
|
|
help
|
|
This will be the only option for some systems, including
|
|
memory hotplug systems. This is normal.
|
|
|
|
For many other systems, this will be an alternative to
|
|
"Discontiguous Memory". This option provides some potential
|
|
performance benefits, along with decreased code complexity,
|
|
but it is newer, and more experimental.
|
|
|
|
If unsure, choose "Discontiguous Memory" or "Flat Memory"
|
|
over this option.
|
|
|
|
endchoice
|
|
|
|
config DISCONTIGMEM
|
|
def_bool y
|
|
depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL
|
|
|
|
config SPARSEMEM
|
|
def_bool y
|
|
depends on SPARSEMEM_MANUAL
|
|
|
|
config FLATMEM
|
|
def_bool y
|
|
depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL
|
|
|
|
config FLAT_NODE_MEM_MAP
|
|
def_bool y
|
|
depends on !SPARSEMEM
|
|
|
|
#
|
|
# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
|
|
# to represent different areas of memory. This variable allows
|
|
# those dependencies to exist individually.
|
|
#
|
|
config NEED_MULTIPLE_NODES
|
|
def_bool y
|
|
depends on DISCONTIGMEM || NUMA
|
|
|
|
config HAVE_MEMORY_PRESENT
|
|
def_bool y
|
|
depends on ARCH_HAVE_MEMORY_PRESENT || SPARSEMEM
|
|
|
|
#
|
|
# SPARSEMEM_EXTREME (which is the default) does some bootmem
|
|
# allocations when memory_present() is called. If this cannot
|
|
# be done on your architecture, select this option. However,
|
|
# statically allocating the mem_section[] array can potentially
|
|
# consume vast quantities of .bss, so be careful.
|
|
#
|
|
# This option will also potentially produce smaller runtime code
|
|
# with gcc 3.4 and later.
|
|
#
|
|
config SPARSEMEM_STATIC
|
|
def_bool n
|
|
|
|
#
|
|
# Architecture platforms which require a two level mem_section in SPARSEMEM
|
|
# must select this option. This is usually for architecture platforms with
|
|
# an extremely sparse physical address space.
|
|
#
|
|
config SPARSEMEM_EXTREME
|
|
def_bool y
|
|
depends on SPARSEMEM && !SPARSEMEM_STATIC
|
|
|
|
# eventually, we can have this option just 'select SPARSEMEM'
|
|
config MEMORY_HOTPLUG
|
|
bool "Allow for memory hot-add"
|
|
depends on SPARSEMEM || X86_64_ACPI_NUMA
|
|
depends on HOTPLUG && !SOFTWARE_SUSPEND && ARCH_ENABLE_MEMORY_HOTPLUG
|
|
depends on (IA64 || X86 || PPC64)
|
|
|
|
comment "Memory hotplug is currently incompatible with Software Suspend"
|
|
depends on SPARSEMEM && HOTPLUG && SOFTWARE_SUSPEND
|
|
|
|
config MEMORY_HOTPLUG_SPARSE
|
|
def_bool y
|
|
depends on SPARSEMEM && MEMORY_HOTPLUG
|
|
|
|
# Heavily threaded applications may benefit from splitting the mm-wide
|
|
# page_table_lock, so that faults on different parts of the user address
|
|
# space can be handled with less contention: split it at this NR_CPUS.
|
|
# Default to 4 for wider testing, though 8 might be more appropriate.
|
|
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
|
|
# PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
|
|
#
|
|
config SPLIT_PTLOCK_CPUS
|
|
int
|
|
default "4096" if ARM && !CPU_CACHE_VIPT
|
|
default "4096" if PARISC && !PA20
|
|
default "4"
|
|
|
|
#
|
|
# support for page migration
|
|
#
|
|
config MIGRATION
|
|
bool "Page migration"
|
|
def_bool y
|
|
depends on NUMA
|
|
help
|
|
Allows the migration of the physical location of pages of processes
|
|
while the virtual addresses are not changed. This is useful for
|
|
example on NUMA systems to put pages nearer to the processors accessing
|
|
the page.
|
|
|
|
config RESOURCES_64BIT
|
|
bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
|
|
default 64BIT
|
|
help
|
|
This option allows memory and IO resources to be 64 bit.
|
|
|
|
config ZONE_DMA_FLAG
|
|
int
|
|
default "0" if !ZONE_DMA
|
|
default "1"
|
|
|
|
config NR_QUICK
|
|
int
|
|
depends on QUICKLIST
|
|
default "1"
|
|
|