kernel-ark/mm
Christoph Lameter 6225e93735 Quicklists for page table pages
On x86_64 this cuts allocation overhead for page table pages down to a
fraction (kernel compile / editing load.  TSC based measurement of times spend
in each function):

no quicklist

pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
pgd_alloc                260022 1s(920ns/4us/263.1us)

quicklist:

pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)

pgd allocations are the most complex and there we see the most dramatic
improvement (may be we can cut down the amount of pgds cached somewhat?).  But
even the pte allocations still see a doubling of performance.

1. Proven code from the IA64 arch.

	The method used here has been fine tuned for years and
	is NUMA aware. It is based on the knowledge that accesses
	to page table pages are sparse in nature. Taking a page
	off the freelists instead of allocating a zeroed pages
	allows a reduction of number of cachelines touched
	in addition to getting rid of the slab overhead. So
	performance improves. This is particularly useful if pgds
	contain standard mappings. We can save on the teardown
	and setup of such a page if we have some on the quicklists.
	This includes avoiding lists operations that are otherwise
	necessary on alloc and free to track pgds.

2. Light weight alternative to use slab to manage page size pages

	Slab overhead is significant and even page allocator use
	is pretty heavy weight. The use of a per cpu quicklist
	means that we touch only two cachelines for an allocation.
	There is no need to access the page_struct (unless arch code
	needs to fiddle around with it). So the fast past just
	means bringing in one cacheline at the beginning of the
	page. That same cacheline may then be used to store the
	page table entry. Or a second cacheline may be used
	if the page table entry is not in the first cacheline of
	the page. The current code will zero the page which means
	touching 32 cachelines (assuming 128 byte). We get down
	from 32 to 2 cachelines in the fast path.

3. x86_64 gets lightweight page table page management.

	This will allow x86_64 arch code to faster repopulate pgds
	and other page table entries. The list operations for pgds
	are reduced in the same way as for i386 to the point where
	a pgd is allocated from the page allocator and when it is
	freed back to the page allocator. A pgd can pass through
	the quicklists without having to be reinitialized.

64 Consolidation of code from multiple arches

	So far arches have their own implementation of quicklist
	management. This patch moves that feature into the core allowing
	an easier maintenance and consistent management of quicklists.

Page table pages have the characteristics that they are typically zero or in a
known state when they are freed.  This is usually the exactly same state as
needed after allocation.  So it makes sense to build a list of freed page
table pages and then consume the pages already in use first.  Those pages have
already been initialized correctly (thus no need to zero them) and are likely
already cached in such a way that the MMU can use them most effectively.  Page
table pages are used in a sparse way so zeroing them on allocation is not too
useful.

Such an implementation already exits for ia64.  Howver, that implementation
did not support constructors and destructors as needed by i386 / x86_64.  It
also only supported a single quicklist.  The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.

Quicklists are defined by an arch defining CONFIG_QUICKLIST.  If more than one
quicklist is necessary then we can define NR_QUICK for additional lists.  F.e.
 i386 needs two and thus has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:

quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)

Page table pages can be freed using:

quicklist_free(<quicklist-nr>, <destructor>, <page>)

Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
..
allocpercpu.c
backing-dev.c [PATCH] nfs: fix congestion control 2007-03-16 19:25:05 -07:00
bootmem.c
bounce.c block: blk_max_pfn is somtimes wrong 2007-03-27 08:52:47 +02:00
fadvise.c
filemap_xip.c [PATCH] mm: fix xip issue with /dev/zero 2007-03-29 08:22:26 -07:00
filemap.c readahead: code cleanup 2007-05-07 12:12:52 -07:00
filemap.h
fremap.c [PATCH] mm: more rmap debugging 2006-12-22 08:55:49 -08:00
highmem.c [PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pages 2007-05-02 19:27:15 +02:00
hugetlb.c [PATCH] hugetlb: preserve hugetlb pte dirty state 2007-02-09 09:25:46 -08:00
internal.h Make page->private usable in compound pages 2007-05-07 12:12:53 -07:00
Kconfig Quicklists for page table pages 2007-05-07 12:12:54 -07:00
madvise.c [PATCH] holepunch: fix mmap_sem i_mutex deadlock 2007-03-29 08:22:26 -07:00
Makefile Quicklists for page table pages 2007-05-07 12:12:54 -07:00
memory_hotplug.c [PATCH] Fix sparsemem on Cell 2007-01-11 18:18:20 -08:00
memory.c Add unitialized_var() macro for suppressing gcc warnings 2007-05-07 12:12:52 -07:00
mempolicy.c [PATCH] Page migration: Fix vma flag checking 2007-03-05 07:57:51 -08:00
mempool.c [PATCH] Numerous fixes to kernel-doc info in source files. 2007-02-11 10:51:32 -08:00
migrate.c page migration: fix NR_FILE_PAGES accounting 2007-04-24 08:23:08 -07:00
mincore.c [PATCH] mincore: vma crossing fix 2007-02-15 09:57:03 -08:00
mlock.c
mmap.c [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction 2007-05-02 19:27:14 +02:00
mmzone.c
mprotect.c
mremap.c [PATCH] mm: mremap correct rmap accounting 2007-01-30 08:33:32 -08:00
msync.c
nommu.c [PATCH] nommu: fix bug ip_conntrack does not work on nommu 2007-04-12 15:31:42 -07:00
oom_kill.c allow oom_adj of saintly processes 2007-05-07 12:12:51 -07:00
page_alloc.c mm: optimize compound_head() by avoiding a shared page flag 2007-05-07 12:12:53 -07:00
page_io.c
page-writeback.c Use ZVC counters to establish exact size of dirtyable pages 2007-05-07 12:12:51 -07:00
pdflush.c
prio_tree.c
quicklist.c Quicklists for page table pages 2007-05-07 12:12:54 -07:00
readahead.c readahead: code cleanup 2007-05-07 12:12:52 -07:00
rmap.c [S390] split page_test_and_clear_dirty. 2007-04-27 16:01:46 +02:00
shmem_acl.c
shmem.c [PATCH] holepunch: fix disconnected pages after second truncate 2007-03-29 08:22:25 -07:00
slab.c Add virt_to_head_page and consolidate code in slab and slub 2007-05-07 12:12:54 -07:00
slob.c slab: introduce krealloc 2007-05-07 12:12:50 -07:00
slub.c slub: remove object activities out of checking functions 2007-05-07 12:12:54 -07:00
sparse.c
swap_state.c
swap.c Make page->private usable in compound pages 2007-05-07 12:12:53 -07:00
swapfile.c mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
thrash.c
tiny-shmem.c [PATCH] mm/{,tiny-}shmem.c cleanups 2007-03-01 14:53:35 -08:00
truncate.c [PATCH] VM: invalidate_inode_pages2_range() should not exit early 2007-03-01 14:53:39 -08:00
util.c
vmalloc.c [PATCH] x86-64: Fix vmalloc_32 to really allocate <4GB on 64bit platforms 2007-05-02 19:27:12 +02:00
vmscan.c [PATCH] throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations 2007-03-01 14:53:38 -08:00
vmstat.c [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM 2007-02-11 10:51:18 -08:00