kernel-ark/mm
Nishanth Aravamudan a41f24ea9f page allocator: smarter retry of costly-order allocations
Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim does no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may continue to reclaim small orders of pages all over, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be requred. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).

This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
allocations, we will try up to some point (at least 1<<order reclaimed
pages), rather than forever (which is the case for allocations <=
PAGE_ALLOC_COSTLY_ORDER).

This change improves the /proc/sys/vm/nr_hugepages interface with a
follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
than administrators repeatedly echo'ing a particular value into the
sysctl, and forcing reclaim into action manually, this change allows for
the sysctl to attempt a reasonable effort itself. Similarly, dynamic
pool growth should be more successful under load, as lumpy reclaim can
try to free up pages, rather than failing right away.

Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to every succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:05:58 -07:00
..
allocpercpu.c cpumask: Cleanup more uses of CPU_MASK and NODE_MASK 2008-04-19 19:44:58 +02:00
backing-dev.c mm/backing-dev.c: fix percpu_counter_destroy call bug in bdi_init 2007-12-05 09:21:18 -08:00
bootmem.c memory hotplug: make alloc_bootmem_section() 2008-04-28 08:58:25 -07:00
bounce.c block: Initial support for data-less (or empty) barrier support 2007-10-16 11:03:56 +02:00
dmapool.c dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too 2008-04-28 08:58:20 -07:00
fadvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap_xip.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap.c mm: rotate_reclaimable_page() cleanup 2008-04-28 08:58:20 -07:00
fremap.c mm: fix various kernel-doc comments 2008-03-19 18:53:35 -07:00
highmem.c mm: highmem kernel-doc additions 2008-03-19 18:53:35 -07:00
hugetlb.c mm: fix integer as NULL pointer warnings 2008-04-28 17:29:18 -07:00
internal.h memory hotplug: free memmaps allocated by bootmem 2008-04-28 08:58:26 -07:00
Kconfig PAGEFLAGS_EXTENDED and separate page flags for Head and Tail 2008-04-28 08:58:22 -07:00
maccess.c kgdb: fix optional arch functions and probe_kernel_* 2008-04-17 20:05:39 +02:00
madvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
Makefile uaccess: add probe_kernel_write() 2008-04-17 20:05:36 +02:00
memcontrol.c memcg: fix node_state handling 2008-04-08 18:25:53 -07:00
memory_hotplug.c mm/memory_hotplug.c must #include "internal.h" 2008-04-28 13:44:29 -07:00
memory.c mm: add vm_insert_mixed 2008-04-28 08:58:23 -07:00
mempolicy.c mempolicy: use struct mempolicy pointer in shmem_sb_info 2008-04-28 08:58:25 -07:00
mempool.c spelling fixes: mm/ 2007-10-20 01:27:18 +02:00
migrate.c memcg: fix VM_BUG_ON from page migration 2008-03-04 16:35:14 -08:00
mincore.c mm: remove nopage 2008-04-28 08:58:18 -07:00
mlock.c do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITY 2007-07-16 09:05:37 -07:00
mmap.c mempolicy: rename mpol_copy to mpol_dup 2008-04-28 08:58:23 -07:00
mmzone.c mm: filter based on a nodemask as well as a gfp_mask 2008-04-28 08:58:19 -07:00
mprotect.c fix mprotect vma_wants_writenotify prot 2007-10-23 08:32:06 -07:00
mremap.c sparse pointer use of zero as null 2007-10-18 14:37:31 -07:00
msync.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
nommu.c mm/nommu.c: return 0 from kobjsize with invalid objects 2008-04-28 08:58:26 -07:00
oom_kill.c oom_kill: remove unused parameter in badness() 2008-04-28 08:58:26 -07:00
page_alloc.c page allocator: smarter retry of costly-order allocations 2008-04-29 08:05:58 -07:00
page_io.c mm: fix PageUptodate data race 2008-02-05 09:44:19 -08:00
page_isolation.c memory hotremove: unset migrate type "ISOLATE" after removal 2007-11-14 18:45:38 -08:00
page-writeback.c writeback: speed up writeback of big dirty files 2008-02-05 09:44:19 -08:00
pagewalk.c mm: fix possible off-by-one in walk_pte_range() 2008-04-28 08:58:16 -07:00
pdflush.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial 2008-04-21 16:36:46 -07:00
prio_tree.c spelling fixes: mm/ 2007-10-20 01:27:18 +02:00
quicklist.c quicklists: Only consider memory that can be used with GFP_KERNEL 2008-01-14 08:52:22 -08:00
readahead.c mm/readahead: fix kernel-doc notation 2008-03-19 18:53:37 -07:00
rmap.c mm: remove nopage 2008-04-28 08:58:18 -07:00
shmem_acl.c [PATCH] Fix typos in mm/shmem_acl.c 2006-10-11 11:14:23 -07:00
shmem.c mempolicy: use struct mempolicy pointer in shmem_sb_info 2008-04-28 08:58:25 -07:00
slab.c mm: move cache_line_size() to <linux/cache.h> 2008-04-28 08:58:19 -07:00
slob.c slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment. 2008-04-27 18:25:51 +03:00
slub.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 2008-04-28 14:08:56 -07:00
sparse-vmemmap.c NULL noise: fs/*, mm/*, kernel/* 2008-03-30 14:18:41 -07:00
sparse.c memory hotplug: free memmaps allocated by bootmem 2008-04-28 08:58:26 -07:00
swap_state.c mm: fix various kernel-doc comments 2008-03-19 18:53:35 -07:00
swap.c mm: rotate_reclaimable_page() cleanup 2008-04-28 08:58:20 -07:00
swapfile.c mm: try both endianess when checking for endianess 2008-04-28 08:58:19 -07:00
thrash.c Bug in mm/thrash.c function grab_swap_token() 2007-05-11 08:29:32 -07:00
tiny-shmem.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2008-03-25 08:57:47 -07:00
truncate.c fix invalidate_inode_pages2_range() to not clear ret 2008-04-28 08:58:18 -07:00
util.c fix mm/util.c:krealloc() 2007-11-14 18:45:41 -08:00
vmalloc.c vmallocinfo: add caller information 2008-04-28 08:58:21 -07:00
vmscan.c page allocator: smarter retry of costly-order allocations 2008-04-29 08:05:58 -07:00
vmstat.c vmstats: add cond_resched() to refresh_cpu_vm_stats() 2008-04-28 08:58:26 -07:00