Commit Graph

27 Commits

Author SHA1 Message Date
Nick Piggin
a482289d46 [PATCH] hugepage allocator cleanup
Insert "fresh" huge pages into the hugepage allocator by the same means as
they are freed back into it.  This reduces code size and allows
enqueue_huge_page to be inlined into the hugepage free fastpath.

Eliminate occurances of hugepages on the free list with non-zero refcount.
This can allow stricter refcount checks in future.  Also required for
lockless pagecache.

Signed-off-by: Nick Piggin <npiggin@suse.de>

"This patch also eliminates a leak "cleaned up" by re-clobbering the
refcount on every allocation from the hugepage freelists.  With respect to
the lockless pagecache, the crucial aspect is to eliminate unconditional
set_page_count() to 0 on pages with potentially nonzero refcounts, though
closer inspection suggests the assignments removed are entirely spurious."

Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:58 -08:00
Hugh Dickins
41d78ba550 [PATCH] compound page: use page[1].lru
If a compound page has its own put_page_testzero destructor (the only current
example is free_huge_page), that is noted in page[1].mapping of the compound
page.  But that's rather a poor place to keep it: functions which call
set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
__ib_umem_release) ought to be checking first, otherwise set_page_dirty is
liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page
needs a destructor, so we can no longer rely on hugetlb pages going their own
special way, to avoid further problems of page->mapping reuse.  For example,
not many people know that: on 50% of i386 -Os builds, the first tail page of a
compound page purports to be PageAnon (when its destructor has an odd
address), which surprises page_add_file_rmap.

Keep the compound page destructor in page[1].lru.next instead.  And to free up
the common pairing of mapping and index, also move compound page order from
index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now
have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
caller to check that they're not compund pages before doing the dirty).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Christoph Lameter
0df420d8b6 [PATCH] hugetlbpage: return VM_FAULT_OOM on oom
Remove wrong and misleading comments.

Return VM_FAULT_OOM if the hugetlbpage fault handler cannot allocate a
page.  do_no_page will end up doing do_exit(SIGKILL).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:31 -08:00
David Gibson
a2dfef6947 [PATCH] Hugepages need clear_user_highpage() not clear_highpage()
When hugepages are newly allocated to a file in mm/hugetlb.c, we clear them
with a call to clear_highpage() on each of the subpages.  We should be
using clear_user_highpage(): on powerpc, at least, clear_highpage() doesn't
correctly mark the page as icache dirty so if the page is executed shortly
after it's possible to get strange results.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:31 -08:00
Christoph Lameter
64b4a954b0 [PATCH] hugetlb: add comment explaining reasons for Bus Errors
I just spent some time researching a Bus Error.  Turns out that the huge
page fault handler can return VM_FAULT_SIGBUS for various conditions where
no huge page is available.

Add a note explaining the reasoning in the source.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Christoph Lameter
aea47ff363 [PATCH] mm: make hugepages obey cpusets.
See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2
http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2

Make hugepages obey cpusets.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:43 -08:00
Christoph Lameter
6bda666a03 [PATCH] hugepages: fold find_or_alloc_pages into huge_no_page()
The number of parameters for find_or_alloc_page increases significantly after
policy support is added to huge pages.  Simplify the code by folding
find_or_alloc_huge_page() into hugetlb_no_page().

Adam Litke objected to this piece in an earlier patch but I think this is a
good simplification.  Diffstat shows that we can get rid of almost half of the
lines of find_or_alloc_page().  If we can find no consensus then lets simply
drop this patch.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:23 -08:00
Christoph Lameter
5da7ca8607 [PATCH] Add NUMA policy support for huge pages.
The huge_zonelist() function in the memory policy layer provides an list of
zones ordered by NUMA distance.  The hugetlb layer will walk that list looking
for a zone that has available huge pages but is also in the nodeset of the
current cpuset.

This patch does not contain the folding of find_or_alloc_huge_page() that was
controversial in the earlier discussion.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:23 -08:00
Christoph Lameter
96df9333c9 [PATCH] mm: dequeue a huge page near to this node
This was discussed at
http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2

This patch changes the dequeueing to select a huge page near the node
executing instead of always beginning to check for free nodes from node 0.
This will result in a placement of the huge pages near the executing
processor improving performance.

The existing implementation can place the huge pages far away from the
executing processor causing significant degradation of performance.  The
search starting from zero also means that the lower zones quickly run out
of memory.  Selecting a huge page near the process distributed the huge
pages better.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:23 -08:00
David Gibson
1e8f889b10 [PATCH] Hugetlb: Copy on Write support
Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported.  This helps us to safely use hugetlb pages in many more
applications.  The patch makes the following changes.  If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes.  The
   writable check in make_huge_pte is moved out to the caller for use by COW
   later.

2. Hugetlb copy-on-write requires special case handling in the following
   situations:

   - copy_hugetlb_page_range() - Copied pages must be write protected so
     a COW fault will be triggered (if necessary) if those pages are written
     to.

   - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
     page cache.  MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow() and calls from hugetlb_fault() and
   hugetlb_no_page() which handles the COW fault by making the actual copy.

4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps
   will be allowed.  Make MAP_HUGETLB exempt from the depricated VM_RESERVED
   mapping check.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:23 -08:00
Adam Litke
86e5216f8d [PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COW
This patch splits the "no_page()" type activity into its own function,
hugetlb_no_page().  hugetlb_fault() becomes the entry point for hugetlb faults
and delegates to the appropriate handler depending on the type of fault.
Right now we still have only hugetlb_no_page() but a later patch introduces a
COW fault.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:22 -08:00
Adam Litke
85ef47f74a [PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
find_lock_huge_page() isn't a great name, since it does extra things not
analagous to find_lock_page().  Rename it find_or_alloc_huge_page() which is
closer to the mark.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:22 -08:00
Adam Litke
f0916794f0 [PATCH] Hugetlb: Remove duplicate i_size check
cleanup

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:22 -08:00
Eric Paris
0bd0f9fb19 [PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of nr_huge_pages
If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously
it is possible for the nr_huge_pages variable to become incorrect.  There
is no locking in the set_max_huge_pages function around
alloc_fresh_huge_page which is able to update nr_huge_pages.  Two callers
to alloc_fresh_huge_page could race against each other as could a call to
alloc_fresh_huge_page and a call to update_and_free_page.  This patch just
expands the area covered by the hugetlb_lock to cover the call into
alloc_fresh_huge_page.  I'm not sure how we could say that a sysctl section
is performance critical where more specific locking would be needed.

My reproducer was to run a couple copies of the following script
simultaneously

while [ true ]; do
	echo 1000 > /proc/sys/vm/nr_hugepages
	echo 500 > /proc/sys/vm/nr_hugepages
	echo 750 > /proc/sys/vm/nr_hugepages
	echo 100 > /proc/sys/vm/nr_hugepages
	echo 0 > /proc/sys/vm/nr_hugepages
done

and then watch /proc/meminfo and eventually you will see things like

HugePages_Total:     100
HugePages_Free:      109

After applying the patch all seemed well.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:43 -08:00
Adrian Bunk
99697dc02d [PATCH] unexport hugetlb_total_pages
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:06 -08:00
Benjamin Herrenschmidt
3c726f8dee [PATCH] ppc64: support 64k pages
Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
base page size to 64K.  The resulting kernel still boots on any
hardware.  On current machines with 4K pages support only, the kernel
will maintain 16 "subpages" for each 64K page transparently.

Note that while real 64K capable HW has been tested, the current patch
will not enable it yet as such hardware is not released yet, and I'm
still verifying with the firmware architects the proper to get the
information from the newer hypervisors.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-06 16:56:47 -08:00
Adam Litke
4c88726597 [PATCH] hugetlb: demand fault handler
Below is a patch to implement demand faulting for huge pages.  The main
motivation for changing from prefaulting to demand faulting is so that huge
page memory areas can be allocated according to NUMA policy.

Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler.  The bulk of the patch just moves the logic from
hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page().

Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:43 -07:00
Hugh Dickins
508034a32b [PATCH] mm: unmap_vmas with inner ptlock
Remove the page_table_lock from around the calls to unmap_vmas, and replace
the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
now safe to descend without page_table_lock.

Don't attempt fancy locking for hugepages, just take page_table_lock in
unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
does unmap_vmas have much use for its mm arg now.

The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
page_table_lock: if they're implemented at all, they typically come down to
flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
(which we already audited for the mprotect case).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
c74df32c72 [PATCH] mm: ptd_alloc take ptlock
Second step in pushing down the page_table_lock.  Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code.  But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate.  It appears that on
all arches we can safely descend without page_table_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
365e9c87a9 [PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability.  Originally it was called whenever rss or
total_vm got raised.  Then many of those callsites were replaced by a timer
tick call from account_system_time.  Now Frank van Maarseveen reports that to
be found inadequate.  How about this?  Works for Frank.

Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths.  Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit.  Handle
mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.

And there has been no collector of these hiwater statistics in the tree.  The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).

There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high.  A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.

What locking?  None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact.  But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:39 -07:00
Hugh Dickins
4294621f41 [PATCH] mm: rss = file_rss + anon_rss
I was lazy when we added anon_rss, and chose to change as few places as
possible.  So currently each anonymous page has to be counted twice, in rss
and in anon_rss.  Which won't be so good if those are atomic counts in some
configurations.

Change that around: keep file_rss and anon_rss separately, and add them
together (with get_mm_rss macro) when the total is needed - reading two
atomics is much cheaper than updating two atomics.  And update anon_rss
upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Hugh Dickins
ac9b9c667c [PATCH] Fix handling spurious page fault for hugetlb region
This reverts commit 3359b54c8c and
replaces it with a cleaner version that is purely based on page table
operations, so that the synchronization between inode size and hugetlb
mappings becomes moot.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-20 09:02:07 -07:00
Hugh Dickins
1c59827d1d [PATCH] mm: hugetlb truncation fixes
hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
forgets that: crashes and misaccounting ensue.

copy_hugetlb_page_range better grab the src page_table_lock since we don't
want to guess what happens if concurrently truncated.  unmap_hugepage_range
rss accounting must not assume the full range was mapped.  follow_hugetlb_page
must guard with page_table_lock and be prepared to exit early.

Restyle copy_hugetlb_page_range with a for loop like the others there.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19 23:04:30 -07:00
Adam Litke
7bf07f3d4b [PATCH] hugetlb: move stale pte check into huge_pte_alloc()
Initial Post (Wed, 17 Aug 2005)

This patch moves the
	if (! pte_none(*pte))
		hugetlb_clean_stale_pgtable(pte);
logic into huge_pte_alloc() so all of its callers can be immune to the bug
described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

> It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
> huge_pte_alloc() might return a pmd that points to a PTE page. It happens
> if the virtual address for hugetlb mmap is recycled from previously used
> normal page mmap. free_pgtables() might not scrub the pmd entry on
> munmap and hugetlb_prefault skips on any pmd presence regardless what type
> it is.

Unless I am missing something, it seems more correct to place the check inside
huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated.
It also allows checking for this condition when lazily faulting huge pages
later in the series.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:46 -07:00
David Gibson
c7546f8f03 [PATCH] Fix hugepage crash on failing mmap()
This patch fixes a crash in the hugepage code.  unmap_hugepage_area() was
assuming that (due to prefault) PTEs must exist for all the area in
question.  However, this may not be the case, if mmap() encounters an error
before the prefault and calls unmap_region() to clean up any partial
mapping.

Depending on the hugepage configuration, this crash can be triggered by an
unpriveleged user.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-05 12:22:37 -07:00
David Gibson
63551ae0fe [PATCH] Hugepage consolidation
A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This patch
attempts to consolidate a lot of the code across the arch's, putting the
combined version in mm/hugetlb.c.  There are a couple of uglyish hacks in
order to covert all the hugepage archs, but the result is a very large
reduction in the total amount of code.  It also means things like hugepage
lazy allocation could be implemented in one place, instead of six.

Tested, at least a little, on ppc64, i386 and x86_64.

Notes:
	- this patch changes the meaning of set_huge_pte() to be more
	  analagous to set_pte()
	- does SH4 need s special huge_ptep_get_and_clear()??

Acked-by: William Lee Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:15 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00