Commit Graph

235 Commits

Author SHA1 Message Date
Cody P Schafer
55878e88c5 sparsemem: add BUILD_BUG_ON when sizeof mem_section is non-power-of-2
Instead of leaving a hidden trap for the next person who comes along and
wants to add something to mem_section, add a big fat warning about it
needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init()
to catch mistakes.

Right now non-power-of-2 mem_sections cause a number of WARNs at boot
(which don't clearly point to the size of mem_section as an issue), but
the system limps on (temporarily, at least).

This is based upon Dave Hansen's earlier RFC where he ran into the same
issue:
	"sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2"
	http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:40 -07:00
Jiang Liu
c3d5f5f0c2 mm: use a dedicated lock to protect totalram_pages and zone->managed_pages
Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
protect totalram_pages and zone->managed_pages.  Other than the memory
hotplug driver, totalram_pages and zone->managed_pages may also be
modified at runtime by other drivers, such as Xen balloon,
virtio_balloon etc.  For those cases, memory hotplug lock is a little
too heavy, so introduce a dedicated lock to protect totalram_pages and
zone->managed_pages.

Now we have a simplified locking rules totalram_pages and
zone->managed_pages as:

1) no locking for read accesses because they are unsigned long.
2) no locking for write accesses at boot time in single-threaded context.
3) serialize write accesses at runtime by acquiring the dedicated
   managed_page_count_lock.

Also adjust zone->managed_pages when freeing reserved pages into the
buddy system, to keep totalram_pages and zone->managed_pages in
consistence.

[akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Cody P Schafer
114d4b79f7 mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:29 -07:00
Cody P Schafer
72c3b51bda mm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:29 -07:00
Mel Gorman
283aba9f9e mm: vmscan: block kswapd if it is encountering pages under writeback
Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress.  This made no sense as the failure
to make progress could be completely independent of IO.  It was later
replaced by wait_iff_congested() and removed entirely by commit 258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().

This is problematic.  If kswapd encounters many pages under writeback
and it continues to scan until it reaches the high watermark then it
will quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer
was unable to write to the underlying BDI.  kswapd bypasses the BDI
congestion as it sets PF_SWAPWRITE but even if this was taken into
account then it would cause direct reclaimers to stall on writeback
which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback.  If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:28 -07:00
Mel Gorman
d43006d503 mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority
Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty encountered.  Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU
and the zone watermark which is no indication as to whether kswapd
should write pages or not.

This patch tracks if an excessive number of unqueued dirty pages are
being encountered at the end of the LRU.  If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:28 -07:00
Linus Torvalds
5d434fcb25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
  code cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
  mm: Convert print_symbol to %pSR
  gfs2: Convert print_symbol to %pSR
  m32r: Convert print_symbol to %pSR
  iostats.txt: add easy-to-find description for field 6
  x86 cmpxchg.h: fix wrong comment
  treewide: Fix typo in printk and comments
  doc: devicetree: Fix various typos
  docbook: fix 8250 naming in device-drivers
  pata_pdc2027x: Fix compiler warning
  treewide: Fix typo in printks
  mei: Fix comments in drivers/misc/mei
  treewide: Fix typos in kernel messages
  pm44xx: Fix comment for "CONFIG_CPU_IDLE"
  doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
  mmzone: correct "pags" to "pages" in comment.
  kernel-parameters: remove outdated 'noresidual' parameter
  Remove spurious _H suffixes from ifdef comments
  sound: Remove stray pluses from Kconfig file
  radio-shark: Fix printk "CONFIG_LED_CLASS"
  doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
  ...
2013-04-30 09:36:50 -07:00
Cody P Schafer
8761e31c22 mmzone: correct "pags" to "pages" in comment.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-03-27 14:12:14 +01:00
Russ Anderson
f9228b204f mm: zone_end_pfn is too small
Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output
below).

The key hint is "page 4294967296 outside zone".
4294967296 = 0x100000000 (bit 32 is set).

The problem is in include/linux/mmzone.h:

  530 static inline unsigned zone_end_pfn(const struct zone *zone)
  531 {
  532         return zone->zone_start_pfn + zone->spanned_pages;
  533 }

zone_end_pfn is "unsigned" (32 bits).  Changing it to "unsigned long"
(64 bits) fixes the problem.

zone_end_pfn() was added recently in commit 108bcc96ef ("mm: add & use
zone_end_pfn() and zone_spans_pfn()")

Output from the failure.

  No AGP bridge found
  page 4294967296 outside zone [ 4294967296 - 4327469056 ]
  ------------[ cut here ]------------
  kernel BUG at mm/page_alloc.c:552!
  invalid opcode: 0000 [#1] SMP
  Modules linked in:
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10
  RIP: free_one_page+0x382/0x430
  Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420)
  Call Trace:
    __free_pages_ok+0x96/0xb0
    __free_pages+0x25/0x50
    __free_pages_bootmem+0x8a/0x8c
    __free_memory_core+0xea/0x131
    free_low_memory_core_early+0x4a/0x98
    free_all_bootmem+0x45/0x47
    mem_init+0x7b/0x14c
    start_kernel+0x216/0x433
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x144/0x153
  Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49

Signed-off-by: Russ Anderson <rja@sgi.com>
Reported-by: George Beshers <gbeshers@sgi.com>
Acked-by: Hedi Berriche <hedi@sgi.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22 16:41:20 -07:00
Cody P Schafer
da3649e133 mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate.
Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
zone_*() functions.

Change node_end_pfn() to be a wrapper of pgdat_end_pfn().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Cody P Schafer
2a6e3ebee2 mm: add zone_is_empty() and zone_is_initialized()
Factoring out these 2 checks makes it more clear what we are actually
checking for.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Cody P Schafer
108bcc96ef mm: add & use zone_end_pfn() and zone_spans_pfn()
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.

This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.

Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the sycronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Peter Zijlstra
bbeae5b05e mm: move page flags layout to separate header
This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header.  This
patch is necessary because otherwise there would be a circular
dependency between mm_types.h and mm.h.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:17 -08:00
Minchan Kim
194159fbcc mm: remove MIGRATE_ISOLATE check in hotpath
Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
CMA, memory-hotplug and memory-failure) which are not common config
option.  So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:15 -08:00
Bartlomiej Zolnierkiewicz
a458431e17 mm: fix zone_watermark_ok_safe() accounting of isolated pages
Commit 702d1a6e07 ("memory-hotplug: fix kswapd looping forever
problem") added an isolated pageblocks counter (nr_pageblock_isolate in
struct zone) and used it to adjust free pages counter in
zone_watermark_ok_safe() to prevent kswapd looping forever problem.

Then later, commit 2139cbe627 ("cma: fix counting of isolated pages")
fixed accounting of isolated pages in global free pages counter.  It
made the previous zone_watermark_ok_safe() fix unnecessary and
potentially harmful (cause now isolated pages may be accounted twice
making free pages counter incorrect).

This patch removes the special isolated pageblocks counter altogether
which fixes zone_watermark_ok_safe() free pages check.

Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04 16:11:46 -08:00
Linus Torvalds
3d59eebc5e Automatic NUMA Balancing V11
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
 Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
 vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
 xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
 DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
 72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
 YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
 3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
 hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
 CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
 BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
 Ka0JKgnWvsa6ez6FSzKI
 =ivQa
 -----END PGP SIGNATURE-----

Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing, this tree
  (balancenuma), numacore which has been developed in tip/master and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three because
  its main impact is on the VM side with no attempt to be smart about
  scheduling.  In the interest of getting the ball rolling, it would be
  desirable to see this much merged for 3.8 with the view to building
  scheduler smarts on top and adapting the VM where required for 3.9.

  The most recent set of comparisons available from different people are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag.  In my own tests, balancenuma does
  reasonably well.  It's dumb as rocks and does not regress against
  mainline.  On the other hand, Ingo's tests shows that balancenuma is
  incapable of converging for this workloads driven by perf which is bad
  but is potentially explained by the lack of scheduler smarts.  Thomas'
  results show balancenuma improves on mainline but falls far short of
  numacore or autonuma.  Srikar's results indicate we all suffer on a
  large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses.  Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers which is not reported by
  the tool by default and sometimes missed in treports.  Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is.  Initially I thought it was in how numacore batch
  handles PTEs but I'm no longer think this is the case.  It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to start
  with this tree as it contains much of the code we can agree on and has
  not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
2012-12-16 15:18:08 -08:00
Jiang Liu
9feedc9d83 mm: introduce new field "managed_pages" to struct zone
Currently a zone's present_pages is calcuated as below, which is
inaccurate and may cause trouble to memory hotplug.

	spanned_pages - absent_pages - memmap_pages - dma_reserve.

During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused.  The field zone->present_pages may
have different meanings in different contexts:

1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
  http://lkml.org/lkml/2012/11/5/866
  https://patchwork.kernel.org/patch/1346751/

This patchset tries to introduce a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system".  And revert
zone->present_pages to count "physical pages existing in a zone", which
also keep in consistence with pgdat->node_present_pages.

We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.

For DMA/normal zones, the initial value is set to:

	(spanned_pages - absent_pages - memmap_pages - dma_reserve)

Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.

[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 17:38:34 -08:00
Marek Szyprowski
bc357f431c mm: cma: remove watermark hacks
Commits 2139cbe627 ("cma: fix counting of isolated pages") and
d95ea5d18e ("cma: fix watermark checking") introduced a reliable
method of free page accounting when memory is being allocated from CMA
regions, so the workaround introduced earlier by commit 49f223a9cd
("mm: trigger page reclaim in alloc_contig_range() to stabilise
watermarks") can be finally removed.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
Andrea Arcangeli
8177a420ed mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11 14:42:50 +00:00
Hugh Dickins
bea8c150a7 memcg: fix hotplugged memory zone oops
When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec.  (On
many configurations, calculating zone from page is slower.)

But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded.  Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
  IP:  __mod_zone_page_state+0x9/0x60
  Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
  Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
  Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
   ...

The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone.  The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.

Ah, there was one exceptionr.  For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too.  In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.

I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.

Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16 14:33:04 -08:00
Minchan Kim
e46a28790e CMA: migrate mlocked pages
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out.  Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running.  If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:23:00 +09:00
David Rientjes
957f822a0a mm, numa: reclaim from all nodes within reclaim distance
RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.

To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance.  This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.

What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
represents the set of nodes that are within this distance.  During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:56 +09:00
Mel Gorman
62997027ca mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning.  This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths.  Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever.  Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic.  There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan.  This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless.  For allocations that can fail such as THP,
they will simply fail.  For requests that cannot fail, they will retry the
allocation.  Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:51 +09:00
Mel Gorman
c89511ab2f mm: compaction: Restart compaction from near where it left off
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
 This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
 When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Mel Gorman
bb13ffeb9f mm: compaction: cache if a pageblock was scanned and no pages were isolated
When compaction was implemented it was known that scanning could
potentially be excessive.  The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line.  It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future.  When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive.  If the information is too stale then allocations will fail
that might have otherwise succeeded.  In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event.  Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure.  Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page.  The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Mel Gorman
753341a4b8 revert "mm: have order > 0 compaction start off where it left"
This reverts commit 7db8889ab0 ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages").  These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand.  A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Bartlomiej Zolnierkiewicz
d1ce749a0d cma: count free CMA pages
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok().  For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).

[akpm@linux-foundation.org: use conventional migratetype naming]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:44 +09:00
Mel Gorman
5515061d22 mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use.  If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother but
the system will keep running.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00
Minchan Kim
702d1a6e07 memory-hotplug: fix kswapd looping forever problem
When hotplug offlining happens on zone A, it starts to mark freed page as
MIGRATE_ISOLATE type in buddy for preventing further allocation.
(MIGRATE_ISOLATE is very irony type because it's apparently on buddy but
we can't allocate them).

When the memory shortage happens during hotplug offlining, current task
starts to reclaim, then wake up kswapd.  Kswapd checks watermark, then go
sleep because current zone_watermark_ok_safe doesn't consider
MIGRATE_ISOLATE freed page count.  Current task continue to reclaim in
direct reclaim path without kswapd's helping.  The problem is that
zone->all_unreclaimable is set by only kswapd so that current task would
be looping forever like below.

__alloc_pages_slowpath
restart:
	wake_all_kswapd
rebalance:
	__alloc_pages_direct_reclaim
		do_try_to_free_pages
			if global_reclaim && !all_unreclaimable
				return 1; /* It means we did did_some_progress */
	skip __alloc_pages_may_oom
	should_alloc_retry
		goto rebalance;

If we apply KOSAKI's patch[1] which doesn't depends on kswapd about
setting zone->all_unreclaimable, we can solve this problem by killing some
task in direct reclaim path.  But it doesn't wake up kswapd, still.  It
could be a problem still if other subsystem needs GFP_ATOMIC request.  So
kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE
going sleep.

This patch counts the number of MIGRATE_ISOLATE page block and
zone_watermark_ok_safe will consider it if the system has such blocks
(fortunately, it's very rare so no problem in POV overhead and kswapd is
never hotpath).

Copy/modify from Mel's quote
"
Ideal solution would be "allocating" the pageblock.
It would keep the free space accounting as it is but historically,
memory hotplug didn't allocate pages because it would be difficult to
detect if a pageblock was isolated or if part of some balloon.
Allocating just full pageblocks would work around this, However,
it would play very badly with CMA.
"

[1] http://lkml.org/lkml/2012/6/14/74

[akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
[akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:45 -07:00
Jiang Liu
9adb62a5df mm/hotplug: correctly setup fallback zonelists when creating new pgdat
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node.  There's code to try
to achieve that in hotadd_new_pgdat() as below:

	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);

But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet.  And build_all_zonelists() only builds zonelists for
online nodes as:

        for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);

                build_zonelists(pgdat);
                build_zonelist_cache(pgdat);
        }

Though we hope to create zonelist for the new pgdat, but it doesn't.  So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:44 -07:00
Rabin Vincent
fe03025db3 mm: CONFIG_HAVE_MEMBLOCK_NODE -> CONFIG_HAVE_MEMBLOCK_NODE_MAP
0ee332c145 ("memblock: Kill early_node_map[]") wanted to replace
CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but
ended up replacing one occurence with a reference to the non-existent
symbol CONFIG_HAVE_MEMBLOCK_NODE.

The resulting omission of code would probably have been causing problems
to 32-bit machines with memory hotplug.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Rik van Riel
7db8889ab0 mm: have order > 0 compaction start off where it left
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction.  This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Wanpeng Li
ca28ddc908 mm: remove unused LRU_ALL_EVICTABLE
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Andrew Morton
c255a45805 memcg: rename config variables
Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Linus Torvalds
d14b7a419a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Trivial updates all over the place as usual."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
  Fix typo in include/linux/clk.h .
  pci: hotplug: Fix typo in pci
  iommu: Fix typo in iommu
  video: Fix typo in drivers/video
  Documentation: Add newline at end-of-file to files lacking one
  arm,unicore32: Remove obsolete "select MISC_DEVICES"
  module.c: spelling s/postition/position/g
  cpufreq: Fix typo in cpufreq driver
  trivial: typo in comment in mksysmap
  mach-omap2: Fix typo in debug message and comment
  scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
  Change email address for Steve Glendinning
  Btrfs: fix typo in convert_extent_bit
  via: Remove bogus if check
  netprio_cgroup.c: fix comment typo
  backlight: fix memory leak on obscure error path
  Documentation: asus-laptop.txt references an obsolete Kconfig item
  Documentation: ManagementStyle: fixed typo
  mm/vmscan: cleanup comment error in balance_pgdat
  mm: cleanup on the comments of zone_reclaim_stat
  ...
2012-07-24 13:34:56 -07:00
Jiang Liu
d8adde17e5 memory hotplug: fix invalid memory access caused by stale kswapd pointer
kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.

An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP <ffff8806044f1d78>
  CR2: 0000000000000000

[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:41 -07:00
Jiri Kosina
59f91e5dd0 Merge branch 'master' into for-next
Conflicts:
	include/linux/mmzone.h

Synced with Linus' tree so that trivial patch can be applied
on top of up-to-date code properly.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
2012-06-29 14:45:58 +02:00
Wanpeng Li
46028e6d10 mm: cleanup on the comments of zone_reclaim_stat
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-06-28 11:56:12 +02:00
Konstantin Khlebnikov
7f5e86c2cc mm: add link from struct lruvec to struct zone
This is the first stage of struct mem_cgroup_zone removal.  Further
patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:26 -07:00
Hugh Dickins
89abfab133 mm/memcg: move reclaim_stat into lruvec
With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone when
memcg is disabled but per-memcg per-zone when it's enabled.

We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats, the one
which get_scan_count() will actually use.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:25 -07:00
Konstantin Khlebnikov
f3fd4a6192 mm: remove lru type checks from __isolate_lru_page()
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
completely remove anon/file and active/inactive lru type filters from
__isolate_lru_page(), because isolation for 0-order reclaim always
isolates pages from right lru list.  And pages-isolation for lumpy
shrink_inactive_list() or memory-compaction anyway allowed to isolate
pages from all evictable lru lists.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:25 -07:00
Linus Torvalds
d484864dd9 Merge branch 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
 "These patches contain two major updates for DMA mapping subsystem
  (mainly for ARM architecture).  First one is Contiguous Memory
  Allocator (CMA) which makes it possible for device drivers to allocate
  big contiguous chunks of memory after the system has booted.

  The main difference from the similar frameworks is the fact that CMA
  allows to transparently reuse the memory region reserved for the big
  chunk allocation as a system memory, so no memory is wasted when no
  big chunk is allocated.  Once the alloc request is issued, the
  framework migrates system pages to create space for the required big
  chunk of physically contiguous memory.

  For more information one can refer to nice LWN articles:

   - 'A reworked contiguous memory allocator':
		http://lwn.net/Articles/447405/

   - 'CMA and ARM':
		http://lwn.net/Articles/450286/

   - 'A deep dive into CMA':
		http://lwn.net/Articles/486301/

   - and the following thread with the patches and links to all previous
     versions:
		https://lkml.org/lkml/2012/4/3/204

  The main client for this new framework is ARM DMA-mapping subsystem.

  The second part provides a complete redesign in ARM DMA-mapping
  subsystem.  The core implementation has been changed to use common
  struct dma_map_ops based infrastructure with the recent updates for
  new dma attributes merged in v3.4-rc2.  This allows to use more than
  one implementation of dma-mapping calls and change/select them on the
  struct device basis.  The first client of this new infractructure is
  dmabounce implementation which has been completely cut out of the
  core, common code.

  The last patch of this redesign update introduces a new, experimental
  implementation of dma-mapping calls on top of generic IOMMU framework.
  This lets ARM sub-platform to transparently use IOMMU for DMA-mapping
  calls if one provides required IOMMU hardware.

  For more information please refer to the following thread:
		http://www.spinics.net/lists/arm-kernel/msg175729.html

  The last patch merges changes from both updates and provides a
  resolution for the conflicts which cannot be avoided when patches have
  been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

Acked by Andrew Morton <akpm@linux-foundation.org>:
 "Yup, this one please.  It's had much work, plenty of review and I
  think even Russell is happy with it."

* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
  ARM: dma-mapping: use PMD size for section unmap
  cma: fix migration mode
  ARM: integrate CMA with DMA-mapping subsystem
  X86: integrate CMA with DMA-mapping subsystem
  drivers: add Contiguous Memory Allocator
  mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
  mm: extract reclaim code from __alloc_pages_direct_reclaim()
  mm: Serialize access to min_free_kbytes
  mm: page_isolation: MIGRATE_CMA isolation functions added
  mm: mmzone: MIGRATE_CMA migration type added
  mm: page_alloc: change fallbacks array handling
  mm: page_alloc: introduce alloc_contig_range()
  mm: compaction: export some of the functions
  mm: compaction: introduce isolate_freepages_range()
  mm: compaction: introduce map_pages()
  mm: compaction: introduce isolate_migratepages_range()
  mm: page_alloc: remove trailing whitespace
  ARM: dma-mapping: add support for IOMMU mapper
  ARM: dma-mapping: use alloc, mmap, free from dma_ops
  ARM: dma-mapping: remove redundant code and do the cleanup
  ...

Conflicts:
	arch/x86/include/asm/dma-mapping.h
2012-05-25 09:18:59 -07:00
Marek Szyprowski
49f223a9cd mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
alloc_contig_range() performs memory allocation so it also should keep
track on keeping the correct level of memory watermarks. This commit adds
a call to *_slowpath style reclaim to grab enough pages to make sure that
the final collection of contiguous pages from freelists will not starve
the system.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
2012-05-21 15:09:36 +02:00
Michal Nazarewicz
47118af076 mm: mmzone: MIGRATE_CMA migration type added
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.

This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).

It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory.  Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.

To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
2012-05-21 15:09:32 +02:00
Wang YanQing
35fca53e15 mmzone: fix comment typo coelesce -> coalesce
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-04-15 16:57:23 +02:00
Rik van Riel
aff622495c vmscan: only defer compaction for failed order and higher
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone.  Even if
we only need compaction for an order 2 allocation, eg.  for jumbo frames
networking.

The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:56 -07:00
Hugh Dickins
4111304dab mm: enum lru_list lru
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:10 -08:00
Mel Gorman
c824493528 mm: compaction: make isolate_lru_page() filter-aware again
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
is was meaningless to pick the page and re-add it to the LRU list.  This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:09 -08:00
Johannes Weiner
6290df5458 mm: collect LRU list heads into struct lruvec
Having a unified structure with a LRU list set for both global zones and
per-memcg zones allows to keep that code simple which deals with LRU
lists and does not care about the container itself.

Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:05 -08:00
Johannes Weiner
ab8fabd46f mm: exclude reserved pages from dirtyable memory
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.

This patch:

The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:

       +----------+ ---
       |   anon   |  |
       +----------+  |
       |          |  |
       |          |  -- dirty limit new    -- flusher new
       |   file   |  |                     |
       |          |  |                     |
       |          |  -- dirty limit old    -- flusher old
       |          |                        |
       +----------+                       --- reclaim
       | reserved |
       +----------+
       |  kernel  |
       +----------+

This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages.  The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free.  We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix.  In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.

[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:43 -08:00