kernel/mm-do-not-stall-in-synchronous-compaction-for-THP-allocations.patch

https://lkml.org/lkml/2011/11/10/173

Date	Thu, 10 Nov 2011 10:06:16 +0000
From	Mel Gorman <>
Subject	[PATCH] mm: Do not stall in synchronous compaction for THP allocations
	

Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and cannot be reliably reproduced locally, but:

Andy Isaacson reported a problem copying to VFAT on SD Card
	https://lkml.org/lkml/2011/11/7/2

	In this case, it was stuck in munmap for between 20 and 60
	seconds in compaction. It is also possible that khugepaged
	was holding mmap_sem on this process if CONFIG_NUMA was set.

Johannes Weiner reported stalls on USB
	https://lkml.org/lkml/2011/7/25/378

	In this case, there is no stack trace but it looks like the
	same problem. The USB stick may have been using NTFS as a
	filesystem based on other work done related to writing back
	to USB around the same time.

Internally in SUSE, I received a bug report related to stalls in firefox
	when using Java and Flash heavily while copying from NFS
	to VFAT on USB. It has not been confirmed to be the same problem
	but if it looks like a duck and quacks like a duck.....

In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
it was beneficial.

While user-visible stalls do not happen for me when writing to USB,
I set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events", which revealed:

Time stalled in this event:    47369 ms
Event count:                      20
usemem               sleep_on_page          3690 ms
usemem               sleep_on_page          2148 ms
usemem               sleep_on_page          1534 ms
usemem               sleep_on_page          1518 ms
usemem               sleep_on_page          1225 ms
usemem               sleep_on_page          2205 ms
usemem               sleep_on_page          2399 ms
usemem               sleep_on_page          2398 ms
usemem               sleep_on_page          3760 ms
usemem               sleep_on_page          1861 ms
usemem               sleep_on_page          2948 ms
usemem               sleep_on_page          1515 ms
usemem               sleep_on_page          1386 ms
usemem               sleep_on_page          1882 ms
usemem               sleep_on_page          1850 ms
usemem               sleep_on_page          3715 ms
usemem               sleep_on_page          3716 ms
usemem               sleep_on_page          4846 ms
usemem               sleep_on_page          1306 ms
usemem               sleep_on_page          1467 ms

[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30

The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, the stalls are severe.
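
As a cross-check on the figures in the report above, the following is a
minimal sketch in C that recomputes the event count and total from the
twenty listed sleep_on_page durations. The names stall_ms, stall_summary
and summarize_stalls are hypothetical helpers introduced for this
illustration; they are not part of the patch or of the tool that produced
the report.

```c
#include <assert.h>
#include <stddef.h>

/* Stall durations in ms, copied from the usemem/sleep_on_page report. */
static const unsigned int stall_ms[] = {
	3690, 2148, 1534, 1518, 1225, 2205, 2399, 2398, 3760, 1861,
	2948, 1515, 1386, 1882, 1850, 3715, 3716, 4846, 1306, 1467,
};
#define STALL_COUNT (sizeof(stall_ms) / sizeof(stall_ms[0]))

/* Hypothetical summary record: number of events, total and longest stall. */
struct stall_summary {
	size_t count;
	unsigned long total_ms;
	unsigned int max_ms;
};

/* Sum the durations and track the longest single stall. */
static struct stall_summary summarize_stalls(const unsigned int *ms, size_t n)
{
	struct stall_summary s = { n, 0, 0 };

	assert(ms != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		s.total_ms += ms[i];
		if (ms[i] > s.max_ms)
			s.max_ms = ms[i];
	}
	return s;
}
```

Calling summarize_stalls(stall_ms, STALL_COUNT) yields 20 events totalling
47369 ms, which agrees with the "Event count" and "Time stalled in this
event" lines, with a longest single stall of 4846 ms.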

This patch once again prevents sync migration for transparent
hugepage allocations because failing a THP allocation is preferable
to stalling. It was suggested that __GFP_NORETRY be used instead of
__GFP_NO_KSWAPD. This would look less like a special case but would
still cause compaction to run at least once in sync mode.
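
The decision this patch changes can be modelled in userspace as a small
sketch. The SKETCH_GFP_* bit values and the two function names below are
assumptions made for illustration only; the real flag definitions live in
the kernel headers and have different values. The sketch relies on the
patch's own premise that __GFP_NO_KSWAPD marks a THP allocation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define SKETCH_GFP_WAIT      0x1u
#define SKETCH_GFP_NO_KSWAPD 0x2u

static_assert((SKETCH_GFP_WAIT & SKETCH_GFP_NO_KSWAPD) == 0,
	      "sketch flags must occupy distinct bits");

/* Before the patch: the retry after async compaction always went sync. */
static bool sync_migration_before_patch(unsigned int gfp_mask)
{
	(void)gfp_mask;
	return true;
}

/*
 * After the patch: the retry stays async when the caller set the
 * NO_KSWAPD flag, so such an allocation fails rather than waiting on
 * page writeback during migration.
 */
static bool sync_migration_after_patch(unsigned int gfp_mask)
{
	return !(gfp_mask & SKETCH_GFP_NO_KSWAPD);
}
```

With a mask that includes SKETCH_GFP_NO_KSWAPD the patched helper returns
false, while a mask without it still returns true, so only allocations
carrying that flag lose the sync retry.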

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..84bf962 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2168,7 +2168,13 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration for transparent hugepage allocations as
+	 * it could stall writing back pages which is far worse than simply
+	 * failing to promote a page.
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
mm: Do not stall in synchronous compaction for THP allocations 2011-11-15 22:12:22 +00:00