kernel-ark/mm
Vlastimil Babka 01cc2e5869 mm: munlock: fix potential race with THP page split
Since commit ff6a6da60b ("mm: accelerate munlock() treatment of THP
pages") munlock skips tail pages of a munlocked THP page.  There is some
attempt to prevent bad consequences of racing with a THP page split, but
code inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.

First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages.  Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part
of tail pages left with PageMlocked flag.  As the head page still
appears to be a THP page until all tail pages are processed,
munlock_vma_page() might think it munlocked the whole THP page and skip
all the former tail pages.  Before ff6a6da60, those pages would be
cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
would still become undercounted (related the next point).

Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after
the PageMlocked is cleared.  The accounting might also become
inconsistent due to race with __split_huge_page_refcount()

- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
  left with PageMlocked set and counted again (only possible before
  ff6a6da60)

- overcount when hpage_nr_pages() sees a normal page (split has already
  finished), but the parallel split has meanwhile cleared PageMlocked from
  additional tail pages

This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page().  This is convenient because:

- __split_huge_page_refcount() takes lru_lock for its whole operation

- munlock_vma_page() typically takes lru_lock anyway for page isolation

As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.

[akpm@linux-foundation.org: avoid a coding-style ugly]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:50 -08:00
..
backing-dev.c
balloon_compaction.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
bootmem.c mm/bootmem.c: remove unused local `map' 2013-11-13 12:09:09 +09:00
bounce.c
cleancache.c
compaction.c mm: compaction: reset scanner positions immediately when they meet 2014-01-21 16:19:49 -08:00
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap_xip.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
filemap.c mm: drop actor argument of do_generic_file_read() 2013-11-15 09:32:13 +09:00
fremap.c mm: fix use-after-free in sys_remap_file_pages 2014-01-02 14:40:30 -08:00
frontswap.c
highmem.c
huge_memory.c thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only 2014-01-12 16:47:15 +07:00
hugetlb_cgroup.c hugetlb_cgroup: convert away from cftype->read() 2013-12-05 12:28:03 -05:00
hugetlb.c mm/hugetlb.c: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
hwpoison-inject.c mm/hwpoison: add '#' to hwpoison_inject 2014-01-21 16:19:48 -08:00
init-mm.c
internal.h mm: thp: __get_page_tail_foll() can use get_huge_page_tail() 2014-01-21 16:19:43 -08:00
interval_tree.c
Kconfig mm: add missing dependency in Kconfig 2013-12-18 19:04:52 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c mm: kmemleak: avoid false negatives on vmalloc'ed objects 2013-11-13 12:09:07 +09:00
ksm.c mm/rmap: use rmap_walk() in page_referenced() 2014-01-21 16:19:45 -08:00
list_lru.c mm: list_lru: fix almost infinite loop causing effective livelock 2013-10-30 12:57:46 -07:00
maccess.c
madvise.c mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood 2013-09-30 14:31:02 -07:00
Makefile
memblock.c mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter 2014-01-21 16:19:48 -08:00
memcontrol.c Merge branch 'akpm' (incoming from Andrew) 2014-01-21 19:05:45 -08:00
memory_hotplug.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
memory-failure.c mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages 2014-01-21 16:19:49 -08:00
memory.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
mempolicy.c mm/mempolicy: fix !vma in new_vma_page() 2013-12-18 19:04:52 -08:00
mempool.c
migrate.c mm/migrate: remove unused function, fail_migrate_page() 2014-01-21 16:19:49 -08:00
mincore.c
mlock.c mm: munlock: fix potential race with THP page split 2014-01-23 16:36:50 -08:00
mm_init.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mmap.c mm/mmap.c: add mlock_future_check() helper 2014-01-21 16:19:44 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mprotect.c mm: numa: do not automatically migrate KSM pages 2014-01-21 16:19:48 -08:00
mremap.c mm: revert mremap pud_free anti-fix 2013-10-16 21:35:53 -07:00
msync.c
nobootmem.c mm/memblock: switch to use NUMA_NO_NODE instead of MAX_NUMNODES 2014-01-21 16:19:46 -08:00
nommu.c mm: add overcommit_kbytes sysctl variable 2014-01-21 16:19:44 -08:00
oom_kill.c oom_kill: add rcu_read_lock() into find_lock_task_mm() 2014-01-21 16:19:46 -08:00
page_alloc.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
page_cgroup.c Merge branch 'akpm' (incoming from Andrew) 2014-01-21 19:05:45 -08:00
page_io.c
page_isolation.c
page-writeback.c writeback: fix negative bdi max pause 2013-10-16 21:35:53 -07:00
pagewalk.c mm/pagewalk.c: fix walk_page_range() access of wrong PTEs 2013-10-30 14:27:03 -07:00
percpu-km.c
percpu-vm.c
percpu.c Merge branch 'akpm' (incoming from Andrew) 2014-01-21 19:05:45 -08:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2013-12-18 19:04:51 -08:00
process_vm_access.c
quicklist.c
readahead.c readahead: fix sequential read cache miss detection 2013-11-13 12:09:09 +09:00
rmap.c mm/rmap: use rmap_walk() in page_mkclean() 2014-01-21 16:19:46 -08:00
shmem.c security: shmem: implement kernel private shmem inodes 2013-12-02 11:24:19 +00:00
slab_common.c memcg, kmem: rename cache_from_memcg to cache_from_memcg_idx 2013-11-13 12:09:10 +09:00
slab.c Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2013-11-22 08:10:34 -08:00
slab.h memcg, kmem: rename cache_from_memcg to cache_from_memcg_idx 2013-11-13 12:09:10 +09:00
slob.c
slub.c Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2013-11-22 08:10:34 -08:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
swap_state.c
swap.c mm/swap.c: reorganize put_compound_page() 2014-01-21 16:19:43 -08:00
swapfile.c frontswap: enable call to invalidate area on swapoff 2013-11-13 12:09:07 +09:00
truncate.c
util.c mm: add overcommit_kbytes sysctl variable 2014-01-21 16:19:44 -08:00
vmalloc.c mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page} 2014-01-21 16:19:44 -08:00
vmpressure.c memcg: make cgroup_event deal with mem_cgroup instead of cgroup_subsys_state 2013-11-22 18:20:43 -05:00
vmscan.c mm/vmscan.c: don't forget to free shrinker->nr_deferred 2013-10-16 21:35:52 -07:00
vmstat.c mm: numa: return the number of base pages altered by protection changes 2013-11-13 12:09:11 +09:00
zbud.c
zswap.c mm/zswap.c: change params from hidden to ro 2014-01-23 16:36:50 -08:00