kernel-ark/mm
Michel Lespinasse fed067da46 mlock: only hold mmap_sem in shared mode when faulting in pages
Currently mlock() holds mmap_sem in exclusive mode while the pages get
faulted in.  In the case of a large mlock, this can potentially take a
very long time, during which various commands such as 'ps auxw' will
block.  This makes sysadmins unhappy:

real    14m36.232s
user    0m0.003s
sys     0m0.015s
(output from 'time ps auxw' while a 20GB file was being mlocked without
being previously preloaded into page cache)

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs.  Then a second pass could be done
to actually mlock the pages, in small batches, releasing mmap_sem when we
block on disk access or when we detect some contention.

This patch:

Before this change, mlock() holds mmap_sem in exclusive mode while the
pages get faulted in.  In the case of a large mlock, this can potentially
take a very long time.  Various things will block while mmap_sem is held,
including 'ps auxw'.  This can make sysadmins angry.

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs.  Then a second pass could be done
to actually mlock the pages with mmap_sem held for reads only.  We need to
recheck the vma flags after we re-acquire mmap_sem, but this is easy.

In the case where a vma has been munlocked before mlock completes, pages
that were already marked as PageMlocked() are handled by the munlock()
call, and mlock() is careful to not mark new page batches as PageMlocked()
after the munlock() call has cleared the VM_LOCKED vma flags.  So, the end
result will be identical to what'd happen if munlock() had executed after
the mlock() call.

In a later change, I will allow the second pass to release mmap_sem when
blocking on disk accesses or when it is otherwise contended, so that it
won't be held for long periods of time even in shared mode.

Signed-off-by: Michel Lespinasse <walken@google.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:35 -08:00
..
backing-dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-10-26 17:58:44 -07:00
bootmem.c x86, memblock: Replace e820_/_early string with memblock_ 2010-08-27 11:13:47 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
compaction.c mm: compaction: perform a faster migration scan when migrating asynchronously 2011-01-13 17:32:34 -08:00
debug-pagealloc.c
dmapool.c mm: add a might_sleep_if() to dma_pool_alloc() 2010-10-26 16:52:08 -07:00
fadvise.c
failslab.c
filemap_xip.c
filemap.c mm: clear PageError bit in msync & fsync 2011-01-13 17:32:35 -08:00
fremap.c Avoid pgoff overflow in remap_file_pages 2010-09-25 09:34:58 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
hugetlb.c mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() 2010-12-02 14:51:14 -08:00
hwpoison-inject.c HWPOISON, hugetlb: support hwpoison injection for hugepage 2010-08-11 09:23:11 +02:00
init-mm.c mm: provide init_mm mm_context initializer 2010-08-09 20:44:54 -07:00
internal.h mm: fix is_mem_section_removable() page_order BUG_ON check 2010-10-26 16:52:11 -07:00
Kconfig Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2010-10-22 17:31:36 -07:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: Fix typo in the comment 2010-08-08 21:57:23 +01:00
ksm.c ksm: annotate ksm_thread_mutex is no deadlock source 2010-12-02 14:51:15 -08:00
maccess.c MN10300: Save frame pointer in thread_info struct rather than global var 2010-10-27 17:29:01 +01:00
madvise.c
Makefile percpu: use percpu allocator on UP too 2010-10-02 10:26:05 +03:00
memblock.c memblock: Annotate memblock functions with __init_memblock 2010-10-11 16:00:52 -07:00
memcontrol.c memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check 2010-12-30 10:07:06 -08:00
memory_hotplug.c mm: migration: cleanup migrate_pages API by matching types for offlining and sync 2011-01-13 17:32:34 -08:00
memory-failure.c mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path 2011-01-13 17:32:34 -08:00
memory.c mlock: avoid dirtying pages and triggering writeback 2011-01-13 17:32:35 -08:00
mempolicy.c mm: migration: cleanup migrate_pages API by matching types for offlining and sync 2011-01-13 17:32:34 -08:00
mempool.c
migrate.c mm: migration: cleanup migrate_pages API by matching types for offlining and sync 2011-01-13 17:32:34 -08:00
mincore.c
mlock.c mlock: only hold mmap_sem in shared mode when faulting in pages 2011-01-13 17:32:35 -08:00
mm_init.c
mmap.c install_special_mapping skips security_file_mmap check. 2010-12-15 12:30:36 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c perf_events: Fix perf_counter_mmap() hook in mprotect() 2010-11-09 10:19:38 -08:00
mremap.c mm: remove pte_*map_nested() 2010-10-26 16:52:08 -07:00
msync.c
nommu.c nommu: Provide stubbed alloc/free_vm_area() implementation. 2010-12-24 12:08:30 +09:00
oom_kill.c oom: kill all threads sharing oom killed task's mm 2010-10-26 16:52:05 -07:00
page_alloc.c mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path 2011-01-13 17:32:34 -08:00
page_cgroup.c kmemleak: Annotate false positive in init_section_page_cgroup() 2010-07-19 11:54:14 +01:00
page_io.c block: unify flags for struct bio and struct request 2010-08-07 18:20:39 +02:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
page-writeback.c mm/page-writeback.c: fix __set_page_dirty_no_writeback() return value 2011-01-13 17:32:32 -08:00
pagewalk.c mm: remove call to find_vma in pagewalk for non-hugetlbfs 2010-11-25 06:50:46 +09:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: remove gfp mask from pcpu_get_vm_areas 2011-01-13 17:32:34 -08:00
percpu.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
prio_tree.c
quicklist.c
readahead.c
rmap.c mm/rmap.c: fix comment 2010-12-27 15:14:06 +01:00
shmem.c fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
slab.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 2011-01-10 08:38:01 -08:00
slob.c kernel: kmem_ptr_validate considered harmful 2011-01-07 17:50:16 +11:00
slub.c mm: convert sprintf_symbol to %pS 2011-01-13 17:32:33 -08:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c
swap_state.c
swap.c fuse: use release_pages() 2010-10-27 18:03:17 -07:00
swapfile.c block: make blkdev_get/put() handle exclusive access 2010-11-13 11:55:17 +01:00
thrash.c
truncate.c Call the filesystem back whenever a page is removed from the page cache 2010-12-02 09:55:21 -05:00
util.c kernel: kmem_ptr_validate considered harmful 2011-01-07 17:50:16 +11:00
vmalloc.c mm: unify module_alloc code for vmalloc 2011-01-13 17:32:34 -08:00
vmscan.c mm: vmscan: rename lumpy_mode to reclaim_mode 2011-01-13 17:32:34 -08:00
vmstat.c mm: vmstat: use a single setter function and callback for adjusting percpu thresholds 2011-01-13 17:32:31 -08:00