kernel-ark/mm
Eric B Munson 1aab92ec3d mm: mlock: refactor mlock, munlock, and munlockall code
mlock() allows a user to keep program memory from being paged out, but this
comes at the cost of faulting in the entire mapping when it is locked.  For
large mappings where the entire area is not needed, this is not ideal.
Instead of forcing all locked pages to be present as soon as they are
locked, this set creates a middle ground.  Pages are marked to be placed on
the unevictable LRU (locked) when they are first used, but they are not
faulted in by the mlock call.

This series introduces a new mlock() system call that takes a flags
argument along with the start address and size.  This flags argument gives
the caller the ability to request memory be locked in the traditional way,
or to be locked only after the page is faulted in.  A new MCL flag is added
so that mlockall() can mirror the lock-on-fault behavior of mlock().
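
To make the interface concrete, below is a minimal userspace sketch of how
such a lock-on-fault request might look.  It assumes the new flags-taking
call is exposed as mlock2() and that the MLOCK_ONFAULT flag exists; since
libc will have no wrapper yet it goes through syscall(), and the flag value
used here is a placeholder that must be checked against the uapi headers.

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x01	/* assumed value; use the uapi definition */
#endif

/* Hypothetical wrapper for the flags-taking lock call from this series. */
static int mlock_onfault(void *addr, size_t len)
{
#ifdef __NR_mlock2
	return syscall(__NR_mlock2, addr, len, MLOCK_ONFAULT);
#else
	errno = ENOSYS;
	return -1;
#endif
}

int main(void)
{
	size_t len = 1UL << 30;	/* large buffer, mostly unused */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED || mlock_onfault(buf, len)) {
		perror("lock-on-fault setup");
		return 1;
	}

	/* Only the pages touched below are faulted in and locked. */
	buf[0] = 1;
	buf[123 * 4096] = 1;
	return 0;
}

The mlockall() side would presumably be analogous, e.g.
mlockall(MCL_CURRENT | MCL_ONFAULT), rather than a per-range call.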

There are two main use cases that this set covers.  The first is the
security-focused mlock case.  A buffer is needed that cannot be written
to swap.  The maximum size is known, but on average the memory used is
significantly less than this maximum.  With lock on fault, the buffer is
guaranteed never to be paged out, without having to consume the maximum
size every time such a buffer is created.

The second use case is focused on performance.  Portions of a large file
are needed, and we want to keep the used portions in memory once accessed.
This is the case for large graphical models where the path through the
graph is not known until run time.  The entire graph is unlikely to be
used in a given invocation, but once a node has been used it needs to stay
resident for further processing.  Given these constraints we have a number
of options.  We can potentially waste a large amount of memory by mlocking
the entire region (this can also cause a significant stall at startup as
the entire file is read in).  We can mlock each page as we access it,
without tracking whether the page is already resident, but this introduces
a large overhead for each access.  The third option is mapping the entire
region
with PROT_NONE and using a signal handler for SIGSEGV to
mprotect(PROT_READ) and mlock() the needed page.  Doing this one page at
a time adds a significant performance penalty.  Batching can be used to
mitigate this overhead, but in order to safely avoid trying to mprotect
pages outside of the mapping, the boundaries of each mapping to be used in
this way must be tracked and available to the signal handler.  This is
precisely what the mm system in the kernel should already be doing.

For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) had been used, that is, when the
VMA is created, not when the pages are faulted in.  For mlockall(MCL_ONFAULT)
the user is charged as if MCL_FUTURE was used.  This decision was made to
keep the accounting checks out of the page fault path.
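
Continuing the userspace sketch above (mlock_onfault(), buf, and len are
the hypothetical names from that example), the practical effect is that an
over-limit request fails at lock time, typically with ENOMEM, even though
nothing has been faulted in yet:

	struct rlimit rl;

	/* Needs <sys/resource.h>.  The whole length is charged against
	 * RLIMIT_MEMLOCK up front, so this check and the lock call agree
	 * (barring CAP_IPC_LOCK, which bypasses the limit). */
	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 && len > rl.rlim_cur)
		fprintf(stderr, "range exceeds RLIMIT_MEMLOCK (%llu bytes)\n",
			(unsigned long long)rl.rlim_cur);

	/* Fails here, at lock time, rather than later in the fault path. */
	if (mlock_onfault(buf, len) != 0)
		perror("mlock_onfault");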

To illustrate the benefit of this set I wrote a test program that mmaps a
5 GB file filled with random data and then makes 15,000,000 accesses to
random addresses in that mapping.  The test program was run 20 times for
each setup.  Results are reported for two program phases, setup and
processing.  The setup phase calls mmap and, optionally, mlock on the
entire region.  For most experiments this is trivial, but it highlights
the cost of faulting in the entire region.  Results are averages across
the 20 runs in milliseconds.

mmap with mlock(MLOCK_LOCKED) on entire range:
Setup avg:      8228.666
Processing avg: 8274.257

mmap with mlock(MLOCK_LOCKED) before each access:
Setup avg:      0.113
Processing avg: 90993.552

mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value of max_map_count, this gets ENOMEM as I attempt
to change the permissions; after raising the sysctl significantly I get:
Setup avg:      0.058
Processing avg: 69488.073

mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg:      0.068
Processing avg: 38204.116

mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg:      0.044
Processing avg: 29671.180

mmap with mlock(MLOCK_ONFAULT) on entire range:
Setup avg:      0.189
Processing avg: 17904.899

The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping.  The first
step covers the page that caused the fault as we know that it will be
possible to lock.  The second step speculatively tries to mlock and
mprotect the batch size - 1 pages that follow.  There may be a clever way
to avoid this without having the program track each mapping to be covered
by this handler in a globally accessible structure, but I could not find
it.  It should be noted that with a large enough batch size this two-step
fault handler can still cause the program to crash if it reaches far
beyond the end of the mapping.
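
For reference, the following is a minimal sketch of that two-step handler
under the constraints described here (fixed batch size, no knowledge of the
mapping's bounds).  mprotect() and mlock() are not async-signal-safe, so
this illustrates the scheme that was measured rather than production
code; the mapping itself would be created with mmap(..., PROT_NONE, ...)
before the handler is installed.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BATCH 16			/* pages handled per fault */

static long page_size;

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
	uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);

	(void)sig;
	(void)ctx;

	/* Step 1: the faulting page itself must be made accessible and
	 * locked, otherwise we would fault forever. */
	if (mprotect((void *)page, page_size, PROT_READ) ||
	    mlock((void *)page, page_size))
		_exit(1);

	/* Step 2: speculatively cover the next BATCH - 1 pages.  A failure
	 * here (for example, past the end of the mapping) is ignored; those
	 * pages simply fault again later if they are ever touched. */
	if (BATCH > 1 &&
	    mprotect((void *)(page + page_size),
		     (size_t)(BATCH - 1) * page_size, PROT_READ) == 0)
		mlock((void *)(page + page_size),
		      (size_t)(BATCH - 1) * page_size);
}

static void install_fault_handler(void)
{
	struct sigaction sa;

	page_size = sysconf(_SC_PAGESIZE);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = fault_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);
}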

These results show that if the developer knows that a majority of the
mapping will be used, it is better to fault it all in at once; otherwise
mlock(MLOCK_ONFAULT) is significantly faster.

The performance cost of these patches is minimal on the two benchmarks I
have tested (stream and kernbench).  The following are the average values
across 20 runs of stream and 10 runs of kernbench after a warmup run whose
results were discarded.

Avg throughput in MB/s from stream using 1000000 element arrays
Test     4.2-rc1      4.2-rc1+lock-on-fault
Copy:    10,566.5     10,421
Scale:   10,685       10,503.5
Add:     12,044.1     11,814.2
Triad:   12,064.8     11,846.3

Kernbench optimal load
                 4.2-rc1  4.2-rc1+lock-on-fault
Elapsed Time     78.453   78.991
User Time        64.2395  65.2355
System Time      9.7335   9.7085
Context Switches 22211.5  22412.1
Sleeps           14965.3  14956.1

This patch (of 6):

Extending the mlock system call is very difficult because it currently
does not take a flags argument.  A later patch in this set will extend
mlock to support a middle ground between pages that are locked and faulted
in immediately and unlocked pages.  To pave the way for the new system
call, the code needs some reorganization so that each entry point does
nothing more than check its input and translate it into VMA flags.
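
As a rough sketch of the shape that reorganization aims for (do_mlock()
and apply_vma_lock_flags() are illustrative helper names, not necessarily
the identifiers the patch introduces), each entry point reduces to
validating its arguments and translating them into VMA flags before
calling shared code:

/* Hypothetical helpers: do_mlock() charges RLIMIT_MEMLOCK and applies the
 * requested vm_flags; apply_vma_lock_flags() walks the VMAs in the range
 * and updates their flags. */
SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
{
	/* Plain mlock() is simply "set VM_LOCKED over the range". */
	return do_mlock(start, len, VM_LOCKED);
}

SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
	int ret;

	len = PAGE_ALIGN(len + (offset_in_page(start)));
	start &= PAGE_MASK;

	down_write(&current->mm->mmap_sem);
	/* Clearing the lock bits over the range unlocks it. */
	ret = apply_vma_lock_flags(start, len, 0);
	up_write(&current->mm->mmap_sem);

	return ret;
}

With the entry points in this form, the later patch that adds a flags
argument only needs to map the new userspace flag onto an additional VMA
flag and reuse the same helpers.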

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
kasan kasan: always taint kernel on report 2015-11-05 19:34:48 -08:00
backing-dev.c writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() 2015-10-21 08:17:29 -06:00
balloon_compaction.c mm: page migration trylock newpage at same level as oldpage 2015-11-05 19:34:48 -08:00
bootmem.c bootmem: avoid freeing to bootmem after bootmem is done 2015-09-08 15:35:28 -07:00
cleancache.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
cma.c mm/cma.c: suppress warning 2015-11-05 19:34:48 -08:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
compaction.c mm, compaction: distinguish contended status in tracepoints 2015-11-05 19:34:48 -08:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
dmapool.c dmapool: fix overflow condition in pool_find_page() 2015-10-01 21:42:35 -04:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c debugfs: Pass bool pointer to debugfs_create_bool() 2015-10-04 11:36:07 +01:00
filemap.c mm: rename mem_cgroup_migrate to mem_cgroup_replace_page 2015-11-05 19:34:48 -08:00
frame_vector.c mm: fix docbook comment for get_vaddr_frames() 2015-11-05 19:34:48 -08:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm: make GUP handle pfn mapping unless FOLL_GET is requested 2015-09-04 16:54:41 -07:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c - Support for new MM features in ARCv2 cores (THP, PAE40) 2015-11-03 13:21:09 -08:00
hugetlb_cgroup.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
hugetlb.c mm, hugetlbfs: optimize when NUMA=n 2015-11-05 19:34:48 -08:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h mm: page migration fix PageMlocked on migrated pages 2015-11-05 19:34:48 -08:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm/kmemleak.c: remove unneeded initialization of object to NULL 2015-11-05 19:34:48 -08:00
ksm.c ksm: unstable_tree_search_insert error checking cleanup 2015-11-05 19:34:48 -08:00
list_lru.c memcg: simplify and inline __mem_cgroup_from_kmem 2015-11-05 19:34:48 -08:00
maccess.c mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe 2015-11-05 19:34:48 -08:00
madvise.c mm: madvise allow remove operation for hugetlbfs 2015-09-08 15:35:28 -07:00
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c mm/memblock: make memblock_remove_range() static 2015-11-05 19:34:48 -08:00
memcontrol.c memcg: fix thresholds for 32b architectures. 2015-11-05 19:34:48 -08:00
memory_hotplug.c mm/page_alloc: remove unused parameter in init_currently_empty_zone() 2015-11-05 19:34:48 -08:00
memory-failure.c mm: hwpoison: ratelimit messages from unpoison_memory() 2015-11-05 19:34:48 -08:00
memory.c mm, dax: fix DAX deadlocks 2015-10-16 11:42:28 -07:00
mempolicy.c mm: rename alloc_pages_exact_node() to __alloc_pages_node() 2015-09-08 15:35:28 -07:00
mempool.c mm/mempool: allow NULL `pool' pointer in mempool_destroy() 2015-09-08 15:35:28 -07:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c mm: migrate dirty page without clear_page_dirty_for_io etc 2015-11-05 19:34:48 -08:00
mincore.c mm/mincore: use offset_in_page macro 2015-11-05 19:34:48 -08:00
mlock.c mm: mlock: refactor mlock, munlock, and munlockall code 2015-11-05 19:34:48 -08:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c mm/mmap.c: change __install_special_mapping() args order 2015-11-05 19:34:48 -08:00
mmu_context.c
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx 2015-09-04 16:54:41 -07:00
mremap.c mm/mremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c mm: page_alloc: pass PFN to __free_pages_bootmem 2015-06-30 19:44:55 -07:00
nommu.c mm/nommu.c: drop unlikely inside BUG_ON() 2015-11-05 19:34:48 -08:00
oom_kill.c mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() 2015-11-05 19:34:48 -08:00
page_alloc.c memcg: simplify charging kmem pages 2015-11-05 19:34:48 -08:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c mm, page_isolation: make set/unset_migratetype_isolate() file-local 2015-09-08 15:35:28 -07:00
page_owner.c mm/page_owner: set correct gfp_mask on page_owner 2015-07-17 16:39:54 -07:00
page-writeback.c writeback: fix incorrect calculation of available memory for memcg domains 2015-10-12 10:31:13 -06:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c mm/percpu: use offset_in_page macro 2015-11-05 19:34:48 -08:00
pgtable-generic.c mm,thp: introduce flush_pmd_tlb_range 2015-10-17 17:48:20 +05:30
process_vm_access.c process_vm_access: switch to {compat_,}import_iovec() 2015-04-11 22:27:12 -04:00
quicklist.c
readahead.c mm: use only per-device readahead limit 2015-11-05 19:34:48 -08:00
rmap.c mm: page migration use migration entry for swapcache too 2015-11-05 19:34:48 -08:00
shmem.c tmpfs: avoid a little creat and stat slowdown 2015-11-05 19:34:48 -08:00
slab_common.c mm/slab_common.c: initialize kmem_cache pointer to NULL 2015-11-05 19:34:48 -08:00
slab.c memcg: unify slab and other kmem pages charging 2015-11-05 19:34:48 -08:00
slab.h memcg: unify slab and other kmem pages charging 2015-11-05 19:34:48 -08:00
slob.c mm: rename alloc_pages_exact_node() to __alloc_pages_node() 2015-09-08 15:35:28 -07:00
slub.c mm, slub, kasan: enable user tracking by default with KASAN=y 2015-11-05 19:34:48 -08:00
sparse-vmemmap.c
sparse.c
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c mm: swap: zswap: maybe_preload & refactoring 2015-09-08 15:35:28 -07:00
swap.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
swapfile.c mm: /proc/pid/smaps:: show proportional swap share of the mapping 2015-09-08 15:35:28 -07:00
truncate.c memcg: add per cgroup dirty page accounting 2015-06-02 08:33:33 -06:00
userfaultfd.c userfaultfd: avoid mmap_sem read recursion in mcopy_atomic 2015-09-04 16:54:41 -07:00
util.c mm/util: use offset_in_page macro 2015-11-05 19:34:48 -08:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm/vmalloc: use offset_in_page macro 2015-11-05 19:34:48 -08:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-02 17:32:07 -08:00
vmscan.c mm/vmscan.c: fix types of some locals 2015-11-05 19:34:48 -08:00
vmstat.c mm/vmstat.c: uninline node_page_state() 2015-11-05 19:34:48 -08:00
workingset.c list_lru: add helpers to isolate items 2015-02-12 18:54:10 -08:00
zbud.c mm: zbud: constify the zbud_ops 2015-09-08 15:35:28 -07:00
zpool.c zpool: add zpool_has_pool() 2015-09-10 13:29:01 -07:00
zsmalloc.c mm: zpool: constify the zpool_ops 2015-09-08 15:35:28 -07:00
zswap.c zswap: change zpool/compressor at runtime 2015-09-10 13:29:01 -07:00