Many migrate_page's caller check return value instead of list_empy by
cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED counting"). This patch
makes compaction's migrate_pages consistent with others. This patch
should not change old behavior.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
order > 0") due to reports stating that kswapd CPU usage was higher and
IRQs were being disabled more frequently. This was reported at
http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.
Without this patch applied, CPU usage by kswapd hovers around the 20% mark
according to the tester (Arthur Marsh:
http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
patch applied, it's around 2%.
The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
triggered by high-order allocations hitting the low watermark for their
order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
common trigger for this is network cards configured for jumbo frames but
it's also possible it'll be triggered by fork-heavy workloads (order-1)
and some wireless cards which depend on order-1 allocations.
The symptoms for the user will be high CPU usage by kswapd in low-memory
situations which could be confused with another writeback problem. While
a patch like 5a03b051 may be reintroduced in the future, this patch plays
it safe for now and reverts it.
[mel@csn.ul.ie: Beefed up the changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: <stable@kernel.org> [2.6.38.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.
This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.
The free area cache is reset if areas are freed behind it, or if we are
searching for a smaller area or alignment than last time. So allocation
patterns are not changed (verified by corner-case and random test cases in
userspace testing).
This solves a regression caused by lazy vunmap TLB purging introduced in
db64fe02 (mm: rewrite vmap layer). That patch will leave extents in the
vmap allocator after they are vunmapped, and until a significant number
accumulate that can be flushed in a single batch. So in a workload that
vmalloc/vfree frequently, a chain of extents will build up from
VMALLOC_START address, which have to be iterated over each time (giving an
O(n) type of behaviour).
After this patch, the search will start from where it left off, giving
closer to an amortized O(1).
This is verified to solve regressions reported Steven in GFS2, and Avi in
KVM.
Hugh's update:
: I tried out the recent mmotm, and on one machine was fortunate to hit
: the BUG_ON(first->va_start < addr) which seems to have been stalling
: your vmap area cache patch ever since May.
: I can get you addresses etc, I did dump a few out; but once I stared
: at them, it was easier just to look at the code: and I cannot see how
: you would be so sure that first->va_start < addr, once you've done
: that addr = ALIGN(max(...), align) above, if align is over 0x1000
: (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).
: I originally got around it by just changing the
: if (first->va_start < addr) {
: to
: while (first->va_start < addr) {
: without thinking about it any further; but that seemed unsatisfactory,
: why would we want to loop here when we've got another very similar
: loop just below it?
: I am never going to admit how long I've spent trying to grasp your
: "while (n)" rbtree loop just above this, the one with the peculiar
: if (!first && tmp->va_start < addr + size)
: in. That's unfamiliar to me, I'm guessing it's designed to save a
: subsequent rb_next() in a few circumstances (at risk of then setting
: a wrong cached_hole_size?); but they did appear few to me, and I didn't
: feel I could sign off something with that in when I don't grasp it,
: and it seems responsible for extra code and mistaken BUG_ON below it.
: I've reverted to the familiar rbtree loop that find_vma() does (but
: with va_end >= addr as you had, to respect the additional guard page):
: and then (given that cached_hole_size starts out 0) I don't see the
: need for any complications below it. If you do want to keep that loop
: as you had it, please add a comment to explain what it's trying to do,
: and where addr is relative to first when you emerge from it.
: Aren't your tests "size <= cached_hole_size" and
: "addr + size > first->va_start" forgetting the guard page we want
: before the next area? I've changed those.
: I have not changed your many "addr + size - 1 < addr" overflow tests,
: but have since come to wonder, shouldn't they be "addr + size < addr"
: tests - won't the vend checks go wrong if addr + size is 0?
: I have added a few comments - Wolfgang Wander's 2.6.13 description of
: 1363c3cd86 Avoiding mmap fragmentation
: helped me a lot, perhaps a pointer to that would be good too. And I found
: it easier to understand when I renamed cached_start slightly and moved the
: overflow label down.
: This patch would go after your mm-vmap-area-cache.patch in mmotm.
: Trivially, nobody is going to get that BUG_ON with this patch, and it
: appears to work fine on my machines; but I have not given it anything like
: the testing you did on your original, and may have broken all the
: performance you were aiming for. Please take a look and test it out
: integrate with yours if you're satisfied - thanks.
[akpm@linux-foundation.org: add locking comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reported-and-tested-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-and-tested-by: Avi Kivity <avi@redhat.com>
Tested-by: "Barry J. Marson" <bmarson@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm. This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed. This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.
The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.
This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path. It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org> [2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set. Memory needs to be freed, so find kill
other threads that pin the same ->mm or find another task to kill.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:
495789a5 (oom: make oom_score to per-process value)
cef1d352 (oom: multi threaded process coredump don't make deadlock)
First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.
It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks. If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes. Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.
By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.
Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current isfound to be exiting. We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no loner oom. If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.
Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org> [2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an administrator tries to swapon a file backed by NFS, the inode mutex is
taken (as it is for any swapfile) but later identified to be a bad swapfile
due to the lack of bmap and tries to cleanup. During cleanup, an attempt is
made to close the file but with inode->i_mutex still held. Closing an NFS
file syncs it which tries to acquire the inode mutex leading to deadlock. If
lockdep is enabled the following appears on the console;
=============================================
[ INFO: possible recursive locking detected ]
2.6.38-rc8-autobuild #1
---------------------------------------------
swapon/2192 is trying to acquire lock:
(&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c
but task is already holding lock:
(&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
other info that might help us debug this:
1 lock held by swapon/2192:
#0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
stack backtrace:
Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
Call Trace:
__lock_acquire+0x2eb/0x1623
find_get_pages_tag+0x14a/0x174
pagevec_lookup_tag+0x25/0x2e
vfs_fsync_range+0x47/0x7c
lock_acquire+0xd3/0x100
vfs_fsync_range+0x47/0x7c
nfs_flush_one+0x0/0xdf [nfs]
mutex_lock_nested+0x40/0x2b1
vfs_fsync_range+0x47/0x7c
vfs_fsync_range+0x47/0x7c
vfs_fsync+0x1c/0x1e
nfs_file_flush+0x64/0x69 [nfs]
filp_close+0x43/0x72
sys_swapon+0xa39/0xae7
sysret_check+0x2e/0x69
system_call_fastpath+0x16/0x1b
This patch releases the mutex if its held before calling filep_close()
so swapon fails as expected without deadlock when the swapfile is backed
by NFS. If accepted for 2.6.39, it should also be considered a -stable
candidate for 2.6.38 and 2.6.37.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@kernel.org> [2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
slab,rcu: don't assume the size of struct rcu_head
slub,rcu: don't assume the size of struct rcu_head
slub: automatically reserve bytes at the end of slab
Lockless (and preemptless) fastpaths for slub
slub: Get rid of slab_free_hook_irq()
slub: min_partial needs to be in first cacheline
slub: fix ksize() build error
slub: fix kmemcheck calls to match ksize() hints
Revert "slab: Fix missing DEBUG_SLAB last user"
mm: Remove support for kmem_cache_name()
The redo label needs #ifdeffery. Fixes the following problem introduced by
commit 8a5ec0ba42 ("Lockless (and preemptless) fastpaths for slub"):
mm/slub.c: In function 'slab_free':
mm/slub.c:2124: warning: label 'redo' defined but not used
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
* 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits)
KVM: unbreak userspace that does not sets tss address
KVM: MMU: cleanup pte write path
KVM: MMU: introduce a common function to get no-dirty-logged slot
KVM: fix rcu usage in init_rmode_* functions
KVM: fix kvmclock regression due to missing clock update
KVM: emulator: Fix permission checking in io permission bitmap
KVM: emulator: Fix io permission checking for 64bit guest
KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n
KVM: x86: Remove useless regs_page pointer from kvm_lapic
KVM: improve comment on rcu use in irqfd_deassign
KVM: MMU: remove unused macros
KVM: MMU: cleanup page alloc and free
KVM: MMU: do not record gfn in kvm_mmu_pte_write
KVM: MMU: move mmu pages calculated out of mmu lock
KVM: MMU: set spte accessed bit properly
KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits
KVM: Start lock documentation
KVM: better readability of efer_reserved_bits
KVM: Clear async page fault hash after switching to real mode
KVM: VMX: Initialize vm86 TSS only once.
...
Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robusteness against page_mapcount() undeflows.
Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified. With this patch, the interested callers
can distinguish HWPOISON pages from general FAULT pages, while other
callers will still get -EFAULT for all these pages, so the user space
interface need not to be changed.
This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO will be
tried for general FAULT page.
The idea comes from Andrew Morton.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In most cases, get_user_pages and get_user_pages_fast should be used
to pin user pages in memory. But sometimes, some special flags except
FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in
following patch, KVM needs FOLL_HWPOISON. To support these users,
__get_user_pages is exported directly.
There are some symbol name conflicts in infiniband driver, fixed them too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michel Lespinasse <walken@google.com>
CC: Roland Dreier <roland@kernel.org>
CC: Ralph Campbell <infinipath@qlogic.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
AppArmor: kill unused macros in lsm.c
AppArmor: cleanup generated files correctly
KEYS: Add an iovec version of KEYCTL_INSTANTIATE
KEYS: Add a new keyctl op to reject a key with a specified error code
KEYS: Add a key type op to permit the key description to be vetted
KEYS: Add an RCU payload dereference macro
AppArmor: Cleanup make file to remove cruft and make it easier to read
SELinux: implement the new sb_remount LSM hook
LSM: Pass -o remount options to the LSM
SELinux: Compute SID for the newly created socket
SELinux: Socket retains creator role and MLS attribute
SELinux: Auto-generate security_is_socket_class
TOMOYO: Fix memory leak upon file open.
Revert "selinux: simplify ioctl checking"
selinux: drop unused packet flow permissions
selinux: Fix packet forwarding checks on postrouting
selinux: Fix wrong checks for selinux_policycap_netpeer
selinux: Fix check for xfrm selinux context algorithm
ima: remove unnecessary call to ima_must_measure
IMA: remove IMA imbalance checking
...
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (93 commits)
x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()
x86-64, NUMA: Don't call numa_set_distanc() for all possible node combinations during emulation
x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()
x86-64, NUMA: Clean up initmem_init()
x86-64, NUMA: Fix numa_emulation code with node0 without RAM
x86-64, NUMA: Revert NUMA affine page table allocation
x86: Work around old gas bug
x86-64, NUMA: Better explain numa_distance handling
x86-64, NUMA: Fix distance table handling
mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK
x86-64, NUMA: Fix size of numa_distance array
x86: Rename e820_table_* to pgt_buf_*
bootmem: Move __alloc_memory_core_early() to nobootmem.c
bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance()
x86-64, NUMA: Add proper function comments to global functions
x86-64, NUMA: Move NUMA emulation into numa_emulation.c
x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
x86-64, NUMA: Do not scan two times for setup_node_bootmem()
...
Fix up conflicts in arch/x86/kernel/smpboot.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
tidy the trailing symlinks traversal up
Turn resolution of trailing symlinks iterative everywhere
simplify link_path_walk() tail
Make trailing symlink resolution in path_lookupat() iterative
update nd->inode in __do_follow_link() instead of after do_follow_link()
pull handling of one pathname component into a helper
fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
readlinkat(), fchownat() and fstatat() with empty relative pathnames
Allow O_PATH for symlinks
New kind of open files - "location only".
ext4: Copy fs UUID to superblock
ext3: Copy fs UUID to superblock.
vfs: Export file system uuid via /proc/<pid>/mountinfo
unistd.h: Add new syscalls numbers to asm-generic
x86: Add new syscalls for x86_64
x86: Add new syscalls for x86_32
fs: Remove i_nlink check from file system link callback
fs: Don't allow to create hardlink for deleted file
vfs: Add open by file handle support
...
This reverts the parent commit. I hate doing that, but it's generating
some discussion ("half of it is right"), and since I am planning on
doing the 2.6.38 release later today we can punt it to stable if
required. Let's not rock the boat right now.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.
Also, "children has a different mm" doesn't match the reality, we should
check child->mm != t->mm. This check is not exactly correct if t->mm ==
NULL but this doesn't really matter, oom_kill_task() will kill them
anyway.
Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When vmscan.c calls page_referenced(), if an anon page was created
before a process forked, rmap will search for it in both of the
processes, even though one of them might have since broken COW.
If the child process mlocks the vma where the COWed page belongs to,
page_referenced() running on the page mapped by the parent would lead to
*vm_flags getting VM_LOCKED set erroneously (leading to the references
on the parent page being ignored and evicting the parent page too
early).
*mapcount would also be decremented by page_referenced_one even if the
page wasn't found by page_check_address.
This also lets pmdp_clear_flush_young_notify() go ahead on a
pmd_trans_splitting() pmd.
We hold the page_table_lock so __split_huge_page_map() must wait the
pmdp_clear_flush_young_notify() to complete before it can modify the
pmd. The pmd is also still mapped in userland so the young bit may
materialize through a tlb miss before split_huge_page_map runs.
This will provide a more accurate page_referenced() behavior during
split_huge_page().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel<riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of struct rcu_head may be changed. When it becomes larger,
it may pollute the data after struct slab.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The size of struct rcu_head may be changed. When it becomes larger,
it will pollute the page array.
We reserve some some bytes for struct rcu_head when a slab
is allocated in this situation.
Changed from V1:
use VM_BUG_ON instead BUG_ON
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
There is no "struct" for slub's slab, it shares with struct page.
But struct page is very small, it is insufficient when we need
to add some metadata for slab.
So we add a field "reserved" to struct kmem_cache, when a slab
is allocated, kmem_cache->reserved bytes are automatically reserved
at the end of the slab for slab's metadata.
Changed from v1:
Export the reserved field via sysfs
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Use the this_cpu_cmpxchg_double functionality to implement a lockless
allocation algorithm on arches that support fast this_cpu_ops.
Each of the per cpu pointers is paired with a transaction id that ensures
that updates of the per cpu information can only occur in sequence on
a certain cpu.
A transaction id is a "long" integer that is comprised of an event number
and the cpu number. The event number is incremented for every change to the
per cpu state. This means that the cmpxchg instruction can verify for an
update that nothing interfered and that we are updating the percpu structure
for the processor where we picked up the information and that we are also
currently on that processor when we update the information.
This results in a significant decrease of the overhead in the fastpaths. It
also makes it easy to adopt the fast path for realtime kernels since this
is lockless and does not require the use of the current per cpu area
over the critical section. It is only important that the per cpu area is
current at the beginning of the critical section and at the end.
So there is no need even to disable preemption.
Test results show that the fastpath cycle count is reduced by up to ~ 40%
(alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
adds a few cycles.
Sadly this does nothing for the slowpath which is where the main issues with
performance in slub are but the best case performance rises significantly.
(For that see the more complex slub patches that require cmpxchg_double)
Kmalloc: alloc/free test
Before:
10000 times kmalloc(8)/kfree -> 134 cycles
10000 times kmalloc(16)/kfree -> 152 cycles
10000 times kmalloc(32)/kfree -> 144 cycles
10000 times kmalloc(64)/kfree -> 142 cycles
10000 times kmalloc(128)/kfree -> 142 cycles
10000 times kmalloc(256)/kfree -> 132 cycles
10000 times kmalloc(512)/kfree -> 132 cycles
10000 times kmalloc(1024)/kfree -> 135 cycles
10000 times kmalloc(2048)/kfree -> 135 cycles
10000 times kmalloc(4096)/kfree -> 135 cycles
10000 times kmalloc(8192)/kfree -> 144 cycles
10000 times kmalloc(16384)/kfree -> 754 cycles
After:
10000 times kmalloc(8)/kfree -> 78 cycles
10000 times kmalloc(16)/kfree -> 78 cycles
10000 times kmalloc(32)/kfree -> 82 cycles
10000 times kmalloc(64)/kfree -> 88 cycles
10000 times kmalloc(128)/kfree -> 79 cycles
10000 times kmalloc(256)/kfree -> 79 cycles
10000 times kmalloc(512)/kfree -> 85 cycles
10000 times kmalloc(1024)/kfree -> 82 cycles
10000 times kmalloc(2048)/kfree -> 82 cycles
10000 times kmalloc(4096)/kfree -> 85 cycles
10000 times kmalloc(8192)/kfree -> 82 cycles
10000 times kmalloc(16384)/kfree -> 706 cycles
Kmalloc: Repeatedly allocate then free test
Before:
10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
After:
10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The following patch will make the fastpaths lockless and will no longer
require interrupts to be disabled. Calling the free hook with irq disabled
will no longer be possible.
Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
interrupts if the features are selected that require callbacks with
interrupts off and reenable after calls have been made.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Pass down the correct node for a transparent hugepage allocation. Most
callers continue to use the current node, however the hugepaged daemon
now uses the previous node of the first to be collapsed page instead.
This ensures that khugepaged does not mess up local memory for an
existing process which uses local policy.
The choice of node is somewhat primitive currently: it just uses the
node of the first page in the pmd range. An alternative would be to
look at multiple pages and use the most popular node. I used the
simplest variant for now which should work well enough for the case of
all pages being on the same node.
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes a difference for LOCAL policy, where the node cannot be
determined from the policy itself, but has to be gotten from the original
page.
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently alloc_pages_vma() always uses the local node as policy node for
the LOCAL policy. Pass this node down as an argument instead.
No behaviour change from this patch, but will be needed for followons.
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slub.c: In function 'ksize':
mm/slub.c:2728: error: implicit declaration of function 'slab_ksize'
slab_ksize() needs to go out of CONFIG_SLUB_DEBUG section.
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Heiko found recent memblock change triggers these warnings on s390:
mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used
mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used
Need to move those two function under HAVE_MEMBLOCK with its only
user, find_memory_core_early().
-tj: Minor updates to description.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
It seems odd that truncate_inode_pages_range(), called not only when
truncating but also when evicting inodes, has mem_cgroup_uncharge_start
and _end() batching in its second loop to clear up a few leftovers, but
not in its first loop that does almost all the work: add them there too.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The THP code didn't pass the correct interleaving shift to the memory
policy code. Fix this here by adjusting for the order.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
should_continue_reclaim() for reclaim/compaction allows scanning to
continue even if pages are not being reclaimed until the full list is
scanned. In terms of allocation success, this makes sense but potentially
it introduces unwanted latency for high-order allocations such as
transparent hugepages and network jumbo frames that would prefer to fail
the allocation attempt and fallback to order-0 pages. Worse, there is a
potential that the full LRU scan will clear all the young bits, distort
page aging information and potentially push pages into swap that would
have otherwise remained resident.
This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
LRU list may still be scanned.
Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
is not set so the following avoids the gfp_mask being examined:
if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;
A tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and
three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
applied.
STREAM Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 10298 10229
1 :: Min 0.4560 0.4640
1 :: Mean 1.0589 1.0183
1 :: Max 14.5990 11.7510
1 :: Stddev 0.5208 0.4719
2 :: Count 2 1
2 :: Min 1.8610 3.7240
2 :: Mean 3.4325 3.7240
2 :: Max 5.0040 3.7240
2 :: Stddev 1.5715 0.0000
9 :: Count 111696 111694
9 :: Min 0.5230 0.4110
9 :: Mean 10.5831 10.5718
9 :: Max 38.4480 43.2900
9 :: Stddev 1.1147 1.1325
Mean time for order-1 allocations is reduced. order-2 looks increased but
with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.
Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.
SysBench Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 15745 15677
1 :: Min 0.4250 0.4550
1 :: Mean 1.1023 1.0810
1 :: Max 14.4590 10.8220
1 :: Stddev 0.5117 0.5100
2 :: Count 1 1
2 :: Min 3.0040 2.1530
2 :: Mean 3.0040 2.1530
2 :: Max 3.0040 2.1530
2 :: Stddev 0.0000 0.0000
9 :: Count 2017 1931
9 :: Min 0.4980 0.7480
9 :: Mean 10.4717 10.3840
9 :: Max 24.9460 26.2500
9 :: Stddev 1.1726 1.1966
Again, mean time for order-1 allocations is reduced while order-2
allocations are too few to draw conclusions from. The mean time for THP
allocations is also slightly reduced albeit the reductions are within
varianes.
Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.
Anon stream mmap reference Highorder Allocation Latency Statistics
1 :: Count 1376 1790
1 :: Min 0.4940 0.5010
1 :: Mean 1.0289 0.9732
1 :: Max 6.2670 4.2540
1 :: Stddev 0.4142 0.2785
2 :: Count 1 -
2 :: Min 1.9060 -
2 :: Mean 1.9060 -
2 :: Max 1.9060 -
2 :: Stddev 0.0000 -
9 :: Count 11266 11257
9 :: Min 0.4990 0.4940
9 :: Mean 27250.4669 24256.1919
9 :: Max 11439211.0000 6008885.0000
9 :: Stddev 226427.4624 186298.1430
This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap
quite heavily and is intended to exercise THP a bit.
Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to
swap. As before, maximum allocation times are significantly reduced.
Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.
The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments
due to reclaim/compaction breaking early. However, workloads care more
about the latency of lower-order allocations than THP so it's an
acceptable trade-off.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Grab a reference to bdev before calling blkdev_get(), which expects
the refcount to be already incremented and either returns success or
decrements the refcount and returns an error.
The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that bootmem.c and nobootmem.c are separate, there's no reason to
define __alloc_memory_core_early(), which is used only by nobootmem,
inside #ifdef in page_alloc.c. Move it to nobootmem.c and make it
static.
This patch doesn't introduce any behavior change.
-tj: Updated commit description.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>