Commit Graph

10214 Commits

Author SHA1 Message Date
Linus Torvalds
c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Linus Torvalds
7f427d3a60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro.

This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.

The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.

The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).

A more detailed explanation of the process from Al Viro:
 "The xattr work starts with some acl fixes, then switches ->getxattr to
  passing inode and dentry separately.  This is the point where the
  things start to get tricky - that got merged into the very beginning
  of the -rc3-based #work.lookups, to allow untangling the
  security_d_instantiate() mess.  The xattr work itself proceeds to
  switch a lot of filesystems to generic_...xattr(); no complications
  there.

  After that initial xattr work, the series then does the following:

   - untangle security_d_instantiate()

   - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
     that thing; one such place (in overlayfs) actually yields a trivial
     conflict with overlayfs fixes later in the cycle - overlayfs ended
     up switching to a variant of lookup_one_len_unlocked() sans the
     permission checks.  I would've dropped that commit (it gets
     overridden on merge from #ovl-fixes in #for-next; proper resolution
     is to use the variant in mainline fs/overlayfs/super.c), but I
     didn't want to rebase the damn thing - it was fairly late in the
     cycle...

   - some filesystems had managed to depend on lookup/lookup exclusion
     for *fs-internal* data structures in a way that would break if we
     relaxed the VFS exclusion.  Fixing hadn't been hard, fortunately.

   - core of that series - parallel lookup machinery, replacing
     ->i_mutex with rwsem, making lookup_slow() take it only shared.  At
     that point lookups happen in parallel; lookups on the same name
     wait for the in-progress one to be done with that dentry.

     Surprisingly little code, at that - almost all of it is in
     fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
     making it use the new primitive and actually switching to locking
     shared.

   - parallel readdir stuff - first of all, we provide the exclusion on
     per-struct file basis, same as we do for read() vs lseek() for
     regular files.  That takes care of most of the needed exclusion in
     readdir/readdir; however, these guys are trickier than lookups, so
     I went for switching them one-by-one.  To do that, a new method
     '->iterate_shared()' is added and filesystems are switched to it
     as they are either confirmed to be OK with shared lock on directory
     or fixed to be OK with that.  I hope to kill the original method
     come next cycle (almost all in-tree filesystems are switched
     already), but it's still not quite finished.

   - several filesystems get switched to parallel readdir.  The
     interesting part here is dealing with dcache preseeding by readdir;
     that needs minor adjustment to be safe with directory locked only
     shared.

     Most of the filesystems doing that got switched to in those
     commits.  Important exception: NFS.  Turns out that NFS folks, with
     their, er, insistence on VFS getting the fuck out of the way of the
     Smart Filesystem Code That Knows How And What To Lock(tm) have
     grown the locking of their own.  They had their own homegrown
     rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
     is the reader there).  Of course, with VFS getting the fuck out of
     the way, as requested, the actual smarts of the smart filesystem
     code etc. had become exposed...

   - do_last/lookup_open/atomic_open cleanups.  As the result, open()
     without O_CREAT locks the directory only shared.  Including the
     ->atomic_open() case.  Backmerge from #for-linus in the middle of
     that - atomic_open() fix got brought in.

   - then comes NFS switch to saner (VFS-based ;-) locking, killing the
     homegrown "lookup and readdir are writers" kinda-sorta rwsem.  All
     exclusion for sillyunlink/lookup is done by the parallel lookups
     mechanism.  Exclusion between sillyunlink and rmdir is a real rwsem
     now - rmdir being the writer.

     Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
     now.

   - the rest of the series consists of switching a lot of filesystems
     to parallel readdir; in a lot of cases ->llseek() gets simplified
     as well.  One backmerge in there (again, #for-linus - rockridge
     fix)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
  ext4: switch to ->iterate_shared()
  hfs: switch to ->iterate_shared()
  hfsplus: switch to ->iterate_shared()
  hostfs: switch to ->iterate_shared()
  hpfs: switch to ->iterate_shared()
  hpfs: handle allocation failures in hpfs_add_pos()
  gfs2: switch to ->iterate_shared()
  f2fs: switch to ->iterate_shared()
  afs: switch to ->iterate_shared()
  befs: switch to ->iterate_shared()
  befs: constify stuff a bit
  isofs: switch to ->iterate_shared()
  get_acorn_filename(): deobfuscate a bit
  btrfs: switch to ->iterate_shared()
  logfs: no need to lock directory in lseek
  switch ecryptfs to ->iterate_shared
  9p: switch to ->iterate_shared()
  fat: switch to ->iterate_shared()
  romfs, squashfs: switch to ->iterate_shared()
  more trivial ->iterate_shared conversions
  ...
2016-05-17 11:01:31 -07:00
Al Viro
0e0162bb8c Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
2016-05-17 02:17:59 -04:00
Linus Torvalds
825a3b2605 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - massive CPU hotplug rework (Thomas Gleixner)

 - improve migration fairness (Peter Zijlstra)

 - CPU load calculation updates/cleanups (Yuyang Du)

 - cpufreq updates (Steve Muckle)

 - nohz optimizations (Frederic Weisbecker)

 - switch_mm() micro-optimization on x86 (Andy Lutomirski)

 - ... lots of other enhancements, fixes and cleanups.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
  ARM: Hide finish_arch_post_lock_switch() from modules
  sched/core: Provide a tsk_nr_cpus_allowed() helper
  sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
  sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
  sched/fair: Correct unit of load_above_capacity
  sched/fair: Clean up scale confusion
  sched/nohz: Fix affine unpinned timers mess
  sched/fair: Fix fairness issue on migration
  sched/core: Kill sched_class::task_waking to clean up the migration logic
  sched/fair: Prepare to fix fairness problems on migration
  sched/fair: Move record_wakee()
  sched/core: Fix comment typo in wake_q_add()
  sched/core: Remove unused variable
  sched: Make hrtick_notifier an explicit call
  sched/fair: Make ilb_notifier an explicit call
  sched/hotplug: Make activate() the last hotplug step
  sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
  sched/migration: Move CPU_ONLINE into scheduler state
  sched/migration: Move calc_load_migrate() into CPU_DYING
  sched/migration: Move prepare transition to SCHED_STARTING state
  ...
2016-05-16 14:47:16 -07:00
Andrea Arcangeli
6d0a07edd1 mm: thp: calculate the mapcount correctly for THP pages during WP faults
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.

This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.

Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.

Mike Marciniszyn said:

: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data.  The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct , the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosse two pages that are not physically
: contiguous.   The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!

Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Josh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org>	[4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Zhou Chengming
7496fea9a6 ksm: fix conflict between mmput and scan_get_next_rmap_item
A concurrency issue about KSM in the function scan_get_next_rmap_item.

task A (ksmd):				|task B (the mm's task):
					|
mm = slot->mm;				|
down_read(&mm->mmap_sem);		|
					|
...					|
					|
spin_lock(&ksm_mmlist_lock);		|
					|
ksm_scan.mm_slot go to the next slot;	|
					|
spin_unlock(&ksm_mmlist_lock);		|
					|mmput() ->
					|	ksm_exit():
					|
					|spin_lock(&ksm_mmlist_lock);
					|if (mm_slot && ksm_scan.mm_slot != mm_slot) {
					|	if (!mm_slot->rmap_list) {
					|		easy_to_free = 1;
					|		...
					|
					|if (easy_to_free) {
					|	mmdrop(mm);
					|	...
					|
					|So this mm_struct may be freed in the mmput().
					|
up_read(&mm->mmap_sem);			|

As we can see above, the ksmd thread may access a mm_struct that already
been freed to the kmem_cache.  Suppose a fork will get this mm_struct from
the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will
cause mmap_sem.count to become -1.

As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
the same SMP race condition, so fix it too.  My prev fix in function
scan_get_next_rmap_item will introduce a different SMP race condition, so
just invert the up_read/spin_unlock order as Andrea Arcangeli said.

Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Ingo Molnar
eb60b3e5e8 Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:18:13 +02:00
Sergey Senozhatsky
44f43e99fe zsmalloc: fix zs_can_compact() integer overflow
zs_can_compact() has two race conditions in its core calculation:

unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
				zs_stat_get(class, OBJ_USED);

1) classes are not locked, so the numbers of allocated and used
   objects can change by the concurrent ops happening on other CPUs
2) shrinker invokes it from preemptible context

Depending on the circumstances, thus, OBJ_ALLOCATED can become
less than OBJ_USED, which can result in either very high or
negative `total_scan' value calculated later in do_shrink_slab().

do_shrink_slab() has some logic to prevent those cases:

 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62

However, due to the way `total_scan' is calculated, not every
shrinker->count_objects() overflow can be spotted and handled.
To demonstrate the latter, I added some debugging code to do_shrink_slab()
(x86_64) and the results were:

 vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
 vmscan: but total_scan > 0: 92679974445502
 vmscan: resulting total_scan: 92679974445502
[..]
 vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
 vmscan: but total_scan > 0: 22634041808232578
 vmscan: resulting total_scan: 22634041808232578

Even though shrinker->count_objects() has returned an overflowed value,
the resulting `total_scan' is positive, and, what is more worrisome, it
is insanely huge. This value is getting used later on in
shrinker->scan_objects() loop:

        while (total_scan >= batch_size ||
               total_scan >= freeable) {
                unsigned long ret;
                unsigned long nr_to_scan = min(batch_size, total_scan);

                shrinkctl->nr_to_scan = nr_to_scan;
                ret = shrinker->scan_objects(shrinker, shrinkctl);
                if (ret == SHRINK_STOP)
                        break;
                freed += ret;

                count_vm_events(SLABS_SCANNED, nr_to_scan);
                total_scan -= nr_to_scan;

                cond_resched();
        }

`total_scan >= batch_size' is true for a very-very long time and
'total_scan >= freeable' is also true for quite some time, because
`freeable < 0' and `total_scan' is large enough, for example,
22634041808232578. The only break condition, in the given scheme of
things, is shrinker->scan_objects() == SHRINK_STOP test, which is a
bit too weak to rely on, especially in heavy zsmalloc-usage scenarios.

To fix the issue, take a pool stat snapshot and use it instead of
racy zs_stat_get() calls.

Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>        [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09 17:40:59 -07:00
Linus Torvalds
0783783104 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull writeback fix from Jens Axboe:
 "Just a single fix for domain aware writeback, fixing a regression that
  can cause balance_dirty_pages() to keep looping while not getting any
  work done"

* 'for-linus' of git://git.kernel.dk/linux-block:
  writeback: Fix performance regression in wb_over_bg_thresh()
2016-05-06 13:08:35 -07:00
Vlastimil Babka
172400c69c mm: fix kcompactd hang during memory offlining
Assume memory47 is the last online block left in node1.  This will hang:

  # echo offline > /sys/devices/system/node/node1/memory47/state

After a couple of minutes, the following pops up in dmesg:

  INFO: task bash:957 blocked for more than 120 seconds.
         Not tainted 4.6.0-rc6+ #6
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  bash            D ffff8800b7adbaf8     0   957    951 0x00000000
  Call Trace:
    schedule+0x35/0x80
    schedule_timeout+0x1ac/0x270
    wait_for_completion+0xe1/0x120
    kthread_stop+0x4f/0x110
    kcompactd_stop+0x26/0x40
    __offline_pages.constprop.28+0x7e6/0x840
    offline_pages+0x11/0x20
    memory_block_action+0x73/0x1d0
    memory_subsys_offline+0x47/0x60
    device_offline+0x86/0xb0
    store_mem_state+0xda/0xf0
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x37/0x40
    kernfs_fop_write+0x11d/0x170
    __vfs_write+0x37/0x120
    vfs_write+0xa9/0x1a0
    SyS_write+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa4

kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
actually exit.  Check kthread_should_stop() to break out of the wait.

Fixes: 698b1b306 ("mm, compaction: introduce kcompactd").
Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Dan Streetman
32a4e16903 mm/zswap: provide unique zpool name
Instead of using "zswap" as the name for all zpools created, add an
atomic counter and use "zswap%x" with the counter number for each zpool
created, to provide a unique name for each new zpool.

As zsmalloc, one of the zpool implementations, requires/expects a unique
name for each pool created, zswap should provide a unique name.  The
zsmalloc pool creation does not fail if a new pool with a conflicting
name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
zsmalloc pool creation fails with -ENOMEM.  Then zswap will be unable to
change its compressor parameter if its zpool is zsmalloc; it also will
be unable to change its zpool parameter back to zsmalloc, if it has any
existing old zpool using zsmalloc with page(s) in it.  Attempts to
change the parameters will result in failure to create the zpool.  This
changes zswap to provide a unique name for each zpool creation.

Fixes: f1c54846ee ("zswap: dynamic pool creation")
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Hugh Dickins
14af4a5e9b mm, cma: prevent nr_isolated_* counters from going negative
/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file go
increasingly negative under compaction: which would add delay when
should be none, or no delay when should delay.  The bug in compaction
was due to a recent mmotm patch, but much older instance of the bug was
also noticed in isolate_migratepages_range() which is used for CMA and
gigantic hugepage allocations.

The bug is caused by putback_movable_pages() in an error path
decrementing the isolated counters without them being previously
incremented by acct_isolated().  Fix isolate_migratepages_range() by
removing the error-path putback, thus reaching acct_isolated() with
migratepages still isolated, and leaving putback to caller like most
other places do.

Fixes: edc2ca6124 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
[vbabka@suse.cz: expanded the changelog]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Jason Baron
bc22af74f2 mm: update min_free_kbytes from khugepaged after core initialization
Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.

Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.

The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored.  I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().

This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering.  On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.

Fixes: 79553da293 ("thp: cleanup khugepaged startup")
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Hugh Dickins
684283988f huge pagecache: mmap_sem is unlocked when truncation splits pmd
zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
be invalid with huge pagecache, in whatever way it is implemented:
truncation of a hugely-mapped file to an unhugely-aligned size would
easily hit it.

(Although anon THP could in principle apply khugepaged to private file
mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
practice there's a vm_ops check which excludes them, so it never hits
this BUG() - there's no interface to "truncate" an anonymous mapping.)

We could complicate the test, to check i_mmap_rwsem also when there's a
vm_file; but my inclination was to make zap_pmd_range() more readable by
simply deleting this check.  A search has shown no report of the issue
in the years since commit e0897d75f0 ("mm, thp: print useful
information when mmap_sem is unlocked in zap_pmd_range") expanded it
from VM_BUG_ON() - though I cannot point to what commit I would say then
fixed the issue.

But there are a couple of other patches now floating around, neither yet
in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(), as
Matthew Wilcox has done; but subject to a vma_is_anonymous() check, as
Kirill Shutemov has done.  And let's get this in, without waiting for
any particular huge pagecache implementation to reach the tree.

Matthew said "We can reproduce this BUG() in the current Linus tree with
DAX PMDs".

Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Yang Shi
145bdaa150 mm: thp: correct split_huge_pages file permission
split_huge_pages doesn't support get method at all, so the read
permission sounds confusing, change the permission to write only.

And, add "\n" to the output of set method to make it more readable.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Howard Cochran
74d3694433 writeback: Fix performance regression in wb_over_bg_thresh()
Commit 947e9762a8 ("writeback: update wb_over_bg_thresh() to use
wb_domain aware operations") unintentionally changed this function's
meaning from "are there more dirty pages than the background writeback
threshold" to "are there more dirty pages than the writeback threshold".
The background writeback threshold is typically half of the writeback
threshold, so this had the effect of raising the number of dirty pages
required to cause a writeback worker to perform background writeout.

This can cause a very severe performance regression when a BDI uses
BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
can now disagree on whether writeback should be initiated.

For example, in a system having 1GB of RAM, a single spinning disk, and a
"pass-through" FUSE filesystem mounted over the disk, application code
mmapped a 128MB file on the disk and was randomly dirtying pages in that
mapping.

Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
dirtying processes when we have 151 dirty pages and wakes up a background
writeback worker. But the worker tests the wrong threshold (200 instead of
100), so it does not initiate writeback and just returns.

Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
worker who will do nothing. It remains stuck in this state until the few
dirty pages that we have finally expire and we write them back for that
reason. Then the whole process repeats, resulting in near-zero throughput
through the FUSE BDI.

The fix is to call the parameterized variant of wb_calc_thresh, so that the
worker will do writeback if the bg_thresh is exceeded which was the
behavior before the referenced commit.

Fixes: 947e9762a8 ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
Signed-off-by: Howard Cochran <hcochran@kernelspring.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
Tested-by Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-05 15:44:55 -06:00
Al Viro
84e710da2a parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not.  And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
	* we verify that there's no hashed match
	* existing in-lookup match gets hashed by another process
	* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup.  Then the above would turn into
	* read the counter
	* do dcache lookup
	* if no matches found, check for in-lookup matches
	* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and updating logics; the readers will be
added in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:26 -04:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
1af5bb491f filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c64fb5c744 filemap: remove pos variables in generic_file_read_iter
Just use ki_pos directly to make everyones life easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Konstantin Khlebnikov
c2e7e00b71 mm/memory-failure: fix race with compound page split/merge
get_hwpoison_page() must recheck relation between head and tail pages.

n-horiguchi said: without this recheck, the race causes kernel to pin an
irrelevant page, and finally makes kernel crash for refcount mismatch.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Vlastimil Babka
fd901c9538 mm: wake kcompactd before kswapd's short sleep
When kswapd goes to sleep it checks if the node is balanced and at first
it sleeps only for HZ/10 time, then rechecks if the node is still
balanced and nobody has woken it during the initial sleep.  Only then it
goes fully sleep until an allocation slowpath wakes it up again.

For higher-order allocations, waking up kcompactd is done only before
the full sleep.  This turns out to be an issue in case another
high-order allocation fails during the initial sleep.  It will wake
kswapd up, however kswapd considers the zone balanced from the order-0
perspective, and will just quickly try to sleep again.  So if there's a
longer stream of high-order allocations hitting the slowpath and waking
up kswapd, it might never actually wake up kcompactd, which may be
considered a regression from kswapd-based compaction.  In the worst
case, it might be that a single allocation that cannot direct
reclaim/compact itself is waking kswapd in the retry loop and preventing
kcompactd from being woken up and unblocking it.

This patch makes sure kcompactd is woken up in such situations by simply
moving the wakeup before the short initial sleep.  More efficient
solution would be to wake kcompactd immediately instead of kswapd if the
node is already order-0 balanced, but in that case we should also move
reset_isolation_suitable() call to kcompactd so it's not adding to the
allocator's latency.  Since it's late in the 4.6 cycle, let's go with
the simpler change for now.

Fixes: accf62422b ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Minchan Kim
d7e69488bd mm/hwpoison: fix wrong num_poisoned_pages accounting
Currently, migration code increses num_poisoned_pages on *failed*
migration page as well as successfully migrated one at the trial of
memory-failure.  It will make the stat wrong.  As well, it marks the
page as PG_HWPoison even if the migration trial failed.  It would mean
we cannot recover the corrupted page using memory-failure facility.

This patches fixes it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Minchan Kim
b06bad17c7 mm: call swap_slot_free_notify() with page lock held
Kyeongdon reported below error which is BUG_ON(!PageSwapCache(page)) in
page_swap_info.  The reason is that page_endio in rw_page unlocks the
page if read I/O is completed so we need to hold a PG_lock again to
check PageSwapCache.  Otherwise, the page can be removed from swapcache.

  Kernel BUG at c00f9040 [verbose debug info unavailable]
  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
  Modules linked in:
  CPU: 4 PID: 13446 Comm: RenderThread Tainted: G        W 3.10.84-g9f14aec-dirty #73
  task: c3b73200 ti: dd192000 task.ti: dd192000
  PC is at page_swap_info+0x10/0x2c
  LR is at swap_slot_free_notify+0x18/0x6c
  pc : [<c00f9040>]    lr : [<c00f5560>]    psr: 400f0113
  sp : dd193d78  ip : c2deb1e4  fp : da015180
  r10: 00000000  r9 : 000200da  r8 : c120fe08
  r7 : 00000000  r6 : 00000000  r5 : c249a6c0  r4 : = c249a6c0
  r3 : 00000000  r2 : 40080009  r1 : 200f0113  r0 : = c249a6c0
  ..<snip> ..
  Call Trace:
    page_swap_info+0x10/0x2c
    swap_slot_free_notify+0x18/0x6c
    swap_readpage+0x90/0x11c
    read_swap_cache_async+0x134/0x1ac
    swapin_readahead+0x70/0xb0
    handle_pte_fault+0x320/0x6fc
    handle_mm_fault+0xc0/0xf0
    do_page_fault+0x11c/0x36c
    do_DataAbort+0x34/0x118

Fixes: 3f2b1a04f4 ("zram: revive swap_slot_free_notify")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Tested-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Minchan Kim
7bf52fb891 mm: vmscan: reclaim highmem zone if buffer_heads is over limit
We have been reclaimed highmem zone if buffer_heads is over limit but
commit 6b4f7799c6 ("mm: vmscan: invoke slab shrinkers from
shrink_zone()") changed the behavior so it doesn't reclaim highmem zone
although buffer_heads is over the limit.  This patch restores the logic.

Fixes: 6b4f7799c6 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Gerald Schaefer
28093f9f34 numa: fix /proc/<pid>/numa_maps for THP
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>	[4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Konstantin Khlebnikov
3486b85a29 mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.

This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.

Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes: 78f11a2557 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups")
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Kirill A. Shutemov
aa88b68c3b thp: keep huge zero page pinned until tlb flush
Andrea has found[1] a race condition on MMU-gather based TLB flush vs
split_huge_page() or shrinker which frees huge zero under us (patch 1/2
and 2/2 respectively).

With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
page pinned until flush is complete and the pin prevents the page from
being split under us.

We still need patch 2/2.  This is simplified version of Andrea's patch.
We don't need fancy encoding.

[1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Ingo Molnar
8efd755ac2 mm/mmu_context, sched/core: Fix mmu_context.h assumption
Some architectures (such as Alpha) rely on include/linux/sched.h definitions
in their mmu_context.h files.

So include sched.h before mmu_context.h.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28 11:44:19 +02:00
Tejun Heo
264a0ae164 memcg: relocate charge moving from ->attach to ->post_attach
Hello,

So, this ended up a lot simpler than I originally expected.  I tested
it lightly and it seems to work fine.  Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?

Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback which is called under both
cgroup_threadgroup_rwsem and thus can't create new kthreads.  This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.

There's no reason to perform charge moving from ->attach which is deep
in the task migration path.  Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.

* move_charge_struct->mm is added and ->can_attach is now responsible
  for pinning and recording the target mm.  mem_cgroup_clear_mc() is
  updated accordingly.  This also simplifies mem_cgroup_move_task().

* mem_cgroup_move_task() is now called from ->post_attach instead of
  ->attach.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes: 1ed1328792 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Cc: <stable@vger.kernel.org> # 4.4+
2016-04-25 15:45:14 -04:00
Linus Torvalds
2e57259913 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A few fixes for the current series. This contains:

   - Two fixes for NVMe:

     One fixes a reset race that can be triggered by repeated
     insert/removal of the module.

     The other fixes an issue on some platforms, where we get probe
     timeouts since legacy interrupts isn't working.  This used not to
     be a problem since we had the worker thread poll for completions,
     but since that was killed off, it means those poor souls can't
     successfully probe their NVMe device.  Use a proper IRQ check and
     probe (msi-x -> msi ->legacy), like most other drivers to work
     around this.  Both from Keith.

   - A loop corruption issue with offset in iters, from Ming Lei.

   - A fix for not having the partition stat per cpu ref count
     initialized before sending out the KOBJ_ADD, which could cause user
     space to access the counter prior to initialization.  Also from
     Ming Lei.

   - A fix for using the wrong congestion state, from Kaixu Xia"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: loop: fix filesystem corruption in case of aio/dio
  NVMe: Always use MSI/MSI-x interrupts
  NVMe: Fix reset/remove race
  writeback: fix the wrong congested state variable definition
  block: partition: initialize percpuref before sending out KOBJ_ADD
2016-04-15 15:44:10 -07:00
Linus Torvalds
a1f983174d Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull mm gup cleanup from Ingo Molnar:
 "This removes the ugly get-user-pages API hack, now that all upstream
  code has been migrated to it"

("ugly" is putting it mildly. But it worked.. - Linus)

* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs
2016-04-14 19:31:34 -07:00
Al Viro
b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Ingo Molnar
c12d2da56d mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs
The pkeys changes brought about a truly hideous set of macros in:

  cde70140fe ("mm/gup: Overload get_user_pages() functions")

... which macros are (ab-)using the fact that __VA_ARGS__ can be used
to shift parameter positions in macro arguments without breaking the
build and so can be used to call separate C functions depending on
the number of arguments of the macro.

This allowed easy migration of these 3 GUP APIs, as both these variants
worked at the C level:

  old:
	ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

  new:
	ret = get_user_pages(address, 1, 1, 0, &page, NULL);

... while we also generated a (functionally harmless but noticeable) build
time warning if the old API was used. As there are over 300 uses of these
APIs, this trick eased the migration of the API and avoided excessive
migration pain in linux-next.

Now, with its work done, get rid of all of that complication and ugliness:

    3 files changed, 16 insertions(+), 140 deletions(-)

... where the linecount of the migration hack was further inflated by the
fact that there are NOMMU variants of these GUP APIs as well.

Much of the conversion was done in linux-next over the past couple of months,
and Linus recently removed all remaining old API uses from the upstream tree
in the following upstrea commit:

  cb107161df ("Convert straggling drivers to new six-argument get_user_pages()")

There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
code path that ARM, ARM64 and PowerPC uses.

After this commit any old API usage will break the build.

[ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-07 10:46:14 +02:00
Linus Torvalds
4a2d057e4f Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
 "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
  ago with promise that one day it will be possible to implement page
  cache with bigger chunks than PAGE_SIZE.

  This promise never materialized.  And unlikely will.

  Let's stop pretending that pages in page cache are special.  They are
  not.

  The first patch with most changes has been done with coccinelle.  The
  second is manual fixups on top.

  The third patch removes macros definition"

[ I was planning to apply this just before rc2, but then I spaced out,
  so here it is right _after_ rc2 instead.

  As Kirill suggested as a possibility, I could have decided to only
  merge the first two patches, and leave the old interfaces for
  compatibility, but I'd rather get it all done and any out-of-tree
  modules and patches can trivially do the converstion while still also
  working with older kernels, so there is little reason to try to
  maintain the redundant legacy model.    - Linus ]

* PAGE_CACHE_SIZE-removal:
  mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
  mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
  mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04 10:50:24 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Neil Zhang
ec3b688250 mm/page_isolation.c: fix the function comments
Commit fea85cff11 ("mm/page_isolation.c: return last tested pfn rather
than failure indicator") changed the meaning of the return value.  Let's
change the function comments as well.

Signed-off-by: Neil Zhang <neilzhang1123@hotmail.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Michal Hocko
af8e15cc85 oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head
Commit bb29902a75 ("oom, oom_reaper: protect oom_reaper_list using
simpler way") has simplified the check for tasks already enqueued for
the oom reaper by checking tsk->oom_reaper_list != NULL.  This check is
not sufficient because the tsk might be the head of the queue without
any other tasks queued and then we would simply lockup looping on the
same task.  Fix the condition by checking for the head as well.

Fixes: bb29902a75 ("oom, oom_reaper: protect oom_reaper_list using simpler way")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Nadav Amit
858eaaa711 mm/rmap: batched invalidations should use existing api
The recently introduced batched invalidations mechanism uses its own
mechanism for shootdown.  However, it does wrong accounting of
interrupts (e.g., inc_irq_stat is called for local invalidations),
trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
may break some platforms as it bypasses the invalidation mechanisms of
Xen and SGI UV.

This patch reuses the existing TLB flushing mechnaisms instead.  We use
NULL as mm to indicate a global invalidation is required.

Fixes 72b252aed5 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Xishi Qiu
6f25a14a70 mm: fix invalid node in alloc_migrate_target()
It is incorrect to use next_node to find a target node, it will return
MAX_NUMNODES or invalid node.  This will lead to crash in buddy system
allocation.

Fixes: c8721bbbdd ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Laura Abbott" <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Alexander Potapenko
0b355eaaaa mm, kasan: fix compilation for CONFIG_SLAB
Add the missing argument to set_track().

Fixes: cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 17:03:37 -05:00
Kaixu Xia
c877ef8ae7 writeback: fix the wrong congested state variable definition
The right variable definition should be wb_congested_state that
include WB_async_congested and WB_sync_congested. So fix it.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-31 12:26:25 -06:00
Kirill A. Shutemov
0fda2788b0 thp: fix typo in khugepaged_scan_pmd()
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Nicolai Stange
e7080a439a mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If
 - generic_file_read_iter() gets called with a zero read length,
 - the read offset is at a page boundary,
 - IOCB_DIRECT is not set
-  and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.

Since roundup_pow_of_two(0) is undefined, UBSAN reports

  UBSAN: Undefined behaviour in include/linux/log2.h:63:13
  shift exponent 64 is too large for 64-bit type 'long unsigned int'
  CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
  [...]
  Call Trace:
   [...]
   [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
   [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
   [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
   [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
   [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
   [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
   [...]
   [<ffffffff81510b06>] __vfs_read+0x256/0x3d0
   [...]

when get_init_ra_size() gets called from ondemand_readahead().

The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since

  1UL << (sizeof(unsigned long) * 8)

evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.

What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().

According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.

Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Alexander Potapenko
cd11016e5f mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
will allow KASAN store allocation/deallocation stack traces for memory
chunks.  The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator.  Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.

[akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabinin@virtuozzo.com: comment style fixes]
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Alexander Potapenko
505f5dcb1c mm, kasan: add GFP flags to KASAN API
Add GFP flags to KASAN hooks for future patches to use.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00