In order to reduce fragmentation, this patch classifies freed pages in two
groups according to their probability of being part of a high order merge.
Pages belonging to a compound whose next-highest buddy is free are more
likely to be part of a high order merge in the near future, so they will
be added at the tail of the freelist. The remaining pages are put at the
front of the freelist.
In this way, the pages that are more likely to cause a big merge are kept
free longer. Consequently there is a tendency to aggregate the
long-living allocations on a subset of the compounds, reducing the
fragmentation.
This heuristic was tested on three machines, x86, x86-64 and ppc64 with
3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
and STREAM for performance and a high-order stress test for huge page
allocations.
KernBench X86
Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
User mean 649.53 ( 0.00%) 650.44 (-0.14%)
System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)
KernBench X86-64
Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)
KernBench PPC64
Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)
Nothing notable for kernbench.
NetPerf UDP X86
64 42.68 ( 0.00%) 42.77 ( 0.21%)
128 85.62 ( 0.00%) 85.32 (-0.35%)
256 170.01 ( 0.00%) 168.76 (-0.74%)
1024 655.68 ( 0.00%) 652.33 (-0.51%)
2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*
NetPerf UDP X86-64
64 148.82 ( 0.00%) 154.92 ( 3.94%)
128 298.96 ( 0.00%) 312.95 ( 4.47%)
256 583.67 ( 0.00%) 626.39 ( 6.82%)
1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
1.64% 2.73%
NetPerf UDP PPC64
64 49.98 ( 0.00%) 50.25 ( 0.54%)
128 98.66 ( 0.00%) 100.95 ( 2.27%)
256 197.33 ( 0.00%) 191.03 (-3.30%)
1024 761.98 ( 0.00%) 785.07 ( 2.94%)
2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)
The tests are run to have confidence limits within 1%. Results marked
with a * were not confident although in this case, it's only outside by
small amounts. Even with some results that were not confident, the
netperf UDP results were generally positive.
NetPerf TCP X86
64 652.25 ( 0.00%)* 648.12 (-0.64%)*
23.80% 22.82%
128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
21.03% 18.90%
256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
1.00% 16.46%
1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
13.37% 11.39%
2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
9.76% 12.48%
3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
6.49% 8.75%
4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
9.85% 8.50%
8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
9.13% 13.04%
16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
7.73% 8.68%
NETPERF TCP X86-64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
5.79% 4.78%
128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
3.68% 6.06%
256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
3.02% 2.10%
1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
5.89% 6.55%
2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
7.08% 7.44%
3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
6.87% 7.33%
4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
6.86% 8.18%
8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
7.49% 5.55%
16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
7.36% 6.49%
NETPERF TCP PPC64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 594.17 ( 0.00%) 596.04 ( 0.31%)*
1.00% 2.29%
128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
1.30% 1.40%
256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
1.25% 1.00%
1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
1.02% 1.00%
2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
1.15% 1.04%
3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
16384 9618.65 ( 0.00%) 9490.10 (-1.35%)
There was a distinct lack of confidence in the X86* figures so I included
what the devation was where the results were not confident. Many of the
results, whether gains or losses were within the standard deviation so no
solid conclusion can be reached on performance impact. Looking at the
figures, only the X86-64 ones look suspicious with a few losses that were
outside the noise. However, the results were so unstable that without
knowing why they vary so much, a solid conclusion cannot be reached.
SYSBENCH X86
sysbench-vanilla pgalloc-delay
1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
2 14901.11 ( 0.00%) 13683.44 (-8.90%)
3 15171.71 ( 0.00%) 14888.25 (-1.90%)
4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
6 14870.33 ( 0.00%) 14845.57 (-0.17%)
7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
SYSBENCH X86-64
1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
2 34276.39 ( 0.00%) 34251.00 (-0.07%)
3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
4 66667.10 ( 0.00%) 66174.69 (-0.74%)
5 66003.91 ( 0.00%) 65685.25 (-0.49%)
6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
7 64933.16 ( 0.00%) 64379.23 (-0.86%)
8 63353.30 ( 0.00%) 63281.22 (-0.11%)
9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
11 62092.81 ( 0.00%) 61787.75 (-0.49%)
12 61330.11 ( 0.00%) 61036.34 (-0.48%)
13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
14 62304.48 ( 0.00%) 62064.90 (-0.39%)
15 63296.48 ( 0.00%) 62875.16 (-0.67%)
16 63951.76 ( 0.00%) 63769.09 (-0.29%)
SYSBENCH PPC64
-sysbench-pgalloc-delay-sysbench
sysbench-vanilla pgalloc-delay
1 7645.08 ( 0.00%) 7467.43 (-2.38%)
2 14856.67 ( 0.00%) 14558.73 (-2.05%)
3 21952.31 ( 0.00%) 21683.64 (-1.24%)
4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
6 27477.10 ( 0.00%) 27337.45 (-0.51%)
7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
8 26642.91 ( 0.00%) 25274.33 (-5.41%)
9 25137.27 ( 0.00%) 24810.06 (-1.32%)
10 24451.99 ( 0.00%) 24275.85 (-0.73%)
11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
12 24234.81 ( 0.00%) 23640.89 (-2.51%)
13 24577.75 ( 0.00%) 24433.50 (-0.59%)
14 25640.19 ( 0.00%) 25116.52 (-2.08%)
15 26188.84 ( 0.00%) 26181.36 (-0.03%)
16 26782.37 ( 0.00%) 26255.99 (-2.00%)
Again, there is little to conclude here. While there are a few losses,
the results vary by +/- 8% in some cases. They are the results of most
concern as there are some large losses but it's also within the variance
typically seen between kernel releases.
The STREAM results varied so little and are so verbose that I didn't
include them here.
The final test stressed how many huge pages can be allocated. The
absolute number of huge pages allocated are the same with or without the
page. However, the "unusability free space index" which is a measure of
external fragmentation was slightly lower (lower is better) throughout the
lifetime of the system. I also measured the latency of how long it took
to successfully allocate a huge page. The latency was slightly lower and
on X86 and PPC64, more huge pages were allocated almost immediately from
the free lists. The improvement is slight but there.
[mel@csn.ul.ie: Tested, reworked for less branches]
[czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer.
This is regression of caused by commit 9ff473b9a7 ("vmscan: evict
streaming IO first"). Wow, It is 2 years old patch!
Currently, tmpfs file cache is inserted active list at first. This means
that the insertion doesn't only increase numbers of pages in anon LRU, but
it also reduces anon scanning ratio. Therefore, vmscan will get totally
confused. It scans almost only file LRU even though the system has plenty
unused tmpfs pages.
Historically, lru_cache_add_active_anon() was used for two reasons.
1) Intend to priotize shmem page rather than regular file cache.
2) Intend to avoid reclaim priority inversion of used once pages.
But we've lost both motivation because (1) Now we have separate anon and
file LRU list. then, to insert active list doesn't help such priotize.
(2) In past, one pte access bit will cause page activation. then to
insert inactive list with pte access bit mean higher priority than to
insert active list. Its priority inversion may lead to uninteded lru
chun. but it was already solved by commit 645747462 (vmscan: detect
mapped file pages used only once). (Thanks Hannes, you are great!)
Thus, now we can use lru_cache_add_anon() instead.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The alloc_slab_page() in SLUB uses alloc_pages() if node is '-1'. This means
that node validity check in alloc_pages_node is unnecessary and we can use
alloc_pages_exact_node() to avoid comparison and branch as commit
6484eb3e2a ("page allocator: do not check NUMA node ID when the caller
knows the node is valid") did for the page allocator.
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
commit 94b528d (kmemtrace: SLUB hooks for caller-tracking functions)
missed tracing kmalloc_large_node in __kmalloc_node_track_caller. We
should trace it same as __kmalloc_node.
Acked-by: David Rientjes <rientjes@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
I discovered that we can overflow stack if CONFIG_SLUB_DEBUG=y and use slabs
with many objects, since list_slab_objects() and process_slab() use
DECLARE_BITMAP(map, page->objects).
With 65535 bits, we use 8192 bytes of stack ...
Switch these allocations to dynamic allocations.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
- seems what ramfs_get_inode is only locally, make it static.
[AV: the hell it is; it's used by shmem, so shmem needed conversion too
and no, that function can't be made static]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.
The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The entries in xattr handler table should be immutable (ie const)
like other operation tables.
Later patches convert common filesystems. Uncoverted filesystems
will still work, but will generate a compiler warning.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (577 commits)
Staging: ramzswap: Handler for swap slot free callback
swap: Add swap slot free callback to block_device_operations
swap: Add flag to identify block swap devices
Staging: vt6655: use ETH_FRAME_LEN macro instead of custom one
Staging: vt6655: use ETH_DATA_LEN macro instead of custom one
Staging: vt6655: use ETH_FCS_LEN macro instead of custom one
Staging: vt6656: use ETH_HLEN macro instead of custom one
Staging: comedi: quatech_daqp_cs.c Replace eos semaphore with a completion.
Staging: dt3155v4l: remove private memory allocator
Staging: crystalhd: Remove typedefs from driver
Staging: winbond: Fix for pointer name format issue in mds.c
Staging: vt6656: removed custom UCHAR/USHORT/UINT/ULONG/ULONGLONG typedefs
Staging: vt6656: removed custom CHAR/SHORT/INT/LONG typedefs
Staging: comedi: Altered the way printk is used in 8255.c
staging: iio: adis16350 and similar IMU driver
Staging: iio: max1363 Fix two bugs in single_channel_from_ring
Staging: iio: adis16220 extract bin_attribute structures from state
Staging: iio: adis16220 vibration sensor driver
Staging: comedi: Kconfig dependancy fixes
Staging: comedi: fix up build error from last Kconfig changes
...
Conflicts:
drivers/staging/arlan/arlan-main.c
drivers/staging/comedi/drivers/cb_das16_cs.c
drivers/staging/cx25821/cx25821-alsa.c
drivers/staging/dt3155/dt3155_drv.c
drivers/staging/hv/hv.c
drivers/staging/netwave/netwave_cs.c
drivers/staging/wavelan/wavelan.c
drivers/staging/wavelan/wavelan_cs.c
drivers/staging/wlags49_h2/wl_cs.c
This required a bit of hand merging due to the conflicts
that happened in the later .34-rc releases, as well as
some staging driver changing coming in through other trees
(v4l and pcmcia).
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When CONFIG_BLOCK isn't enabled:
mm/page-writeback.c: In function 'laptop_mode_timer_fn':
mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
mm/page-writeback.c:709: error: dereferencing pointer to incomplete type
Fix this by essentially eliminating the laptop sync handlers when
CONFIG_BLOCK isn't set, as most are only used from the block layer code.
The exception is laptop_sync_completion() which is used from sys_sync(),
make that an empty declaration in that case.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit 69b62d01 fixed up most of the places where we would enter
busy schedule() spins when disabling the periodic background
writeback. This fixes up the sb timer so that it doesn't get
hammered on with the delay disabled, and ensures that it gets
rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
gets modified.
bdi_forker_task() also needs to check for !dirty_writeback_centisecs
and use schedule() appropriately, fix that up too.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
ia64: add sparse annotation to __ia64_per_cpu_var()
percpu: implement kernel memory based chunk allocation
percpu: move vmalloc based chunk management into percpu-vm.c
percpu: misc preparations for nommu support
percpu: reorganize chunk creation and destruction
percpu: factor out pcpu_addr_in_first/reserved_chunk() and update per_cpu_ptr_to_phys()
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
This callback is required when RAM based devices are used as swap disks.
One such device is ramzswap which is used as compressed in-memory swap
disk. For such devices, we need a callback as soon as a swap slot is no
longer used to allow freeing memory allocated for this slot. Without this
callback, stale data can quickly accumulate in memory defeating the whole
purpose of such devices.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Added SWP_BLKDEV flag to distinguish block and regular file backed
swap devices. We could also check if a swap is entire block device,
rather than a file, by:
S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
but, I think, simply checking this flag is more convenient.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
perf tools: Add mode to build without newt support
perf symbols: symbol inconsistency message should be done only at verbose=1
perf tui: Add explicit -lslang option
perf options: Type check all the remaining OPT_ variants
perf options: Type check OPT_BOOLEAN and fix the offenders
perf options: Check v type in OPT_U?INTEGER
perf options: Introduce OPT_UINTEGER
perf tui: Add workaround for slang < 2.1.4
perf record: Fix bug mismatch with -c option definition
perf options: Introduce OPT_U64
perf tui: Add help window to show key associations
perf tui: Make <- exit menus too
perf newt: Add single key shortcuts for zoom into DSO and threads
perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
perf newt: Fix the 'A'/'a' shortcut for annotate
perf newt: Make <- exit the ui_browser
x86, perf: P4 PMU - fix counters management logic
perf newt: Make <- zoom out filters
perf report: Report number of events, not samples
perf hist: Clarify events_stats fields usage
...
Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
writeback to kick off writeback of pending dirty inodes, then follow
that up with a WB_SYNC_ALL to wait for it. Since umount already holds
the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
since WB_SYNC_ALL writeback is a data integrity operation and thus
a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
it's a lot slower.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some callers (in memcontrol.c) calls css_is_ancestor() without
rcu_read_lock. Because css_is_ancestor() has to access RCU protected
data, it should be under rcu_read_lock().
This makes css_is_ancestor() itself does safe access to RCU protected
area. (At least, "root" can have refcnt==0 if it's not an ancestor of
"child". So, we need rcu_read_lock().)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ad4ba37537 ("memcg: css_id() must be
called under rcu_read_lock()") modifies memcontol.c for fixing RCU check
message. But Andrew Morton pointed out that the fix doesn't seems sane
and it was just for hidining lockdep messages.
This is a patch for do proper things. Checking again, all places,
accessing without rcu_read_lock, that commit fixies was intentional....
all callers of css_id() has reference count on it. So, it's not necessary
to be under rcu_read_lock().
Considering again, we can use rcu_dereference_check for css_id(). We know
css->id is valid if css->refcnt > 0. (css->id never changes and freed
after css->refcnt going to be 0.)
This patch makes use of rcu_dereference_check() in css_id/depth and remove
unnecessary rcu-read-lock added by the commit.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently page_address_in_vma() compares vma->anon_vma and
page_anon_vma(page) for parameter check, but in 2.6.34 a vma can have
multiple anon_vmas with anon_vma_chain, so current check does not work.
(For anonymous page shared by multiple processes, some verified (page,vma)
pairs return -EFAULT wrongly.)
We can go to checking all anon_vmas in the "same_vma" chain, but it needs
to meet lock requirement. Instead, we can remove anon_vma check safely
because page_address_in_vma() assumes that page and vma are already
checked to belong to the identical process.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ordinarily, application using hugetlbfs will create mappings with
reserves. For shared mappings, these pages are reserved before mmap()
returns success and for private mappings, the caller process is guaranteed
and a child process that cannot get the pages gets killed with sigbus.
An application that uses MAP_NORESERVE gets no reservations and mmap()
will always succeed at the risk the page will not be available at fault
time. This might be used for example on very large sparse mappings where
the developer is confident the necessary huge pages exist to satisfy all
faults even though the whole mapping cannot be backed by huge pages.
Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the
fault handler which proceeds to trigger the OOM-killer. This is
unhelpful.
Even without hugetlbfs mounted, a user using mmap() can trivially trigger
the OOM-killer because VM_FAULT_OOM is returned (will provide example
program if desired - it's a whopping 24 lines long). It could be
considered a DOS available to an unprivileged user.
This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE
where huge pages were not available with SIGBUS instead of triggering the
OOM killer.
This change affects hugetlb_cow() as well. I feel there is a failure case
in there, but I didn't create one. It would need a fairly specific target
in terms of the faulting application and the hugepage pool size. The
hugetlb_no_page() path is much easier to hit but both might as well be
closed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: create rcu_my_thread_group_empty() wrapper
memcg: css_id() must be called under rcu_read_lock()
cgroup: Check task_lock in task_subsys_state()
sched: Fix an RCU warning in print_task()
cgroup: Fix an RCU warning in alloc_css_id()
cgroup: Fix an RCU warning in cgroup_path()
KEYS: Fix an RCU warning in the reading of user keys
KEYS: Fix an RCU warning
Function init_kmem_cache_nodes is incorrect when checking upper limitation of
kmalloc_caches. The breakage was introduced by commit
91efd773c7 ("dma kmalloc handling fixes").
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(),
mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect
calls to css_id(). An additional RCU lockdep splat was reported for
memcg_oom_wake_function(), however, this function is not yet in
mainline as of 2.6.34-rc5.
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Implement an alternate percpu chunk management based on kernel memeory
for nommu SMP architectures. Instead of mapping into vmalloc area,
chunks are allocated as a contiguous kernel memory using
alloc_pages(). As such, percpu allocator on nommu will have the
following restrictions.
* It can't fill chunks on-demand page-by-page. It has to allocate
each chunk fully upfront.
* It can't support sparse chunk for NUMA configurations. SMP w/o mmu
is crazy enough. Let's hope no one does NUMA w/o mmu. :-P
* If chunk size isn't power-of-two multiple of PAGE_SIZE, the
unaligned amount will be wasted on each chunk. So, archs which use
this better align chunk size.
For instructions on how to use this, read the comment on top of
mm/percpu-km.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Graff Yang <graff.yang@gmail.com>
Cc: Sonic Zhang <sonic.adi@gmail.com>
Separate out and move chunk management (creation/desctruction and
[de]population) code into percpu-vm.c which is included by percpu.c
and compiled together. The interface for chunk management is defined
as follows.
* pcpu_populate_chunk - populate the specified range of a chunk
* pcpu_depopulate_chunk - depopulate the specified range of a chunk
* pcpu_create_chunk - create a new chunk
* pcpu_destroy_chunk - destroy a chunk, always preceded by full depop
* pcpu_addr_to_page - translate address to physical address
* pcpu_verify_alloc_info - check alloc_info is acceptable during init
Other than wrapping vmalloc_to_page() inside pcpu_addr_to_page() and
dummy pcpu_verify_alloc_info() implementation, this patch only moves
code around. This separation is to allow alternate chunk management
implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Graff Yang <graff.yang@gmail.com>
Cc: Sonic Zhang <sonic.adi@gmail.com>
Make the following misc preparations for percpu nommu support.
* Remove refernces to vmalloc in common comments as nommu percpu won't
use it.
* Rename chunk->vms to chunk->data and make it void *. Its use is
determined by chunk management implementation.
* Relocate utility functions and add __maybe_unused to functions which
might not be used by different chunk management implementations.
This patch doesn't cause any functional change. This is to allow
alternate chunk management implementation for percpu nommu support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Graff Yang <graff.yang@gmail.com>
Cc: Sonic Zhang <sonic.adi@gmail.com>
Reorganize alloc/free_pcpu_chunk() such that chunk struct alloc/free
live in pcpu_alloc/free_chunk() and the rest in
pcpu_create/destroy_chunk(). While at it, add missing error handling
for chunk->map allocation failure.
This is to allow alternate chunk management implementation for percpu
nommu support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Graff Yang <graff.yang@gmail.com>
Cc: Sonic Zhang <sonic.adi@gmail.com>
Factor out pcpu_addr_in_first/reserved_chunk() from
pcpu_chunk_addr_search() and use it to update per_cpu_ptr_to_phys()
such that it handles first chunk differently from the rest.
This patch doesn't cause any functional change and is to prepare for
percpu nommu support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Graff Yang <graff.yang@gmail.com>
Cc: Sonic Zhang <sonic.adi@gmail.com>
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
coda: move backing-dev.h kernel include inside __KERNEL__
mtd: ensure that bdi entries are properly initialized and registered
Move mtd_bdi_*mappable to mtdcore.c
btrfs: convert to using bdi_setup_and_register()
Catch filesystems lacking s_bdi
drbd: Terminate a connection early if sending the protocol fails
drbd: fix memory leak
Fix JFFS2 sync silent failure
smbfs: add bdi backing to mount session
ncpfs: add bdi backing to mount session
exofs: add bdi backing to mount session
ecryptfs: add bdi backing to mount session
coda: add bdi backing to mount session
cifs: add bdi backing to mount session
afs: add bdi backing to mount session.
9p: add bdi backing to mount session
bdi: add helper function for doing init and register of a bdi for a file system
block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
Check whether the VMA has a vm_ops before calling close, just
like we check vm_ops before calling open a few dozen lines
higher up in the function.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
noop_backing_dev_info is used only as a flag to mark filesystems that
don't have any backing store, like tmpfs, procfs, spufs, etc.
Signed-off-by: Joern Engel <joern@logfs.org>
Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
to the noop_backing_dev_info is not legal and will not result in
them being flushed, but we already catch this condition in
__mark_inode_dirty() when checking for a registered bdi.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The follow_page() function can potentially return -EFAULT so I added
checks for this.
Also I silenced an uninitialized variable warning on my version of gcc
(version 4.3.2).
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If find_mergeable_anon_vma() succeeds but another thread installs
->anon_vma before we take ptl, then allocated == NULL but avc should be
freed. Change the code to check avc != NULL to detect this case.
Also, a couple of whitespace changes to make the critical section more
visible.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a futex key happens to be located within a huge page mapped
MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a
page->mapping that will never exist.
See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details
about the problem.
This patch makes page->mapping a poisoned value that includes
PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to
continue but because of PAGE_MAPPING_ANON, the poisoned value is not
dereferenced or used by futex. No other part of the VM should be
dereferencing the page->mapping of a hugetlbfs page as its page cache is
not on the LRU.
This patch fixes the problem with the test case described in the bugzilla.
[akpm@linux-foundation.org: mel cant spel]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <darren@dvhart.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>