e11da5b4fd
There have been numerous reports of stalls that pointed at the problem being somewhere in the VM. There are multiple roots to the problems which means dealing with any of the root problems in isolation is tricky to justify on their own and they would still need integration testing. This patch series puts together two different patch sets which in combination should tackle some of the root causes of latency problems being reported. Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the most important results is being able to calculate the scanning/reclaim ratio as a measure of the amount of work being done by page reclaim. Patch 2 accounts for time spent in congestion_wait. Patches 3-6 were originally developed by Kosaki Motohiro but reworked for this series. It has been noted that lumpy reclaim is far too aggressive and trashes the system somewhat. As SLUB uses high-order allocations, a large cost incurred by lumpy reclaim will be noticeable. It was also reported during transparent hugepage support testing that lumpy reclaim was trashing the system and these patches should mitigate that problem without disabling lumpy reclaim. Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait(). wait_iff_congested() only sleeps if there is a BDI that is currently congested. Patch 8 notes that any BDI being congested is not necessarily a problem because there could be multiple BDIs of varying speeds and numberous zones. It attempts to track when a zone being reclaimed contains many pages backed by a congested BDI and if so, reclaimers wait on the congestion queue. I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each machine had 3G of RAM and the CPUs were X86: Intel P4 2-core X86-64: AMD Phenom 4-core PPC64: PPC970MP Each used a single disk and the onboard IO controller. Dirty ratio was left at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to keep this report short. Four kernels were tested each based on v2.6.36-rc4 traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better waitcongest-v2r3: Patches 1-7 to only wait on congestion waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO The tests run were as follows kernbench compile-based benchmark. Smoke test performance sysbench OLTP read-only benchmark. Will be re-run in the future as read-write micro-mapped-file-stream This is a micro-benchmark from Johannes Weiner that accesses a large sparse-file through mmap(). It was configured to run in only single-CPU mode but can be indicative of how well page reclaim identifies suitable pages. stress-highalloc Tries to allocate huge pages under heavy load. kernbench, iozone and sysbench did not report any performance regression on any machine. sysbench did pressure the system lightly and there was reclaim activity but there were no difference of major interest between the kernels. X86-64 micro-mapped-file-stream traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4 pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%) pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%) pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%) pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%) pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%) pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%) pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%) pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%) pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%) allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%) These are based on the raw figures taken from /proc/vmstat. It's a rough measure of reclaim activity. Note that allocstall counts are higher because we are entering direct reclaim more often as a result of not sleeping in congestion. In itself, it's not necessarily a bad thing. It's easier to get a view of what happened from the vmscan tracepoint report. FTrace Reclaim Statistics: vmscan traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4 Direct reclaims 443 273 513 1568 Direct reclaim pages scanned 305968 280402 600825 957933 Direct reclaim pages reclaimed 43503 19005 30327 117191 Direct reclaim write file async I/O 0 0 0 0 Direct reclaim write anon async I/O 0 3 4 12 Direct reclaim write file sync I/O 0 0 0 0 Direct reclaim write anon sync I/O 0 0 0 0 Wake kswapd requests 187649 132338 191695 267701 Kswapd wakeups 3 1 4 1 Kswapd pages scanned 4599269 4454162 4296815 3891906 Kswapd pages reclaimed |
||
---|---|---|
.. | ||
ABI | ||
accounting | ||
acpi | ||
aoe | ||
arm | ||
auxdisplay | ||
blackfin | ||
block | ||
blockdev | ||
cdrom | ||
cgroups | ||
connector | ||
console | ||
cpu-freq | ||
cpuidle | ||
cris | ||
crypto | ||
development-process | ||
device-mapper | ||
DocBook | ||
driver-model | ||
dvb | ||
early-userspace | ||
fault-injection | ||
fb | ||
filesystems | ||
firmware_class | ||
frv | ||
hwmon | ||
i2c | ||
i2o | ||
ia64 | ||
ide | ||
infiniband | ||
input | ||
ioctl | ||
isdn | ||
ja_JP | ||
kbuild | ||
kdump | ||
ko_KR | ||
kvm | ||
laptops | ||
lguest | ||
m68k | ||
make | ||
mips | ||
misc-devices | ||
mmc | ||
mn10300 | ||
mtd | ||
namespaces | ||
netlabel | ||
networking | ||
parisc | ||
PCI | ||
pcmcia | ||
power | ||
powerpc | ||
pps | ||
prctl | ||
RCU | ||
s390 | ||
scheduler | ||
scsi | ||
serial | ||
sh | ||
sound | ||
sparc | ||
spi | ||
sysctl | ||
telephony | ||
thermal | ||
timers | ||
trace | ||
uml | ||
usb | ||
video4linux | ||
vm | ||
w1 | ||
watchdog | ||
wimax | ||
x86 | ||
zh_CN | ||
.gitignore | ||
00-INDEX | ||
apparmor.txt | ||
applying-patches.txt | ||
atomic_ops.txt | ||
bad_memory.txt | ||
basic_profiling.txt | ||
binfmt_misc.txt | ||
braille-console.txt | ||
bt8xxgpio.txt | ||
btmrvl.txt | ||
BUG-HUNTING | ||
bus-virt-phys-mapping.txt | ||
cachetlb.txt | ||
Changes | ||
circular-buffers.txt | ||
coccinelle.txt | ||
CodingStyle | ||
cpu-hotplug.txt | ||
cpu-load.txt | ||
cputopology.txt | ||
credentials.txt | ||
dcdbas.txt | ||
debugging-modules.txt | ||
debugging-via-ohci1394.txt | ||
dell_rbu.txt | ||
devices.txt | ||
DMA-API-HOWTO.txt | ||
DMA-API.txt | ||
DMA-attributes.txt | ||
DMA-ISA-LPC.txt | ||
dmaengine.txt | ||
dontdiff | ||
dynamic-debug-howto.txt | ||
edac.txt | ||
eisa.txt | ||
email-clients.txt | ||
feature-removal-schedule.txt | ||
flexible-arrays.txt | ||
futex-requeue-pi.txt | ||
gcov.txt | ||
gpio.txt | ||
highuid.txt | ||
HOWTO | ||
hw_random.txt | ||
init.txt | ||
initrd.txt | ||
intel_txt.txt | ||
Intel-IOMMU.txt | ||
io_ordering.txt | ||
io-mapping.txt | ||
iostats.txt | ||
IPMI.txt | ||
IRQ-affinity.txt | ||
IRQ.txt | ||
irqflags-tracing.txt | ||
isapnp.txt | ||
java.txt | ||
kernel-doc-nano-HOWTO.txt | ||
kernel-docs.txt | ||
kernel-parameters.txt | ||
keys-request-key.txt | ||
keys.txt | ||
kmemcheck.txt | ||
kmemleak.txt | ||
kobject.txt | ||
kprobes.txt | ||
kref.txt | ||
ldm.txt | ||
leds-class.txt | ||
leds-lp3944.txt | ||
local_ops.txt | ||
lockdep-design.txt | ||
lockstat.txt | ||
logo.gif | ||
logo.txt | ||
magic-number.txt | ||
Makefile | ||
ManagementStyle | ||
mca.txt | ||
md.txt | ||
memory-barriers.txt | ||
memory-hotplug.txt | ||
memory.txt | ||
mono.txt | ||
mutex-design.txt | ||
nmi_watchdog.txt | ||
nommu-mmap.txt | ||
numastat.txt | ||
oops-tracing.txt | ||
padata.txt | ||
parport-lowlevel.txt | ||
parport.txt | ||
pi-futex.txt | ||
pnp.txt | ||
preempt-locking.txt | ||
printk-formats.txt | ||
prio_tree.txt | ||
rbtree.txt | ||
rfkill.txt | ||
robust-futex-ABI.txt | ||
robust-futexes.txt | ||
rt-mutex-design.txt | ||
rt-mutex.txt | ||
rtc.txt | ||
SAK.txt | ||
SecurityBugs | ||
SELinux.txt | ||
serial-console.txt | ||
sgi-ioc4.txt | ||
sgi-visws.txt | ||
SM501.txt | ||
Smack.txt | ||
sparse.txt | ||
spinlocks.txt | ||
stable_api_nonsense.txt | ||
stable_kernel_rules.txt | ||
SubmitChecklist | ||
SubmittingDrivers | ||
SubmittingPatches | ||
svga.txt | ||
sysfs-rules.txt | ||
sysrq.txt | ||
tomoyo.txt | ||
unaligned-memory-access.txt | ||
unicode.txt | ||
unshare.txt | ||
VGA-softcursor.txt | ||
vgaarbiter.txt | ||
video-output.txt | ||
volatile-considered-harmful.txt | ||
workqueue.txt | ||
zorro.txt |