kernel-ark/Documentation/filesystems
David Howells 201a15428b FS-Cache: Handle pages pending storage that get evicted under OOM conditions
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are still waiting to be written to the cache.
Under these conditions, vmscan calls the releasepage() function of the netfs,
asking if a page can be discarded.

The problem is typified by the following trace of a stuck process:

	kslowd005     D 0000000000000000     0  4253      2 0x00000080
	 ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
	 0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
	Call Trace:
	 [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
	 [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
	 [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
	 [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
	 [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
	 [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
	 [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
	 [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
	 [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
	 [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
	 [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
	 [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
	 [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
	 [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
	 [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
	 [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
	 [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
	 [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
	 [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
	 [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
	 [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
	 [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
	 [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
	 [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
	 [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory; in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     one it's trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function,
fscache_maybe_release_page(), is added that replaces what the netfs
releasepage() functions used to do with respect to the cache.
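
The cancellation rule can be illustrated with a small userspace model.  This
is only a sketch and not the kernel code: the enum, the struct and
model_maybe_release_page() are hypothetical names invented for illustration,
and it assumes that a page whose write is already in progress is simply
refused rather than waited for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum store_state {
	STORE_NONE,	/* page is not waiting to be written to the cache */
	STORE_QUEUED,	/* write is pending but has not been started */
	STORE_WRITING,	/* the cache is in the middle of writing the page */
};

struct model_page {
	enum store_state store;
};

/*
 * Model of the release decision: returns true if the page may be
 * discarded.  It never sleeps, so it cannot block the thread that is
 * trying to free memory.
 */
static bool model_maybe_release_page(struct model_page *page)
{
	assert(page != NULL);

	switch (page->store) {
	case STORE_NONE:
		/* nothing to do with the cache; let the page go */
		return true;
	case STORE_QUEUED:
		/* cancel the write; the data misses the cache this time */
		page->store = STORE_NONE;
		return true;
	case STORE_WRITING:
		/* assumed: refuse instead of waiting for the write */
		return false;
	}
	return false;
}
```

The point of the model is that no path waits: a queued write is traded away
for the ability to free the page at once.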

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
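
As an illustration of how the counters might be consumed, the following
userspace sketch parses one such line.  The exact layout
"VmScan : nos=N gon=N bsy=N can=N" is an assumption made for this example, as
are the struct and function names; check the real output of
/proc/fs/fscache/stats before relying on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical holder for the four VmScan counters. */
struct vmscan_stats {
	unsigned int nos;	/* pages that weren't pending storage */
	unsigned int gon;	/* pending at first look, gone once locked */
	unsigned int bsy;	/* ignored: actively being written */
	unsigned int can;	/* storage cancelled */
};

/*
 * Parse a line of the assumed form "VmScan : nos=N gon=N bsy=N can=N".
 * Returns true only if all four counters were read.
 */
static bool parse_vmscan_line(const char *line, struct vmscan_stats *st)
{
	assert(line != NULL && st != NULL);

	return sscanf(line, "VmScan : nos=%u gon=%u bsy=%u can=%u",
		      &st->nos, &st->gon, &st->bsy, &st->can) == 4;
}

/* Sum of the four counters, i.e. the number of decisions recorded. */
static unsigned int vmscan_total(const struct vmscan_stats *st)
{
	return st->nos + st->gon + st->bsy + st->can;
}
```

A caller would read /proc/fs/fscache/stats line by line and pass each line to
parse_vmscan_line(), which rejects every line that does not start with the
"VmScan" label.  A "can" count that keeps rising relative to the total
suggests that memory pressure is regularly costing the cache some data.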

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:35 +00:00
caching FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
configfs
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
9p.txt 9p: Update documentation to add fscache related bits 2009-09-23 13:03:46 -05:00
00-INDEX update Documentation/filesystems/00-INDEX with new nfsd related docs. 2009-04-28 12:54:45 -04:00
adfs.txt
affs.txt
afs.txt AFS: Documentation updates 2009-08-19 10:40:13 -07:00
autofs4-mount-control.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
automount-support.txt
befs.txt
bfs.txt
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
cifs.txt
coda.txt
cramfs.txt
debugfs.txt Document the debugfs API 2009-06-06 10:28:14 -06:00
dentry-locking.txt
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking
dlmfs.txt
dnotify.txt
ecryptfs.txt
exofs.txt exofs: Documentation 2009-03-31 19:44:38 +03:00
Exporting
ext2.txt Doc fix: ext2 can only have 32,000 subdirs, not 32,768 2009-06-18 13:03:44 -07:00
ext3.txt ext3: Update documentation about ext3 quota mount options 2009-10-13 00:06:43 +02:00
ext4.txt Revert "ext4: Remove journal_checksum mount option and enable it by default" 2009-11-02 10:15:27 -08:00
fiemap.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt
gfs2-glocks.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
gfs2-uevents.txt GFS2: Add a document explaining GFS2's uevents 2009-08-17 11:11:41 +01:00
gfs2.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
hfs.txt
hfsplus.txt
hpfs.txt
inotify.txt
isofs.txt isofs: let mode and dmode mount options override rock ridge mode setting 2009-06-18 13:03:45 -07:00
jfs.txt
knfsd-stats.txt Document /proc/fs/nfsd/pool_stats 2009-03-27 19:24:27 -04:00
Locking update Documentation/filesystems/Locking 2009-06-24 08:15:25 -04:00
locks.txt
mandatory-locking.txt
ncpfs.txt ncpfs: remove dead URL from documentation 2009-09-23 07:39:42 -07:00
nfs41-server.txt nfsd: revise 4.1 status documentation 2009-09-21 11:13:45 -04:00
nfs-rdma.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
nfs.txt NFS: Add a dns resolver for use with NFSv4 referrals and migration 2009-08-19 18:22:15 -04:00
nfsroot.txt trivial: fix typo "for for" in multiple files 2009-09-21 15:14:54 +02:00
nilfs2.txt nilfs2: modify list of unsupported features in caveats 2009-06-10 23:41:11 +09:00
ntfs.txt
ocfs2.txt ocfs2: add mount option and Kconfig option for acl 2009-01-05 08:36:52 -08:00
omfs.txt
porting
proc.txt ext4: Use tracepoints for mb_history trace file 2009-09-30 00:32:42 -04:00
quota.txt
ramfs-rootfs-initramfs.txt Trivial Documentation/filesystems/ramfs-rootfs-initramfs.txt fix 2008-11-30 11:40:56 -08:00
relay.txt
romfs.txt
rpc-cache.txt
seq_file.txt Doc: seq_file.txt fix wrong dd command example. 2009-09-10 14:33:35 -06:00
sharedsubtree.txt doc/filesystems: more mount cleanups 2009-09-24 07:20:57 -07:00
smbfs.txt
spufs.txt
squashfs.txt Squashfs: fix documentation typo, Cramfs filesystem limit is 256 MiB 2009-03-05 00:40:13 +00:00
sysfs-pci.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
sysfs.txt driver core: documentation: make it clear that sysfs is optional 2009-07-28 13:45:23 -07:00
sysv-fs.txt
tmpfs.txt hugh: update email address 2009-05-21 13:14:32 -07:00
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt
vfat.txt vfat: change the default from shortname=lower to shortname=mixed 2009-08-01 21:35:25 +09:00
vfs.txt HWPOISON: Define a new error_remove_page address space op for async truncation 2009-09-16 11:50:13 +02:00
xfs.txt [XFS] remove restricted chown parameter from xfs linux 2008-10-30 18:30:09 +11:00
xip.txt DOC: update xip method info 2008-11-12 17:17:17 -08:00