kernel-ark/Documentation/filesystems
David Rientjes a63d83f427 oom: badness heuristic rewrite
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead.  This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits.  This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
..
caching fscache: convert object to use workqueue instead of slow-work 2010-07-22 22:58:34 +02:00
configfs docsrc: build Documentation/ sources 2008-08-12 16:07:30 -07:00
nfs ipconfig: document DHCP hostname and DNS record 2010-06-03 03:18:18 -07:00
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
9p.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
00-INDEX ceph: some documentations fixes 2010-03-29 09:52:11 -07:00
adfs.txt
affs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
afs.txt AFS: Documentation updates 2009-08-19 10:40:13 -07:00
autofs4-mount-control.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
automount-support.txt
befs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
bfs.txt remove mention of CONFIG_KMOD from documentation 2008-07-22 19:24:29 +10:00
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
ceph.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
cifs.txt
coda.txt
cramfs.txt
debugfs.txt Document the debugfs API 2009-06-06 10:28:14 -06:00
dentry-locking.txt rcu: 1Q2010 update for RCU documentation 2010-01-16 10:25:22 +01:00
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking
dlmfs.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
dnotify_test.c Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
dnotify.txt Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
ecryptfs.txt
exofs.txt trivial: some small fixes in exofs documentation 2009-12-10 09:59:16 +02:00
ext2.txt Doc fix: ext2 can only have 32,000 subdirs, not 32,768 2009-06-18 13:03:44 -07:00
ext3.txt ext3: make barrier options consistent with ext4 2010-05-21 19:30:41 +02:00
ext4.txt ext4: Update documentation to correct the inode_readahead_blks option name 2009-12-24 17:51:42 -05:00
fiemap.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
gfs2-glocks.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
gfs2-uevents.txt GFS2: Add a document explaining GFS2's uevents 2009-08-17 11:11:41 +01:00
gfs2.txt GFS2: docs update 2010-03-29 14:28:52 +01:00
hfs.txt
hfsplus.txt
hpfs.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
inotify.txt
isofs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
jfs.txt
Locking drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
locks.txt
logfs.txt fix "seperate" typos in comments 2010-05-10 11:56:30 +02:00
Makefile Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
mandatory-locking.txt
ncpfs.txt ncpfs: remove dead URL from documentation 2009-09-23 07:39:42 -07:00
nilfs2.txt nilfs2: add nodiscard mount option 2010-07-23 10:02:12 +09:00
ntfs.txt NTFS: update homepage 2008-09-02 19:21:37 -07:00
ocfs2.txt ocfs2: Add dir_resv_level mount option 2010-05-05 18:18:07 -07:00
omfs.txt omfs: add filesystem documentation 2008-07-26 12:00:05 -07:00
porting nfs: new subdir Documentation/filesystems/nfs 2009-10-27 19:34:04 -04:00
proc.txt oom: badness heuristic rewrite 2010-08-09 20:45:02 -07:00
quota.txt quota: documentation for sending "below quota" messages via netlink and tiny doc update 2008-08-12 16:07:27 -07:00
ramfs-rootfs-initramfs.txt Trivial Documentation/filesystems/ramfs-rootfs-initramfs.txt fix 2008-11-30 11:40:56 -08:00
relay.txt relay: add buffer-only channels; useful for early logging 2008-07-26 12:00:04 -07:00
romfs.txt
seq_file.txt seq_file: use proc_create() in documentation 2009-12-16 07:20:07 -08:00
sharedsubtree.txt add several pieces to shared subtree documentation 2010-03-03 13:00:23 -05:00
smbfs.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
spufs.txt
squashfs.txt squashfs: update documentation to include description of xattr layout 2010-05-26 01:12:26 +01:00
sysfs-pci.txt PCI: Allow read/write access to sysfs I/O port resources 2010-07-30 09:32:08 -07:00
sysfs-tagging.txt sysfs-namespaces: add a high-level Documentation file 2010-05-21 09:37:31 -07:00
sysfs.txt sysfs: Fix one more signature discrepancy between sysfs implementation and docs. 2010-08-05 13:53:34 -07:00
sysv-fs.txt
tmpfs.txt mempolicy: document cpuset interaction with tmpfs mpol mount option 2010-05-25 08:06:57 -07:00
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt
vfat.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
vfs.txt fs: introduce new truncate sequence 2010-05-27 22:15:33 -04:00
xfs-delayed-logging-design.txt xfs: remove done roadmap item from xfs-delayed-logging-design.txt 2010-06-03 16:22:29 +10:00
xfs.txt xfs: remove obsolete osyncisosync mount option 2010-07-26 13:16:51 -05:00
xip.txt DOC: update xip method info 2008-11-12 17:17:17 -08:00