kernel-ark/fs
Nick Piggin 557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
..
9p 9PFS: clean up explicit check for mandatory locks 2007-10-09 18:32:46 -04:00
adfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
affs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
afs Merge branch 'locks' of git://linux-nfs.org/~bfields/linux 2007-10-15 16:07:40 -07:00
autofs Replace pid_t in autofs with struct pid reference 2007-05-11 08:29:36 -07:00
autofs4 autofs4: deadlock during create 2007-08-22 19:52:46 -07:00
befs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
bfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
cifs [CIFS] Check return code on failed alloc 2007-08-18 00:15:20 +00:00
coda coda: remove CODA_STORE/CODA_RELEASE upcalls 2007-07-21 17:49:14 -07:00
configfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
cramfs mm: make read_cache_page synchronous 2007-05-07 12:12:51 -07:00
debugfs docbook: fix filesystems content 2007-10-15 17:56:36 -07:00
devpts devpts: add fsnotify create event 2007-05-08 11:14:59 -07:00
dlm Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6 2007-10-12 15:49:37 -07:00
ecryptfs [NET]: make netlink user -> kernel interface synchronious 2007-10-10 21:15:29 -07:00
efs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
exportfs knfsd: exportfs: split out reconnecting a dentry from find_exported_dentry 2007-07-17 10:23:06 -07:00
ext2 fix inode_table test in ext234_check_descriptors 2007-07-26 11:35:17 -07:00
ext3 readahead: combine file_ra_state.prev_index/prev_offset into prev_pos 2007-10-16 09:42:52 -07:00
ext4 readahead: combine file_ra_state.prev_index/prev_offset into prev_pos 2007-10-16 09:42:52 -07:00
fat mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
freevxfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
fuse mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
gfs2 Merge branch 'locks' of git://linux-nfs.org/~bfields/linux 2007-10-15 16:07:40 -07:00
hfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
hfsplus mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
hostfs sendfile: remove .sendfile from filesystems that use generic_file_sendfile() 2007-07-10 08:04:13 +02:00
hpfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
hppfs
hugetlbfs hugepage: fix broken check for offset alignment in hugepage mappings 2007-08-31 01:42:23 -07:00
isofs isofs: mounting to regular file may succeed 2007-07-31 15:39:41 -07:00
jbd lockdep: annotate journal_start() 2007-10-11 22:11:12 +02:00
jbd2 mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
jffs2 Merge Linux 2.6.23 2007-10-13 14:43:54 +01:00
jfs more low-hanging fruits - kernel, fs, lib signedness 2007-10-14 12:41:52 -07:00
lockd NFS/SUNRPC: use transport protocol naming 2007-10-09 17:17:53 -04:00
minix mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
msdos
ncpfs NCP: delete test of long-deceased CONFIG_NCPFS_DEBUGDENTRY 2007-07-31 15:39:41 -07:00
nfs Merge branch 'locks' of git://linux-nfs.org/~bfields/linux 2007-10-15 16:07:40 -07:00
nfs_common
nfsd Merge branch 'locks' of git://linux-nfs.org/~bfields/linux 2007-10-15 16:07:40 -07:00
nls NLS: Remove obsolete Makefile entries 2007-07-16 09:05:52 -07:00
ntfs NTFS: Fix a mount time deadlock. 2007-10-12 09:16:30 -07:00
ocfs2 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6 2007-10-12 15:49:37 -07:00
openpromfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
partitions fs/partitions/sun.c endianness annotations 2007-10-14 12:41:51 -07:00
proc Merge branch 'locks' of git://linux-nfs.org/~bfields/linux 2007-10-15 16:07:40 -07:00
qnx4 mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
ramfs NOMMU: Fix SYSV IPC SHM 2007-07-31 15:39:36 -07:00
reiserfs quota: fix infinite loop 2007-09-11 17:21:19 -07:00
romfs mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
smbfs more low-hanging fruits - kernel, fs, lib signedness 2007-10-14 12:41:52 -07:00
sysfs sysfs: add copyrights 2007-10-12 14:51:12 -07:00
sysv mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
udf Fix possible NULL pointer dereference in udf_table_free_blocks() 2007-08-31 01:42:22 -07:00
ufs ufs: fix sun state 2007-09-25 08:51:04 -07:00
vfat
xfs Fix up more bio fallout 2007-10-12 00:29:50 -07:00
aio.c AIO: fix cleanup in io_submit_one(...) 2007-10-08 12:58:14 -07:00
anon_inodes.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm 2007-07-17 11:50:26 -07:00
attr.c Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check 2007-07-17 12:00:03 -07:00
bad_inode.c sendfile: remove bad_sendfile() from bad_file_ops 2007-07-10 08:04:15 +02:00
binfmt_aout.c
binfmt_elf_fdpic.c remove ZERO_PAGE 2007-10-16 09:42:53 -07:00
binfmt_elf.c remove ZERO_PAGE 2007-10-16 09:42:53 -07:00
binfmt_em86.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
binfmt_flat.c binfmt_flat: checkpatch fixing minimum support for the blackfin relocations 2007-10-03 23:43:57 +08:00
binfmt_misc.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
binfmt_script.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
binfmt_som.c
bio.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
block_dev.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
buffer.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
char_dev.c unregister_chrdev() return void 2007-07-19 10:04:43 -07:00
compat_ioctl.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2007-10-11 19:40:14 -07:00
compat.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
dcache.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
dcookies.c Remove fs.h from mm.h 2007-07-29 17:09:29 -07:00
direct-io.c remove ZERO_PAGE 2007-10-16 09:42:53 -07:00
dnotify.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
dquot.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
drop_caches.c invalidate_mapping_pages(): add cond_resched 2007-07-16 09:05:36 -07:00
eventfd.c eventfd use waitqueue lock ... 2007-05-18 13:09:34 -07:00
eventpoll.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
exec.c signalfd simplification 2007-09-20 13:19:59 -07:00
fcntl.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
fifo.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
file_table.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
file.c
filesystems.c add filesystem subtype support 2007-05-08 11:15:01 -07:00
fs-writeback.c Fix warnings with !CONFIG_BLOCK 2007-10-10 09:25:57 +02:00
generic_acl.c Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check 2007-07-17 12:00:03 -07:00
inode.c lockdep: per filesystem inode lock class 2007-10-15 14:51:31 +02:00
inotify_user.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
inotify.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
internal.h cleanup compat ioctl handling 2007-05-08 11:15:09 -07:00
ioctl.c drop obsolete sys_ioctl export 2007-07-16 09:05:48 -07:00
ioprio.c
Kconfig Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 2007-10-15 10:47:35 -07:00
Kconfig.binfmt fs: Kill sh dependency for binfmt_flat. 2007-05-21 14:34:00 +09:00
libfs.c fs/libfs.c: >80 columns line break fix 2007-05-09 06:44:57 +02:00
locks.c Rework /proc/locks via seq_files and seq_list helpers 2007-10-09 18:32:46 -04:00
Makefile signal/timer/event: eventfd core 2007-05-11 08:29:36 -07:00
mbcache.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
mpage.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
namei.c fs: remove path_walk export 2007-07-19 10:04:45 -07:00
namespace.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
nfsctl.c nfsctl: use vfs_path_lookup 2007-07-19 10:04:45 -07:00
no-block.c
open.c VFS: fix a race in lease-breaking during truncate 2007-07-31 15:39:42 -07:00
pipe.c sched: affine sync wakeups 2007-10-15 17:00:19 +02:00
pnode.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
pnode.h
posix_acl.c
quota_v1.c
quota_v2.c
quota.c [IA64] Fix build failure in fs/quota.c 2007-07-27 15:40:13 -07:00
read_write.c Cleanup macros for distinguishing mandatory locks 2007-10-09 18:32:46 -04:00
read_write.h
readdir.c ROUND_UP macro cleanup in fs/(select|compat|readdir).c 2007-05-08 11:15:09 -07:00
select.c Fix select on /proc files without ->poll 2007-09-11 17:21:20 -07:00
seq_file.c [FS] seq_file: Introduce the seq_open_private() 2007-10-10 16:55:33 -07:00
signalfd.c signalfd simplification 2007-09-20 13:19:59 -07:00
splice.c readahead: combine file_ra_state.prev_index/prev_offset into prev_pos 2007-10-16 09:42:52 -07:00
stack.c
stat.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
super.c hugetlbfs: handle empty options string 2007-07-16 09:05:46 -07:00
sync.c Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARM 2007-06-28 11:38:30 -07:00
timerfd.c make timerfd return a u64 and fix the __put_user 2007-07-26 11:35:17 -07:00
utimes.c Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check 2007-07-17 12:00:03 -07:00
xattr_acl.c
xattr.c Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check 2007-07-17 12:00:03 -07:00