kernel-ark/mm
KOSAKI Motohiro b954185214 mm: size of quicklists shouldn't be proportional to the number of CPUs
Quicklists store pages for each CPU as caches.  (Each CPU can cache
node_free_pages/16 pages)

It is used for page table cache.  exit() will increase the cache size,
while fork() consumes it.

So for example if an apache-style application runs (one parent and many
child model), one CPU process will fork() while another CPU will process
the middleware work and exit().

At that time, the CPU on which the parent runs doesn't have page table
cache at all.  Others (on which children runs) have maximum caches.

	QList_max = (#ofCPUs - 1) x Free / 16
	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

So, How much quicklist memory is used in the maximum case?

This is proposional to # of CPUs because the limit of per cpu quicklist
cache doesn't see the number of cpus.

Above calculation mean

	 Number of CPUs per node            2    4    8   16
	 ==============================  ====================
	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%

Wow! Quicklist can spend about 50% memory at worst case.

My demonstration program is here
--------------------------------------------------------------------------------
#define _GNU_SOURCE

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BUFFSIZE 512

int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
{
  FILE *fd;
  char *ret, buffer[BUFFSIZE];
  int cpu = 1;

  fd = fopen("/proc/cpuinfo", "r");
  if (fd == NULL) {
    perror("fopen(/proc/cpuinfo)");
    exit(EXIT_FAILURE);
  }
  while (1) {
    ret = fgets(buffer, BUFFSIZE, fd);
    if (ret == NULL)
      break;
    if (!strncmp(buffer, "processor", 9))
      cpu = atoi(strchr(buffer, ':') + 2);
  }
  fclose(fd);
  return cpu;
}

void cpu_bind(int cpu)	/* bind current process to one cpu */
{
  cpu_set_t mask;
  int ret;

  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  ret = sched_setaffinity(0, sizeof(mask), &mask);
  if (ret == -1) {
    perror("sched_setaffinity()");
    exit(EXIT_FAILURE);
  }
  sched_yield();	/* not necessary */
}

#define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
#define FORK_INTERVAL 1	/* 1 second */

main(int argc, char *argv[])
{
  int cpu_max, nextcpu;
  long pagesize;
  pid_t pid;

  /* set max number of logical cpu */
  if (argc > 1)
    cpu_max = atoi(argv[1]) - 1;
  else
    cpu_max = max_cpu();

  /* get the page size */
  pagesize = sysconf(_SC_PAGESIZE);
  if (pagesize == -1) {
    perror("sysconf(_SC_PAGESIZE)");
    exit(EXIT_FAILURE);
  }

  /* prepare parent process */
  cpu_bind(0);
  nextcpu = cpu_max;

loop:

  /* select destination cpu for child process by round-robin rule */
  if (++nextcpu > cpu_max)
    nextcpu = 1;

  pid = fork();

  if (pid == 0) { /* child action */

    char *p;
    int i;

    /* consume page tables */
    p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    i = MMAP_SIZE / pagesize;
    while (i-- > 0) {
      *p = 1;
      p += pagesize;
    }

    /* move to other cpu */
    cpu_bind(nextcpu);
/*
    printf("a child moved to cpu%d after mmap().\n", nextcpu);
    fflush(stdout);
 */

    /* back page tables to pgtable_quicklist */
    exit(0);

  } else if (pid > 0) { /* parent action */

    sleep(FORK_INTERVAL);
    waitpid(pid, NULL, WNOHANG);

  }

  goto loop;
}
----------------------------------------

When above program which does task migration runs, my 8GB box spends
800MB of memory for quicklist.  This is not memory leak but doesn't seem
good.

% cat /proc/meminfo

MemTotal:        7701568 kB
MemFree:         4724672 kB
(snip)
Quicklists:       844800 kB

because

- My machine spec is
	number of numa node: 2
	number of cpus:      8 (4CPU x2 node)
        total mem:           8GB (4GB x2 node)
        free mem:            about 5GB

- Then, 4.7GB x 16% ~= 880MB.
  So, Quicklist can use 800MB.

So, if following spec machine run that program

   CPUs: 64 (8cpu x 8node)
   Mem:  1TB (128GB x8node)

Then, quicklist can waste 300GB (= 1TB x 30%).  It is too large.

So, I don't like cache policies which is proportional to # of cpus.

My patch changes the number of caches
from:
   per-cpu-cache-amount = memory_on_node / 16
to
   per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Tested-by: David Miller <davem@davemloft.net>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02 19:21:38 -07:00
..
allocpercpu.c mm/allocpercpu.c: make 4 functions static 2008-07-26 12:00:12 -07:00
backing-dev.c mm: bdi: fix race in bdi_class device creation 2008-05-20 13:31:53 -07:00
bootmem.c bootmem: fix aligning of node-relative indexes and offsets 2008-08-20 15:40:31 -07:00
bounce.c
dmapool.c dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too 2008-04-28 08:58:20 -07:00
fadvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap_xip.c mm: xip/ext2 fix block allocation race 2008-08-20 15:40:32 -07:00
filemap.c VFS: fix dio write returning EIO when try_to_release_page fails 2008-09-02 19:21:37 -07:00
fremap.c mmu-notifiers: core 2008-07-28 16:30:21 -07:00
highmem.c highmem: Export totalhigh_pages. 2008-07-19 22:39:46 -07:00
hugetlb.c allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 2008-08-12 16:07:28 -07:00
internal.h mm: export prep_compound_page to mm 2008-07-24 10:47:17 -07:00
Kconfig mm: Make generic weak get_user_pages_fast and EXPORT_GPL it 2008-08-12 17:52:53 +10:00
maccess.c kgdb: fix optional arch functions and probe_kernel_* 2008-04-17 20:05:39 +02:00
madvise.c madvise: update function comment of madvise_dontneed 2008-07-30 09:41:45 -07:00
Makefile mmu-notifiers: core 2008-07-28 16:30:21 -07:00
memcontrol.c memcg: fix oops in mem_cgroup_shrink_usage 2008-08-12 16:07:28 -07:00
memory_hotplug.c memory-hotplug: add sysfs removable attribute for hotplug memory remove 2008-07-24 10:47:21 -07:00
memory.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
mempolicy.c do_migrate_pages(): remove unused variable 2008-08-12 16:07:29 -07:00
mempool.c
migrate.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
mincore.c mm: remove nopage 2008-04-28 08:58:18 -07:00
mlock.c mlock() fix return values 2008-08-04 16:58:45 -07:00
mm_init.c mm: mminit_loglevel cannot be __meminitdata anymore 2008-08-20 15:40:30 -07:00
mmap.c Merge branch 'core/locking' into core/urgent 2008-08-12 00:11:49 +02:00
mmu_notifier.c mmu-notifiers: core 2008-07-28 16:30:21 -07:00
mmzone.c mm: filter based on a nodemask as well as a gfp_mask 2008-04-28 08:58:19 -07:00
mprotect.c mmu-notifiers: core 2008-07-28 16:30:21 -07:00
mremap.c mmu-notifiers: core 2008-07-28 16:30:21 -07:00
msync.c
nommu.c nommu: Provide vmalloc_exec(). 2008-08-04 16:01:47 +09:00
oom_kill.c security: Fix setting of PF_SUPERPRIV by __capable() 2008-08-14 22:59:43 +10:00
page_alloc.c mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data 2008-09-02 19:21:37 -07:00
page_io.c
page_isolation.c Remove '#include <stddef.h>' from mm/page_isolation.c 2008-09-02 09:29:01 +01:00
page-writeback.c mm: spinlock tree_lock 2008-07-26 12:00:06 -07:00
pagewalk.c pagemap: pass mm into pagewalkers 2008-06-12 18:05:41 -07:00
pdflush.c pdflush: use time_after() instead of open-coding it 2008-07-25 10:53:28 -07:00
prio_tree.c
quicklist.c mm: size of quicklists shouldn't be proportional to the number of CPUs 2008-09-02 19:21:38 -07:00
readahead.c mm: readahead scan lockless 2008-07-26 12:00:06 -07:00
rmap.c mm: dirty page tracking race fix 2008-08-20 15:40:32 -07:00
shmem_acl.c [PATCH] sanitize ->permission() prototype 2008-07-26 20:53:14 -04:00
shmem.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
slab.c mm: unexport ksize 2008-07-29 23:44:26 +03:00
slob.c mm: unexport ksize 2008-07-29 23:44:26 +03:00
slub.c slub: Disable NUMA remote node defragmentation by default 2008-08-20 21:50:21 +03:00
sparse-vmemmap.c Christoph has moved 2008-07-04 10:40:04 -07:00
sparse.c mm/sparse.c: removed duplicated include 2008-08-12 16:07:30 -07:00
swap_state.c mm: show free swap as signed 2008-08-20 15:40:30 -07:00
swap.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
swapfile.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
thrash.c
tiny-shmem.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2008-03-25 08:57:47 -07:00
truncate.c VFS: fix dio write returning EIO when try_to_release_page fails 2008-09-02 19:21:37 -07:00
util.c mm: Make generic weak get_user_pages_fast and EXPORT_GPL it 2008-08-12 17:52:53 +10:00
vmalloc.c Use WARN() in mm/vmalloc.c 2008-07-26 12:00:07 -07:00
vmscan.c mm: rename page trylock 2008-08-04 21:31:34 -07:00
vmstat.c [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo 2008-08-27 20:09:28 +01:00