Commit Graph

10307 Commits

Author SHA1 Message Date
Christoph Lameter
a10aa57987 vmalloc: show vmalloced areas via /proc/vmallocinfo
Implement a new proc file that allows the display of the currently allocated
vmalloc memory.

It allows to see the users of vmalloc.  That is important if vmalloc space is
scarce (i386 for example).

And it's going to be important for the compound page fallback to vmalloc.
Many of the current users can be switched to use compound pages with fallback.
 This means that the number of users of vmalloc is reduced and page tables no
longer necessary to access the memory.  /proc/vmallocinfo allows to review how
that reduction occurs.

If memory becomes fragmented and larger order allocations are no longer
possible then /proc/vmallocinfo allows to see which compound page allocations
fell back to virtual compound pages.  That is important for new users of
virtual compound pages.  Such as order 1 stack allocation etc that may
fallback to virtual compound pages in the future.

/proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
information leakage.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: CONFIG_MMU=n build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:21 -07:00
Andrew Morton
b454456841 mm: make early_pfn_to_nid() a C function
Fix this (sparc64)

mm/sparse-vmemmap.c: In function `vmemmap_verify':
mm/sparse-vmemmap.c:64: warning: unused variable `pfn'

by switching to a C function which touches its arg.

(reason 3,555 why macros are bad)

Also, the `nid' arg was misnamed.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
Miklos Szeredi
ac6aadb24b mm: rotate_reclaimable_page() cleanup
Clean up messy conditional calling of test_clear_page_writeback() from both
rotate_reclaimable_page() and end_page_writeback().

The only user of rotate_reclaimable_page() is end_page_writeback() so this is
OK.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
Andi Kleen
7edf85aa3c mm: save some bytes in mm_struct by filling holes on 64bit
Save some bytes in mm_struct by filling holes

Putting int values together for better packing on 64bit shrinks sizeof(struct
mm_struct) from 776 bytes to 764 bytes.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes
3842b46de6 mempolicy: small header file cleanup
Removes forward definition of vm_area_struct in linux/mempolicy.h.  We already
get it from the linux/slab.h -> linux/gfp.h include.

Removes the unused mpol_set_vma_default() macro from linux/mempolicy.h.

Removes the extern definition of default_policy since it is only referenced,
as it should be, in mm/mempolicy.c.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes
4c50bc0116 mempolicy: add MPOL_F_RELATIVE_NODES flag
Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
nodemasks passed via set_mempolicy() or mbind() should be considered relative
to the current task's mems_allowed.

When the mempolicy is created, the passed nodemask is folded and mapped onto
the current task's mems_allowed.  For example, consider a task using
set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
nodemask of 1-3.  If current's mems_allowed is 4-7, the effected nodemask is
5-7 (the second, third, and fourth node of mems_allowed).

If the same task is attached to a cpuset, the mempolicy nodemask is rebound
each time the mems are changed.  Some possible rebinds and results are:

	mems			result
	1-3			1-3
	1-7			2-4
	1,5-6			1,5-6
	1,5-7			5-7

Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
to the resultant nodemask from the relative remap.

In the MPOL_PREFERRED case, the preferred node is remapped from the currently
effected nodemask to the relative nodemask.

This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Paul Jackson
7ea931c9fc mempolicy: add bitmap_onto() and bitmap_fold() operations
The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(),
with the usual cpumask and nodemask wrappers.

The bitmap_onto() operator computes one bitmap relative to another.  If the
n-th bit in the origin mask is set, then the m-th bit of the destination mask
will be set, where m is the position of the n-th set bit in the relative mask.

The bitmap_fold() operator folds a bitmap into a second that has bit m set iff
the input bitmap has some bit n set, where m == n mod sz, for the specified sz
value.

There are two substantive changes between this patch and its
predecessor bitmap_relative:
 1) Renamed bitmap_relative() to be bitmap_onto().
 2) Added bitmap_fold().

The essential motivation for bitmap_onto() is to provide a mechanism for
converting a cpuset-relative CPU or Node mask to an absolute mask.  Cpuset
relative masks are written as if the current task were in a cpuset whose CPUs
or Nodes were just the consecutive ones numbered 0..N-1, for some N.  The
bitmap_onto() operator is provided in anticipation of adding support for the
first such cpuset relative mask, by the mbind() and set_mempolicy() system
calls, using a planned flag of MPOL_F_RELATIVE_NODES.  These bitmap operators
(and their nodemask wrappers, in particular) will be used in code that
converts the user specified cpuset relative memory policy to a specific system
node numbered policy, given the current mems_allowed of the tasks cpuset.

Such cpuset relative mempolicies will address two deficiencies
of the existing interface between cpusets and mempolicies:
 1) A task cannot at present reliably establish a cpuset
    relative mempolicy because there is an essential race
    condition, in that the tasks cpuset may be changed in
    between the time the task can query its cpuset placement,
    and the time the task can issue the applicable mbind or
    set_memplicy system call.
 2) A task cannot at present establish what cpuset relative
    mempolicy it would like to have, if it is in a smaller
    cpuset than it might have mempolicy preferences for,
    because the existing interface only allows specifying
    mempolicies for nodes currently allowed by the cpuset.

Cpuset relative mempolicies are useful for tasks that don't distinguish
particularly between one CPU or Node and another, but only between how many of
each are allowed, and the proper placement of threads and memory pages on the
various CPUs and Nodes available.

The motivation for the added bitmap_fold() can be seen in the following
example.

Let's say an application has specified some mempolicies that presume 16 memory
nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset
relative) nodes 12-15.  Then lets say that application is crammed into a
cpuset that only has 8 memory nodes, 0-7.  If one just uses bitmap_onto(),
this mempolicy, mapped to that cpuset, would ignore the requested relative
nodes above 7, leaving it empty of nodes.  That's not good; better to fold the
higher nodes down, so that some nodes are included in the resulting mapped
mempolicy.  In this case, the mempolicy nodes 12-15 are taken modulo 8 (the
weight of the mems_allowed of the confining cpuset), resulting in a mempolicy
specifying nodes 4-7.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: <kosaki.motohiro@jp.fujitsu.com>
Cc: <ray-lk@madrabbit.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes
f5b087b52f mempolicy: add MPOL_F_STATIC_NODES flag
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
node remap when the policy is rebound.

Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
a union with cpuset_mems_allowed:

	struct mempolicy {
		...
		union {
			nodemask_t cpuset_mems_allowed;
			nodemask_t user_nodemask;
		} w;
	}

that stores the the nodemask that the user passed when he or she created the
mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
which is passed with any mempolicy mode, the user's passed nodemask
intersected with the VMA or task's allowed nodes is always used when
determining the preferred node, setting the MPOL_BIND zonelist, or creating
the interleave nodemask.  This happens whenever the policy is rebound,
including when a task's cpuset assignment changes or the cpuset's mems are
changed.

This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and uneffected until it has access to the node(s) that
it desires.  For example, if you currently ask for an interleaved policy over
a set of nodes that you do not have access to, the mempolicy is not created
and the task continues to use the previous policy.  With this change, however,
it is possible to create the same mempolicy; it is only effected when access
to nodes in the nodemask is acquired.

It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask.  To do this, simply add "=static" immediately
following the mempolicy mode at mount time:

	mount -o remount mpol=interleave=static:1-3

Also removes mpol_check_policy() and folds its logic into mpol_new() since it
is now obsoleted.  The unused vma_mpol_equal() is also removed.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes
028fec414d mempolicy: support optional mode flags
With the evolution of mempolicies, it is necessary to support mempolicy mode
flags that specify how the policy shall behave in certain circumstances.  The
most immediate need for mode flag support is to suppress remapping the
nodemask of a policy at the time of rebind.

Both the mempolicy mode and flags are passed by the user in the 'int policy'
formal of either the set_mempolicy() or mbind() syscall.  A new constant,
MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
passed as part of this int.  Mempolicies that include illegal flags as part of
their policy are rejected as invalid.

An additional member to struct mempolicy is added to support the mode flags:

	struct mempolicy {
		...
		unsigned short policy;
		unsigned short flags;
	}

The splitting of the 'int' actual passed by the user is done in
sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
there are additional flags, and storing it in the new 'flags' member of struct
mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
the 'policy' member of the struct and all current users of pol->policy remain
unchanged.

The union of the policy mode and optional mode flags is passed back to the
user in get_mempolicy().

This combination of mode and flags within the same actual does not break
userspace code that relies on get_mempolicy(&policy, ...) and either

	switch (policy) {
	case MPOL_BIND:
		...
	case MPOL_INTERLEAVE:
		...
	};

statements or

	if (policy == MPOL_INTERLEAVE) {
		...
	}

statements.  Such applications would need to use optional mode flags when
calling set_mempolicy() or mbind() for these previously implemented statements
to stop working.  If an application does start using optional mode flags, it
will need to mask the optional flags off the policy in switch and conditional
statements that only test mode.

An additional member is also added to struct shmem_sb_info to store the
optional mode flags.

[hugh@veritas.com: shmem mpol: fix build warning]
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes
a3b51e0142 mempolicy: convert MPOL constants to enum
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
MPOL_INTERLEAVE, are better declared as part of an enum since they are
sequentially numbered and cannot be combined.

The policy member of struct mempolicy is also converted from type short to
type unsigned short.  A negative policy does not have any legitimate meaning,
so it is possible to change its type in preparation for adding optional mode
flags later.

The equivalent member of struct shmem_sb_info is also changed from int to
unsigned short.

For compatibility, the policy formal to get_mempolicy() remains as a pointer
to an int:

	int get_mempolicy(int *policy, unsigned long *nmask,
			  unsigned long maxnode, unsigned long addr,
			  unsigned long flags);

although the only possible values is the range of type unsigned short.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Pekka Enberg
1b27d05b6e mm: move cache_line_size() to <linux/cache.h>
Not all architectures define cache_line_size() so as suggested by Andrew move
the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman
19770b3260 mm: filter based on a nodemask as well as a gfp_mask
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy.  As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering.  This eliminates the need for
MPOL_BIND to create a custom zonelist.

A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist.  I.e., pages will be allocated from the closest allowed node with
available memory.

[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman
dd1a239f6f mm: have zonelist contains structs with both a zone pointer and zone_idx
Filtering zonelists requires very frequent use of zone_idx().  This is costly
as it involves a lookup of another structure and a substraction operation.  As
the zone_idx is often required, it should be quickly accessible.  The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary.  Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
54a6eb5c47 mm: use two zonelist that are filtered by GFP mask
Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations.  Based on the zones
allowed by a gfp mask, one of these zonelists is selected.  All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists.  The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
18ea7e710d mm: remember what the preferred zone is for zone_statistics
On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign.  It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.

This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it.  This simplifies the reading of later patches in
this set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
0e88460da6 mm: introduce node_zonelist() for accessing the zonelist for a GFP mask
Introduce a node_zonelist() helper function.  It is used to lookup the
appropriate zonelist given a node and a GFP mask.  The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
necessary, it can be merged with the next patch in this set without problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman
dac1d27bc8 mm: use zonelists instead of zones when direct reclaiming pages
The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags.  The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE.  This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up
the appropriate zonelist for a GFP mask which simplifies patches later in the
set.

The third patch defines/remembers the "preferred zone" for numa statistics, as
it is no longer always the first zone in a zonelist.

The forth patch replaces multiple zonelists with two zonelists that are
filtered.  The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask.
Two zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration.  In real
workloads the gain/loss will depend on how much the userspace portion of the
benchmark benefits from having more cache available due to reduced referencing
of zonelists.

These are the range of performance losses/gains when running against
2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
ppc64 both NUMA and non-NUMA.
			     loss   to  gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation.  Similarly, direct reclaim of pages
iterates over an array of zones.  For consistency, this patch converts direct
reclaim to use a zonelist.  No functionality is changed by this patch.  This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Nick Piggin
3c18ddd160 mm: remove nopage
Nothing in the tree uses nopage any more.  Remove support for it in the
core mm code and documentation (and a few stray references to it in
comments).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Harvey Harrison
ddc81ed2c5 remove sparse warning for mmzone.h
include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction

Calculate the offset into the node_zones array rather than the index
using casts to (char *) and comparing against the index * sizeof(struct zone).

On X86_32 this saves a sar, but code size increases by one byte per
is_highmem() use due to 32-bit cmps rather than 16 bit cmps.

Before:
 207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
 20d:   c1 f8 0b                sar    $0xb,%eax
 210:   83 f8 02                cmp    $0x2,%eax
 213:   74 16                   je     22b <kmap_atomic_prot+0x144>
 215:   83 f8 03                cmp    $0x3,%eax
 218:   0f 85 8f 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
 21e:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
 225:   0f 85 82 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
 22b:   64 a1 00 00 00 00       mov    %fs:0x0,%eax

After:
 207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
 20d:   3d 00 10 00 00          cmp    $0x1000,%eax
 212:   74 18                   je     22c <kmap_atomic_prot+0x145>
 214:   3d 00 18 00 00          cmp    $0x1800,%eax
 219:   0f 85 8f 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
 21f:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
 226:   0f 85 82 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
 22c:   64 a1 00 00 00 00       mov    %fs:0x0,%eax

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:17 -07:00
Christoph Lameter
488514d179 Remove set_migrateflags()
Migrate flags must be set on slab creation as agreed upon when the antifrag
logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
unmovable and others in the reclaimable section depending on which flag was
active when a new slab page was allocated.

This likely slid in somehow when antifrag was merged. Remove it.

The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
effect there.

Radix tree allocations are not directly reclaimable but they are allocated
with __GFP_RECLAIMABLE set on each allocation.  We now set
SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
tree slabs are consistently placed in the reclaimable section.  Radix tree
slabs will also be accounted as such.

There is then no user left of set_migratepages. So remove it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:17 -07:00
Badari Pulavarty
ea01ea937d hotplug memory remove: generic __remove_pages() support
Generic helper function to remove section mappings and sysfs entries for the
section of the memory we are removing.  offline_pages() correctly adjusted
zone and marked the pages reserved.

TODO: Yasunori Goto is working on patches to free up allocations from bootmem.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:17 -07:00
Linus Torvalds
064922a805 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (40 commits)
  [SCSI] jazz_esp, sgiwd93, sni_53c710, sun3x_esp: fix platform driver hotplug/coldplug
  [SCSI] aic7xxx: add const
  [SCSI] aic7xxx: add static
  [SCSI] aic7xxx: Update _shipped files
  [SCSI] aic7xxx: teach aicasm to not emit unused debug code/data
  [SCSI] qla2xxx: Update version number to 8.02.01-k2.
  [SCSI] qla2xxx: Correct regression in relogin code.
  [SCSI] qla2xxx: Correct misc. endian and byte-ordering issues.
  [SCSI] qla2xxx: make qla2x00_issue_iocb_timeout() static
  [SCSI] qla2xxx: qla_os.c, make 2 functions static
  [SCSI] qla2xxx: Re-register FDMI information after a LIP.
  [SCSI] qla2xxx: Correct SRB usage-after-completion/free issues.
  [SCSI] qla2xxx: Correct ISP84XX verify-chip response handling.
  [SCSI] qla2xxx: Wakeup DPC thread to process any deferred-work requests.
  [SCSI] qla2xxx: Collapse RISC-RAM retrieval code during a firmware-dump.
  [SCSI] m68k: new mac_esp scsi driver
  [SCSI] zfcp: Add some statistics provided by the FCP adapter to the sysfs
  [SCSI] zfcp: Print some messages only during ERP
  [SCSI] zfcp: Wait for free SBAL during exchange config
  [SCSI] scsi_transport_fc: fc_user_scan correction
  ...
2008-04-27 11:25:00 -07:00
Linus Torvalds
42cadc8600 Merge branch 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (147 commits)
  KVM: kill file->f_count abuse in kvm
  KVM: MMU: kvm_pv_mmu_op should not take mmap_sem
  KVM: SVM: remove selective CR0 comment
  KVM: SVM: remove now obsolete FIXME comment
  KVM: SVM: disable CR8 intercept when tpr is not masking interrupts
  KVM: SVM: sync V_TPR with LAPIC.TPR if CR8 write intercept is disabled
  KVM: export kvm_lapic_set_tpr() to modules
  KVM: SVM: sync TPR value to V_TPR field in the VMCB
  KVM: ppc: PowerPC 440 KVM implementation
  KVM: Add MAINTAINERS entry for PowerPC KVM
  KVM: ppc: Add DCR access information to struct kvm_run
  ppc: Export tlb_44x_hwater for KVM
  KVM: Rename debugfs_dir to kvm_debugfs_dir
  KVM: x86 emulator: fix lea to really get the effective address
  KVM: x86 emulator: fix smsw and lmsw with a memory operand
  KVM: x86 emulator: initialize src.val and dst.val for register operands
  KVM: SVM: force a new asid when initializing the vmcb
  KVM: fix kvm_vcpu_kick vs __vcpu_run race
  KVM: add ioctls to save/store mpstate
  KVM: Rename VCPU_MP_STATE_* to KVM_MP_STATE_*
  ...
2008-04-27 10:13:52 -07:00
Linus Torvalds
fba5c1af5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (49 commits)
  ide-tape: remove tape->merge_stage
  ide-tape: mv tape->merge_stage_size tape->merge_bh_size
  ide-tape: mv idetape_empty_write_pipeline ide_tape_flush_merge_buffer
  ide-tape: mv idetape_discard_read_pipeline ide_tape_discard_merge_buffer
  ide-tape: make __idetape_discard_read_pipeline() of type void
  ide: remove now unused ide_pci_create_host_proc()
  ide: remove /proc/ide/ali
  ide-tape: improve buffer pages freeing strategy
  ide-tape: mv tape->pages_per_stage tape->pages_per_buffer
  ide-tape: mv tape->stage_size tape->buffer_size
  ide-tape: improve buffer allocation strategy
  ide: add struct ide_io_ports (take 3)
  ide: make ide_unregister() take 'ide_hwif_t *' as an argument (take 2)
  ide: sanitize ide_unregister() usage
  mpc8xx-ide: use ide_find_port()
  ide: add "noacpi" / "acpigtf" / "acpionboot" parameters
  gayle: add "doubler" parameter
  ide: add "cdrom=" and "chs=" parameters
  ide: add "nodma|noflush|noprobe|nowerr=" parameters
  ide: remove obsoleted "hdx=autotune" kernel parameter
  ...
2008-04-27 10:13:06 -07:00
Linus Torvalds
2d630d1a68 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  mlx4_core: Add helper to move QP to ready-to-send
  mlx4_core: Add HW queues allocation helpers
  RDMA/nes: Remove volatile qualifier from struct nes_hw_cq.cq_vbase
  mlx4_core: CQ resizing should pass a 0 opcode modifier to MODIFY_CQ
  mlx4_core: Move kernel doorbell management into core
  IB/ehca: Bump version number to 0026
  IB/ehca: Make some module parameters bool, update descriptions
  IB/ehca: Remove mr_largepage parameter
  IB/ehca: Move high-volume debug output to higher debug levels
  IB/ehca: Prevent posting of SQ WQEs if QP not in RTS
  IPoIB: Handle 4K IB MTU for UD (datagram) mode
  RDMA/nes: Fix adapter reset after PXE boot
  RDMA/nes: Print IPv4 addresses in a readable format
  RDMA/nes: Use print_mac() to format ethernet addresses for printing
2008-04-27 10:10:14 -07:00
Al Viro
66c0b394f0 KVM: kill file->f_count abuse in kvm
Use kvm own refcounting instead of playing with ->filp->f_count.
That will allow to get rid of a lot of crap in anon_inode_getfd() and
kill a race in kvm_dev_ioctl_create_vm() (file might have been closed
immediately by another thread, so ->filp might point to already freed
struct file when we get around to setting it).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 18:21:46 +03:00
Hollis Blanchard
b2312f059c KVM: ppc: Add DCR access information to struct kvm_run
Device Control Registers are essentially another address space found on PowerPC
4xx processors, analogous to PIO on x86. DCRs are always 32 bits, and can be
identified by a 32-bit number. We forward most DCR accesses to userspace for
emulation (with the exception of CPR0 registers, which can be read directly
for simplicity in timebase frequency determination).

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 18:21:37 +03:00
Hollis Blanchard
76f7c87902 KVM: Rename debugfs_dir to kvm_debugfs_dir
It's a globally exported symbol now.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 18:21:36 +03:00
Marcelo Tosatti
62d9f0dbc9 KVM: add ioctls to save/store mpstate
So userspace can save/restore the mpstate during migration.

[avi: export the #define constants describing the value]
[christian: add s390 stubs]
[avi: ditto for ia64]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 18:21:16 +03:00
Alexey Dobriyan
fd0949e6e8 ide: remove now unused ide_pci_create_host_proc()
It creates files in proc with obsoleted ->get_info interface.

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:34 +02:00
Bartlomiej Zolnierkiewicz
4c3032d8a4 ide: add struct ide_io_ports (take 3)
* Add struct ide_io_ports and use it instead of `unsigned long io_ports[]`
  in ide_hwif_t.

* Rename io_ports[] in hw_regs_t to io_ports_array[].

* Use un-named union for 'unsigned long io_ports_array[]' and 'struct
  ide_io_ports io_ports' in hw_regs_t.

* Remove IDE_*_OFFSET defines.

v2:
* scc_pata.c build fix from Stephen Rothwell.

v3:
* Fix ctl_adrr typo in Sparc-specific part of ns87415.c.
  (Noticed by Andrew Morton)

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:32 +02:00
Bartlomiej Zolnierkiewicz
387750c3bf ide: make ide_unregister() take 'ide_hwif_t *' as an argument (take 2)
* Make ide_unregister() take 'ide_hwif_t *hwif' instead of 'unsigned int
  index' (hwif->index) as an argument and update all users accordingly.

While at it:

* Remove unnecessary checks for hwif != NULL from ide-pnp.c::idepnp_remove()
  and delkin_cb.c::delkin_cb_remove().

* Remove needless hwif->chipset assignment from scc_pata.c::scc_remove().

v2:
* Fixup ide_unregister() documentation.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:31 +02:00
Bartlomiej Zolnierkiewicz
1dbfeb4bc8 ide: add "noacpi" / "acpigtf" / "acpionboot" parameters
* Rename ide_noacpi{tfs,onboot} to ide_acpi{gtf,onboot} (+ reverse logic).

* Move ide_*acpi* variables to ide-acpi.c and remove unnecessary initializers.

* Add "noacpi" / "acpigtf" / "acpionboot" parameters.

* Obsolete "ide=noacpi" / "ide=acpigtf" / "ide=acpionboot" kernel parameters.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:30 +02:00
Bartlomiej Zolnierkiewicz
207daeaabb ide: remove obsoleted "hdx=autotune" kernel parameter
* Remove obsoleted "hdx=autotune" kernel parameter
  (we always auto-tune PIO if possible nowadays).

* Remove no longer needed ide_drive_t.autotune flag.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:29 +02:00
Bartlomiej Zolnierkiewicz
e160124ff6 ide: remove IDE_HFLAG_NO_AUTOTUNE host flag
* Don't set IDE_HFLAG_NO_AUTOTUNE host flag in sgiioc4 and icside
  host drivers - there is no need for it as they don't implement
  ->set_pio_mode method.

* Remove no longer needed IDE_HFLAG_NO_AUTOTUNE host flag.

There should be no functional changes caused by this patch.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:29 +02:00
Bartlomiej Zolnierkiewicz
ebae41a5a0 ide: add "vlb|pci_clock=" parameter
* Add "vlb_clock=" parameter for specifying VLB clock frequency (in MHz).

* Add "pci_clock=" parameter for specifying PCI bus clock frequency (in MHz).

While at it:

* qd65xx.c: rename {active,recovery}_cycle variables to {act,rec}_cyc.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:29 +02:00
Bartlomiej Zolnierkiewicz
cc12175ff2 ide: remove obsoleted "hdx=noautotune" kernel parameter
Remove obsoleted "hdx=noautotune" kernel parameter
(it has been obsoleted since 1 Nov 2004).

Then make ide_hwif_t.autotune a single bit flag
and remove no longer needed IDE_TUNE_* defines.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:24 +02:00
Bartlomiej Zolnierkiewicz
e460a59751 ide: remove obsoleted "idex=reset" kernel parameter
Remove obsoleted "idex=reset" kernel parameter
(it has been obsoleted since 1 Nov 2004).

Then remove corresponding code from ide_probe_port()
and no longer used ->reset field from ide_hwif_t.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:24 +02:00
Bartlomiej Zolnierkiewicz
9fd91d959f ide: add "ignore_cable" parameter (take 2)
Add "ignore_cable" parameter:

* "ide_core.ignore_cable=[interface_number]" boot option if IDE is built-in
  (i.e. "ide_core.ignore_cable=1" to force ignoring cable for "ide1")

* "ignore_cable=[interface_number]" module parameter (for ide_core module)
  if IDE is compiled as module

v2:
* Add ide_port_apply_params() helper
  - use it in ide_device_add_all() and ide_scan_port().

* Make it possible to later disable ignoring cable detection by passing
  "[interface_number]:0" to /sys/module/ide_core/parameters/ignore_cable
  (however sysfs interface is not enabled yet since it needs some other
   IDE changes to make it work reliable).

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-27 15:38:23 +02:00
Marcelo Tosatti
3d80840d96 KVM: hlt emulation should take in-kernel APIC/PIT timers into account
Timers that fire between guest hlt and vcpu_block's add_wait_queue() are
ignored, possibly resulting in hangs.

Also make sure that atomic_inc and waitqueue_active tests happen in the
specified order, otherwise the following race is open:

CPU0                                        CPU1
                                            if (waitqueue_active(wq))
add_wait_queue()
if (!atomic_read(pit_timer->pending))
    schedule()
                                            atomic_inc(pit_timer->pending)

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:04:11 +03:00
Feng(Eric) Liu
d4c9ff2d1b KVM: Add kvm trace userspace interface
This interface allows user a space application to read the trace of kvm
related events through relayfs.

Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:01:22 +03:00
Feng (Eric) Liu
2714d1d3d6 KVM: Add trace markers
Trace markers allow userspace to trace execution of a virtual machine
in order to monitor its performance.

Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:01:19 +03:00
Anthony Liguori
35149e2129 KVM: MMU: Don't assume struct page for x86
This patch introduces a gfn_to_pfn() function and corresponding functions like
kvm_release_pfn_dirty().  Using these new functions, we can modify the x86
MMU to no longer assume that it can always get a struct page for any given gfn.

We don't want to eliminate gfn_to_page() entirely because a number of places
assume they can do gfn_to_page() and then kmap() the results.  When we support
IO memory, gfn_to_page() will fail for IO pages although gfn_to_pfn() will
succeed.

This does not implement support for avoiding reference counting for reserved
RAM or for IO memory.  However, it should make those things pretty straight
forward.

Since we're only introducing new common symbols, I don't think it will break
the non-x86 architectures but I haven't tested those.  I've tested Intel,
AMD, NPT, and hugetlbfs with Windows and Linux guests.

[avi: fix overflow when shifting left pfns by adding casts]

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:01:15 +03:00
Izik Eidus
d39f13b0da KVM: add vm refcounting
the main purpose of adding this functions is the abilaty to release the
spinlock that protect the kvm list while still be able to do operations
on a specific kvm in a safe way.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:56 +03:00
Christian Borntraeger
e28acfea5d KVM: s390: intercepts for diagnose instructions
This patch introduces interpretation of some diagnose instruction intercepts.
Diagnose is our classic architected way of doing a hypercall. This patch
features the following diagnose codes:
- vm storage size, that tells the guest about its memory layout
- time slice end, which is used by the guest to indicate that it waits
  for a lock and thus cannot use up its time slice in a useful way
- ipl functions, which a guest can use to reset and reboot itself

In order to implement ipl functions, we also introduce an exit reason that
causes userspace to perform various resets on the virtual machine. All resets
are described in the principles of operation book, except KVM_S390_RESET_IPL
which causes a reboot of the machine.

Acked-by: Martin Schwidefsky <martin.schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:46 +03:00
Carsten Otte
ba5c1e9b6c KVM: s390: interrupt subsystem, cpu timer, waitpsw
This patch contains the s390 interrupt subsystem (similar to in kernel apic)
including timer interrupts (similar to in-kernel-pit) and enabled wait
(similar to in kernel hlt).

In order to achieve that, this patch also introduces intercept handling
for instruction intercepts, and it implements load control instructions.

This patch introduces an ioctl KVM_S390_INTERRUPT which is valid for both
the vm file descriptors and the vcpu file descriptors. In case this ioctl is
issued against a vm file descriptor, the interrupt is considered floating.
Floating interrupts may be delivered to any virtual cpu in the configuration.

The following interrupts are supported:
SIGP STOP       - interprocessor signal that stops a remote cpu
SIGP SET PREFIX - interprocessor signal that sets the prefix register of a
                  (stopped) remote cpu
INT EMERGENCY   - interprocessor interrupt, usually used to signal need_reshed
                  and for smp_call_function() in the guest.
PROGRAM INT     - exception during program execution such as page fault, illegal
                  instruction and friends
RESTART         - interprocessor signal that starts a stopped cpu
INT VIRTIO      - floating interrupt for virtio signalisation
INT SERVICE     - floating interrupt for signalisations from the system
                  service processor

struct kvm_s390_interrupt, which is submitted as ioctl parameter when injecting
an interrupt, also carrys parameter data for interrupts along with the interrupt
type. Interrupts on s390 usually have a state that represents the current
operation, or identifies which device has caused the interruption on s390.

kvm_s390_handle_wait() does handle waitpsw in two flavors: in case of a
disabled wait (that is, disabled for interrupts), we exit to userspace. In case
of an enabled wait we set up a timer that equals the cpu clock comparator value
and sleep on a wait queue.

[christian: change virtio interrupt to 0x2603]

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:44 +03:00
Christian Borntraeger
8f2abe6a1e KVM: s390: sie intercept handling
This path introduces handling of sie intercepts in three flavors: Intercepts
are either handled completely in-kernel by kvm_handle_sie_intercept(),
or passed to userspace with corresponding data in struct kvm_run in case
kvm_handle_sie_intercept() returns -ENOTSUPP.
In case of partial execution in kernel with the need of userspace support,
kvm_handle_sie_intercept() may choose to set up struct kvm_run and return
-EREMOTE.

The trivial intercept reasons are handled in this patch:
handle_noop() just does nothing for intercepts that don't require our support
  at all
handle_stop() is called when a cpu enters stopped state, and it drops out to
  userland after updating our vcpu state
handle_validity() faults in the cpu lowcore if needed, or passes the request
  to userland

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:43 +03:00
Heiko Carstens
b0c632db63 KVM: s390: arch backend for the kvm kernel module
This patch contains the port of Qumranet's kvm kernel module to IBM zSeries
 (aka s390x, mainframe) architecture. It uses the mainframe's virtualization
instruction SIE to run virtual machines with up to 64 virtual CPUs each.
This port is only usable on 64bit host kernels, and can only run 64bit guest
kernels. However, running 31bit applications in guest userspace is possible.

The following source files are introduced by this patch
arch/s390/kvm/kvm-s390.c    similar to arch/x86/kvm/x86.c, this implements all
                            arch callbacks for kvm. __vcpu_run calls back into
                            sie64a to enter the guest machine context
arch/s390/kvm/sie64a.S      assembler function sie64a, which enters guest
                            context via SIE, and switches world before and after                            that
include/asm-s390/kvm_host.h contains all vital data structures needed to run
                            virtual machines on the mainframe
include/asm-s390/kvm.h      defines kvm_regs and friends for user access to
                            guest register content
arch/s390/kvm/gaccess.h     functions similar to uaccess to access guest memory
arch/s390/kvm/kvm-s390.h    header file for kvm-s390 internals, extended by
                            later patches

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:42 +03:00
Carsten Otte
402b08622d s390: KVM preparation: provide hook to enable pgstes in user pagetable
The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:40 +03:00
Avi Kivity
69a9f69bb2 KVM: Move some x86 specific constants and structures to include/asm-x86
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:34 +03:00
Christian Borntraeger
97646202bc KVM: kvm.h: __user requires compiler.h
include/linux/kvm.h defines struct kvm_dirty_log to
	[...]
	union {
		void __user *dirty_bitmap; /* one bit per page */
		__u64 padding;
	};

__user requires compiler.h to compile. Currently, this works on x86
only coincidentally due to other include files. This patch makes
kvm.h compile in all cases.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:32 +03:00
Marcelo Tosatti
2f333bcb4e KVM: MMU: hypercall based pte updates and TLB flushes
Hypercall based pte updates are faster than faults, and also allow use
of the lazy MMU mode to batch operations.

Don't report the feature if two dimensional paging is enabled.

[avi:
 - one mmu_op hypercall instead of one per op
 - allow 64-bit gpa on hypercall
 - don't pass host errors (-ENOMEM) to guest]

[akpm: warning fix on i386]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:27 +03:00
Marcelo Tosatti
0cf1bfd273 x86: KVM guest: add basic paravirt support
Add basic KVM paravirt support. Avoid vm-exits on IO delays.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:25 +03:00
Marcelo Tosatti
a28e4f5a62 KVM: add basic paravirt support
Add basic KVM paravirt support. Avoid vm-exits on IO delays.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:24 +03:00
Sheng Yang
e0f63cb927 KVM: Add save/restore supporting of in kernel PIT
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:22 +03:00
Sheng Yang
7837699fa6 KVM: In kernel PIT model
The patch moves the PIT model from userspace to kernel, and increases
the timer accuracy greatly.

[marcelo: make last_injected_time per-guest]

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-and-Acked-by: Alex Davis <alex14641@yahoo.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:21 +03:00
Joerg Roedel
71c4dfafc0 KVM: detect if VCPU triple faults
In the current inject_page_fault path KVM only checks if there is another PF
pending and injects a DF then. But it has to check for a pending DF too to
detect a shutdown condition in the VCPU.  If this is not detected the VCPU goes
to a PF -> DF -> PF loop when it should triple fault. This patch detects this
condition and handles it with an KVM_SHUTDOWN exit to userspace.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:27 +03:00
Marcelo Tosatti
05da45583d KVM: MMU: large page support
Create large pages mappings if the guest PTE's are marked as such and
the underlying memory is hugetlbfs backed.  If the largepage contains
write-protected pages, a large pte is not used.

Gives a consistent 2% improvement for data copies on ram mounted
filesystem, without NPT/EPT.

Anthony measures a 4% improvement on 4-way kernbench, with NPT.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:25 +03:00
Marcelo Tosatti
2e53d63acb KVM: MMU: ignore zapped root pagetables
Mark zapped root pagetables as invalid and ignore such pages during lookup.

This is a problem with the cr3-target feature, where a zapped root table fools
the faulting code into creating a read-only mapping. The result is a lockup
if the instruction can't be emulated.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:25 +03:00
Avi Kivity
ef2979bd98 KVM: Increase the number of user memory slots per vm
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:23 +03:00
Avi Kivity
a988b910ef KVM: Add API for determining the number of supported memory slots
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:23 +03:00
Avi Kivity
edbe6c325d KVM: Increase vcpu count to 16
With NPT support, scalability is much improved.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:23 +03:00
Avi Kivity
f725230af9 KVM: Add API to retrieve the number of supported vcpus per vm
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:23 +03:00
Glauber de Oliveira Costa
18068523d3 KVM: paravirtualized clocksource: host part
This is the host part of kvm clocksource implementation. As it does
not include clockevents, it is a fairly simple implementation. We
only have to register a per-vcpu area, and start writing to it periodically.

The area is binary compatible with xen, as we use the same shadow_info
structure.

[marcelo: fix bad_page on MSR_KVM_SYSTEM_TIME]
[avi: save full value of the msr, even if enable bit is clear]
[avi: clear previous value of time_page]

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:22 +03:00
Hollis Blanchard
31bb117eb4 KVM: Use CONFIG_PREEMPT_NOTIFIERS around struct preempt_notifier
This allows kvm_host.h to be #included even when struct preempt_notifier is
undefined. This is needed to build ppc asm-offsets.h.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:18 +03:00
Linus Torvalds
c3bf9bc243 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3:
  x86_64/mm: check and print vmemmap allocation continuous
  x86_64: fix setup_node_bootmem to support big mem excluding with memmap
  x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
  mm: allow reserve_bootmem() cross nodes
  mm: offset align in alloc_bootmem()
  mm: fix alloc_bootmem_core to use fast searching for all nodes
  mm: make mem_map allocation continuous
2008-04-26 14:04:32 -07:00
Yinghai Lu
c2b91e2eec x86_64/mm: check and print vmemmap allocation continuous
On big systems with lots of memory, don't print out too much during
bootup, and make it easy to find if it is continuous.

on 256G 8 sockets system will get
 [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
[ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
 [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
[ffffe20038700000-ffffe200387fffff] potential offnode page_structs
 [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
 [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
 [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
[ffffe20054700000-ffffe200547fffff] potential offnode page_structs
 [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
[ffffe20070700000-ffffe200707fffff] potential offnode page_structs
 [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
 [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
 [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
[ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
 [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
[ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
 [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
 [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
 [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
[ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
 [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
 [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7

instead of a very long print out...

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-26 22:51:09 +02:00
Linus Torvalds
9b79ed952b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-generic-bitops-v3
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-generic-bitops-v3:
  x86, bitops: select the generic bitmap search functions
  x86: include/asm-x86/pgalloc.h/bitops.h: checkpatch cleanups - formatting only
  x86: finalize bitops unification
  x86, UML: remove x86-specific implementations of find_first_bit
  x86: optimize find_first_bit for small bitmaps
  x86: switch 64-bit to generic find_first_bit
  x86: generic versions of find_first_(zero_)bit, convert i386
  bitops: use __fls for fls64 on 64-bit archs
  generic: implement __fls on all 64-bit archs
  generic: introduce a generic __fls implementation
  x86: merge the simple bitops and move them to bitops.h
  x86, generic: optimize find_next_(zero_)bit for small constant-size bitmaps
  x86, uml: fix uml with generic find_next_bit for x86
  x86: change x86 to use generic find_next_bit
  uml: Kconfig cleanup
  uml: fix build error
2008-04-26 13:46:11 -07:00
Bartlomiej Zolnierkiewicz
f37afdaca7 ide: constify struct ide_dma_ops
* Export ide_dma_exec_cmd() and __ide_dma_test_irq().

* Constify struct ide_dma_ops.

* Always set hwif->dma_ops to &sff_dma_ops in ide_setup_dma()
  (it is later overriden by ide_init_port() if needed) and drop
  'const struct ide_port_info *d' argument.

While at it:

* Rename __ide_dma_test_irq() to ide_dma_test_irq().

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:24 +02:00
Bartlomiej Zolnierkiewicz
5e37bdc081 ide: add struct ide_dma_ops (take 3)
Add struct ide_dma_ops and convert core code + drivers to use it.

While at it:

* Drop "ide_" prefix from ->ide_dma_end and ->ide_dma_test_irq methods.

* Drop "ide_" "infixes" from DMA methods.

* au1xxx-ide.c:
  - use auide_dma_{test_irq,end}() directly in auide_dma_timeout()

* pdc202xx_old.c:
  - drop "old_" "infixes" from DMA methods

* siimage.c:
  - add siimage_dma_test_irq() helper
  - print SATA warning in siimage_init_one()

* Remove no longer needed ->init_hwif implementations.

v2:
* Changes based on review from Sergei:
  - s/siimage_ide_dma_test_irq/siimage_dma_test_irq/
  - s/drive->hwif/hwif/ in idefloppy_pc_intr().
  - fix patch description w.r.t. au1xxx-ide changes
  - fix au1xxx-ide build
  - fix naming for cmd64*_dma_ops
  - drop "ide_" and "old_" infixes
  - s/hpt3xxx_dma_ops/hpt37x_dma_ops/
  - s/hpt370x_dma_ops/hpt370_dma_ops/
  - use correct DMA ops for HPT302/N, HPT371/N and HPT374
  - s/it821x_smart_dma_ops/it821x_pass_through_dma_ops/

v3:
* Two bugs slipped in v2 (noticed by Sergei):
  - use correct DMA ops for HPT374 (for real this time)
  - handle HPT370/HPT370A properly

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:24 +02:00
Bartlomiej Zolnierkiewicz
1fd1890594 ide: add IDE_HFLAG_SERIALIZE_DMA host flag
* Add IDE_HFLAG_SERIALIZE_DMA host flag to serialize ports
  if DMA is available and handle it in ide_init_port().

* Convert sl82c105 host driver to use this new flag.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:24 +02:00
Bartlomiej Zolnierkiewicz
b123f56e04 ide: do complete DMA setup in ->init_dma method (take 2)
* Make ide_hwif_setup_dma() return an error value.

* Pass 'const struct ide_port_info *d' instead of 'unsigned long dmabase'
  to ->init_dma method and make it return an error value.

* Rename ide_get_or_set_dma_base() to ide_pci_dma_base(),
  change ordering of its arguments and then export it.

* Export ide_pci_set_master().

* Do complete DMA setup inside ->init_dma method and update ->init_dma
  users accordingly.

* Sanitize code for DMA setup in ide_init_port().

v2:
* Fix for CONFIG_BLK_DEV_IDEDMA_PCI=n configs
  (from Jiri Slaby <jirislaby@gmail.com>):

  Fix following compiler warning by returning EINVAL:

  In file included from ANYTHING-INCLUDING-IDE.H:45:
  include/linux/ide.h: In function ‘ide_hwif_setup_dma’:
  include/linux/ide.h:1022: warning: no return statement in function returning non-void

Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:22 +02:00
Bartlomiej Zolnierkiewicz
b8e73fba60 ide: export ide_allocate_dma_engine()
Export ide_allocate_dma_engine() and use it in trm290 host driver.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:21 +02:00
Bartlomiej Zolnierkiewicz
5e59c23684 ide: remove ->cds field from ide_hwif_t (take 2)
* Use hwif->name instead of cds->name in ide_allocate_dma_engine().

* Use pci_name(dev) instead of cds->name in init_dma_pdc202xx().

* Remove no longer needed ->cds field from ide_hwif_t.

v2:

* scc_pata.c also needs to be updated now (noticed by Stephen Rothwell).

There should be no functional changes caused by this patch.

Cc: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
Cc: Akira Iguchi <akira2.iguchi@toshiba.co.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:20 +02:00
Bartlomiej Zolnierkiewicz
21a3387ddd ide: remove ->extra field from struct ide_port_info
Always setup hwif->extra_base in ide_iomio_dma() and remove
no longer needed ->extra field from struct ide_port_info.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:20 +02:00
Bartlomiej Zolnierkiewicz
5add222417 ide: remove ide_hwif_request_regions()
Remove no longer used ide_hwif_request_regions() and hwif_request_region().

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:19 +02:00
Bartlomiej Zolnierkiewicz
0d1bad216c ide: manage resources for PCI devices in ide_pci_enable() (take 3)
* Reserve PCI BARs 0-3 (0-1 for single port controllers) in
  ide_pci_enable() and remove ide_hwif_request_regions() call
  from ide_device_add_all() (also cleanup resource management
  in scc_pata host driver).

* Fix handling of PCI BAR 4 in ide_pci_enable(), then cleanup
  ide_iomio_dma() (+ init_hwif_trm290() in trm290 host driver)
  and remove ide_release[_iomio]_dma().

v2:
* Fixup trm290 host driver.

v3:
* Because of scc_pata host driver changes we need to call
  pci_request_selected_regions() also in setup_mmio_scc().

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:19 +02:00
Bartlomiej Zolnierkiewicz
d083c03f25 ide: remove ide_hwif_release_regions()
All host drivers using ide_unregister()/module_exit() have been fixed
to manage resources themselves so this function can be removed now.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:17 +02:00
Bartlomiej Zolnierkiewicz
0bfeee7d41 ide: use ide_legacy_device_add() for qd65xx (take 2)
* Add 'unsigned long config' argument to ide_legacy_device_add()
  for setting hwif->config_data.

* Use ide_find_port_slot() instead of ide_find_port() in
  ide_legacy_device_add().

* Handle IDE_HFLAG_QD_2ND_PORT and IDE_HFLAG_SINGLE host flags in
  ide_legacy_device_add().

* Convert qd65xx host driver to use ide_legacy_device_add().

v2:
* Update ali14xx, dtc2278, ht6560b and umc8672 host drivers.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:16 +02:00
Bartlomiej Zolnierkiewicz
3b36f66b81 ide: add ide_legacy_device_add() helper
Add ide_legacy_device_add() helper for use by legacy VLB host drivers
(+ convert them to use it).

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:16 +02:00
Bartlomiej Zolnierkiewicz
e53cd458d5 ide: remove ->noprobe field from ide_hwif_t
Update IDE PMAC host driver to use drive->noprobe instead of hwif->noprobe
and remove hwif->noprobe completely (it is always set to zero now).

There should be no functional changes caused by this patch.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:16 +02:00
Bartlomiej Zolnierkiewicz
ac95beedf8 ide: add struct ide_port_ops (take 2)
* Move hooks for port/host specific methods from ide_hwif_t to
  'struct ide_port_ops'.

* Add 'const struct ide_port_ops *port_ops' to 'struct ide_port_info'
  and ide_hwif_t.

* Update host drivers and core code accordingly.

While at it:

* Rename ata66_*() cable detect functions to *_cable_detect() to match
  the standard naming. (Suggested by Sergei Shtylyov)

v2:
* Fix build for bast-ide. (Noticed by Andrew Morton)

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 22:25:14 +02:00
Alexander van Heukelum
3a48305028 x86: optimize find_first_bit for small bitmaps
Avoid a call to find_first_bit if the bitmap size is know at
compile time and small enough to fit in a single long integer.
Modeled after an optimization in the original x86_64-specific
code.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 19:21:17 +02:00
Alexander van Heukelum
77b9bd9c49 x86: generic versions of find_first_(zero_)bit, convert i386
Generic versions of __find_first_bit and __find_first_zero_bit
are introduced as simplified versions of __find_next_bit and
__find_next_zero_bit. Their compilation and use are guarded by
a new config variable GENERIC_FIND_FIRST_BIT.

The generic versions of find_first_bit and find_first_zero_bit
are implemented in terms of the newly introduced __find_first_bit
and __find_first_zero_bit.

This patch does not remove the i386-specific implementation,
but it does switch i386 to use the generic functions by setting
GENERIC_FIND_FIRST_BIT=y for X86_32.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 19:21:16 +02:00
Alexander van Heukelum
64970b68d2 x86, generic: optimize find_next_(zero_)bit for small constant-size bitmaps
This moves an optimization for searching constant-sized small
bitmaps form x86_64-specific to generic code.

On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
changes with this applied. I have observed only four places where this
optimization avoids a call into find_next_bit:

In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

On x86_64, 52 locations are optimized with a minimal increase in
code size:

Current #testing defconfig:
	146 x bsf, 27 x find_next_*bit
   text    data     bss     dec     hex filename
   5392637  846592  724424 6963653  6a41c5 vmlinux

After removing the x86_64 specific optimization for find_next_*bit:
	94 x bsf, 79 x find_next_*bit
   text    data     bss     dec     hex filename
   5392358  846592  724424 6963374  6a40ae vmlinux

After this patch (making the optimization generic):
	146 x bsf, 27 x find_next_*bit
   text    data     bss     dec     hex filename
   5392396  846592  724424 6963412  6a40d4 vmlinux

[ tglx@linutronix.de: build fixes ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 19:21:16 +02:00
Linus Torvalds
d485cb9aa2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-optimized-inlining
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-optimized-inlining:
  generic: make optimized inlining arch-opt-in
  x86: add optimized inlining
2008-04-26 09:48:52 -07:00
Ingo Molnar
765c68bd54 generic: make optimized inlining arch-opt-in
Stephen Rothwell reported that linux-next did not build on powerpc64.

make optimized inlining dependent on architecture opt-in.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 17:44:55 +02:00
Ingo Molnar
60a3cdd063 x86: add optimized inlining
add CONFIG_OPTIMIZE_INLINING=y.

allow gcc to optimize the kernel image's size by uninlining
functions that have been marked 'inline'. Previously gcc was
forced by Linux to always-inline these functions via a gcc
attribute:

 #define inline	inline __attribute__((always_inline))

Especially when the user has already selected
CONFIG_OPTIMIZE_FOR_SIZE=y this can make a huge difference in
kernel image size (using a standard Fedora .config):

   text    data     bss     dec           hex filename
   5613924  562708 3854336 10030968    990f78 vmlinux.before
   5486689  562708 3854336  9903733    971e75 vmlinux.after

that's a 2.3% text size reduction (!).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 17:44:55 +02:00
Bartlomiej Zolnierkiewicz
05230e23cf ide: remove hwif->straight8 flag
All host drivers now either set hwif->mmio or reserve continuous
I/O resources so remove no longer needed hwif->straight8 flag
and never reached code for 'hwif->straight8 == 0' case.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:39 +02:00
Bartlomiej Zolnierkiewicz
951784b667 ide: remove IDE_HFLAG_CY82C693 host flag
Sergei suggested that it shouldn't be necessary + it had no effect
anyway since ide_id_dma_bug() is called earlier in ide_tune_dma().

Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:38 +02:00
Adrian Bunk
b3a37f1284 remove include/linux/hdsmart.h
include/linux/hdsmart.h is not used by the kernel and should therefore
be removed.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "Robert P. J. Day" <rpjday@crashcourse.ca>,
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:37 +02:00
Bartlomiej Zolnierkiewicz
e277f91fef ide: use ide_find_port() in legacy VLB host drivers (take 2)
* Add IDE_HFLAG_QD_2ND_PORT host flag to indicate the need of skipping
  first ide_hwifs[] slot for the second port of QD65xx controller.

* Handle this new host flag in ide_find_port_slot().

* Convert legacy VLB host drivers to use ide_find_port().

While at it:

* Fix couple of printk()-s in qd65xx host driver to not use hwif->name.

v2:
* Fix qd65xx.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:36 +02:00
Bartlomiej Zolnierkiewicz
fe80b937c9 ide: merge ide_match_hwif() and ide_find_port()
* Change ide_match_hwif() argument from 'u8 bootable' to
  'struct ide_port_info *d'.

* Move ide_match_hwif() to ide-probe.c from setup-pci.c and rename
  it to ide_find_port_slot().  Update some comments while at it.

* ide_find_port() can be now just a wrapper for ide_find_port_slot().

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:36 +02:00
Bartlomiej Zolnierkiewicz
078fdf789c ide: remove PIO "downgrade" quirk
No need for it nowadays so remove quirk code from ide_get_best_pio_mode()
and IDE_HFLAG_PIO_DOWNGRADE host flag.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:36 +02:00
Bartlomiej Zolnierkiewicz
5e71d9c5a5 ide: IDE_HFLAG_BOOTABLE -> IDE_HFLAG_NON_BOOTABLE
"bootable" should be the default behavior so replace
IDE_HFLAG_BOOTABLE host flag with IDE_HFLAG_NON_BOOTABLE.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:35 +02:00
Bartlomiej Zolnierkiewicz
59bff5ba55 ide: cleanup ide_find_port()
Remove no longer needed matching against I/O base and 'base' argument.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-26 17:36:32 +02:00
Linus Torvalds
bc84e0a160 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [PATCH] sanitize locate_fd()
  [PATCH] sanitize unshare_files/reset_files_struct
  [PATCH] sanitize handling of shared descriptor tables in failing execve()
  [PATCH] close race in unshare_files()
  [PATCH] restore sane ->umount_begin() API
  cifs: timeout dfs automounts +little fix.
2008-04-25 19:05:55 -07:00
Yevgeny Petrilin
ed4d3c1061 mlx4_core: Add helper to move QP to ready-to-send
Avoid duplicating code in ethernet and FC modules.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-25 14:52:32 -07:00
Yevgeny Petrilin
38ae6a5354 mlx4_core: Add HW queues allocation helpers
Wrap doorbell, buffer and MTT allocation in helper functions for
ethernet and FC modules to use.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-04-25 14:27:08 -07:00
Linus Torvalds
b9fa38f75e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (49 commits)
  [POWERPC] Add zImage.iseries to arch/powerpc/boot/.gitignore
  [POWERPC] bootwrapper: fix build error on virtex405-head.S
  [POWERPC] 4xx: Fix 460GT support to not enable FPU
  [POWERPC] 4xx: Add NOR FLASH entries to Canyonlands and Glacier dts
  [POWERPC] Xilinx: of_serial support for Xilinx uart 16550.
  [POWERPC] Xilinx: boot support for Xilinx uart 16550.
  [POWERPC] celleb: Add support for PCI Express
  [POWERPC] celleb: Move miscellaneous files for Beat
  [POWERPC] celleb: Move a file for SPU on Beat
  [POWERPC] celleb: Move files for Beat mmu and iommu
  [POWERPC] celleb: Move files for Beat hvcall interfaces
  [POWERPC] celleb: Move the SCC related code for celleb
  [POWERPC] celleb: Move the files for celleb base support
  [POWERPC] celleb: Consolidate io-workarounds code
  [POWERPC] cell: Generalize io-workarounds code
  [POWERPC] Add CONFIG_PPC_PSERIES_DEBUG to enable debugging for platforms/pseries
  [POWERPC] Convert from DBG() to pr_debug() in platforms/pseries/
  [POWERPC] Register udbg console early on pseries LPAR
  [POWERPC] Mark udbg console as CON_ANYTIME, ie. callable early in boot
  [POWERPC] Set udbg_console index to 0
  ...
2008-04-25 12:52:16 -07:00