For all architectures, this just means that you'll see a "Memory Model"
choice in your architecture menu. For those that implement DISCONTIGMEM,
you may eventually want to make your ARCH_DISCONTIGMEM_ENABLE a "def_bool
y" and make your users select DISCONTIGMEM right out of the new choice
menu. The only disadvantage might be if you have some specific things that
you need in your help option to explain something about DISCONTIGMEM.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch effectively eliminates direct use of pgdat->node_mem_map outside
of the DISCONTIG code. On a flat memory system, these fields aren't
currently used, neither are they on a sparsemem system.
There was also a node_mem_map(nid) macro on many architectures. Its use
along with the use of ->node_mem_map itself was not consistent. It has
been removed in favor of two new, more explicit, arch-independent macros:
pgdat_page_nr(pgdat, pagenr)
nid_page_nr(nid, pagenr)
I called them "pgdat" and "nid" because we overload the term "node" to mean
"NUMA node", "DISCONTIG node" or "pg_data_t" in very confusing ways. I
believe the newer names are much clearer.
These macros can be overridden in the sparsemem case with a theoretically
slower operation using node_start_pfn and pfn_to_page(), instead. We could
make this the only behavior if people want, but I don't want to change too
much at once. One thing at a time.
This patch removes more code than it adds.
Compile tested on alpha, alpha discontig, arm, arm-discontig, i386, i386
generic, NUMAQ, Summit, ppc64, ppc64 discontig, and x86_64. Full list
here: http://sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/configs/
Boot tested on NUMAQ, x86 SMP and ppc64 power4/5 LPARs.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin J. Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The SGI IOC4 I/O controller chip drivers are currently all configured by
CONFIG_BLK_DEV_SGIIOC4. This is undesirable as not all IOC4 hardware features
are needed by all systems.
This patch adds two configuration variables, CONFIG_SGI_IOC4 for core IOC4
driver support (see patch 1/3 in this series for further explanation) and
CONFIG_SERIAL_SGI_IOC4 to independently enable serial port support.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Acked-by: Pat Gefre <pfg@sgi.com>
Acked-by: Jeremy Higdon <jeremy@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the bits to make the XPC code use the uncached
allocator rather than calling into the mspec driver. It also includes the
mspec.h header which is required to build the XPC modules.
Signed-off-by: Jes Sorensen <jes@wildopensource.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the ia64 uncached page allocator and the generic
allocator (genalloc). The uncached allocator was formerly part of the SN2
mspec driver but there are several other users of it so it has been split
off from the driver.
The generic allocator can be used by device driver to manage special memory
etc. The generic allocator is based on the allocator from the sym53c8xx_2
driver.
Various users on ia64 needs uncached memory. The SGI SN architecture requires
it for inter-partition communication between partitions within a large NUMA
cluster. The specific user for this is the XPC code. Another application is
large MPI style applications which use it for synchronization, on SN this can
be done using special 'fetchop' operations but it also benefits non SN
hardware which may use regular uncached memory for this purpose. Performance
of doing this through uncached vs cached memory is pretty substantial. This
is handled by the mspec driver which I will push out in a seperate patch.
Rather than creating a specific allocator for just uncached memory I came up
with genalloc which is a generic purpose allocator that can be used by device
drivers and other subsystems as they please. For instance to handle onboard
device memory. It was derived from the sym53c7xx_2 driver's allocator which
is also an example of a potential user (I am refraining from modifying sym2
right now as it seems to have been under fairly heavy development recently).
On ia64 memory has various properties within a granule, ie. it isn't safe to
access memory as uncached within the same granule as currently has memory
accessed in cached mode. The regular system therefore doesn't utilize memory
in the lower granules which is mixed in with device PAL code etc. The
uncached driver walks the EFI memmap and pulls out the spill uncached pages
and sticks them into the uncached pool. Only after these chunks have been
utilized, will it start converting regular cached memory into uncached memory.
Hence the reason for the EFI related code additions.
Signed-off-by: Jes Sorensen <jes@wildopensource.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A lot of the code in arch/*/mm/hugetlbpage.c is quite similar. This patch
attempts to consolidate a lot of the code across the arch's, putting the
combined version in mm/hugetlb.c. There are a couple of uglyish hacks in
order to covert all the hugepage archs, but the result is a very large
reduction in the total amount of code. It also means things like hugepage
lazy allocation could be implemented in one place, instead of six.
Tested, at least a little, on ppc64, i386 and x86_64.
Notes:
- this patch changes the meaning of set_huge_pte() to be more
analagous to set_pte()
- does SH4 need s special huge_ptep_get_and_clear()??
Acked-by: William Lee Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the core of the (much simplified) early reclaim. The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.
One of the major uses of this is NUMA machines. With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.
This adds a zone tuneable to enable early zone reclaim. It is selected on a
per-zone basis and is turned on/off via syscall.
Adding some extra throttling on the reclaim was also required (patch
4/4). Without the machine would grind to a crawl when doing a "make -j"
kernel build. Even with this patch the System Time is higher on
average, but it seems tolerable. Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
wall user sys %cpu ctx sw. sleeps
---- ---- --- ---- ------ ------
No patch 1009 1384 847 258 298170 504402
w/patch, no reclaim 880 1376 667 288 254064 396745
w/patch & reclaim 1079 1385 926 252 291625 548873
These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot. Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.
I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.
Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).
The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes handling of accesses to ar.rsc via ptrace & restore_sigcontext
[With Thanks to Chris Wright for noticing the restore_sigcontext path]
Signed-off-by: Matthew Chapman <matthewc@hp.com>
Acked-by: David Mosberger <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Remove "pci=routeirq" option for ia64. This was a workaround
after ACPI IRQ routing was changed from "all at boot for everything
in _PRT" to "do it when the device is enabled" in case there were
drivers that didn't use pci_enable_device().
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The nested_dtlb_miss handler currently does not handle fault from
hugetlb address correctly. It walks the page table assuming PAGE_SIZE.
Thus when taking a fault triggered from hugetlb address, it would not
calculate the pgd/pmd/pte address correctly and thus result an incorrect
invocation of ia64_do_page_fault(). In there, kernel will signal SIGBUS
and application dies (The faulting address is perfectly legal and we
have a valid pte for the corresponding user hugetlb address as well).
This patch fix the described kernel bug. Since nested_dtlb_miss is a
rare event and a slow path anyway, I'm making the change without #ifdef
CONFIG_HUGETLB_PAGE for code readability. Tony, please apply.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
printk() calls should include appropriate KERN_* constant.
Signed-off-by: Christophe Lucas <clucas@rotomalug.org>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Allow the SGI simulator (medusa) to work on generic kernels. There is
no inherent dependency on an sn2-specific kernel.
Boot tested on Altix, medusa and HP rx2600.
Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Refresh arch/ia64/defconfig, as it was getting a bit stale. The only
manual changes I made were:
CONFIG_SCSI_SATA=y needed for some Altix base I/O cards
CONFIG_SCSI_SATA_VITESSE=y
CONFIG_DM_MULTIPATH=m the rest are already modules
CONFIG_FUSION_SPI=y new driver breakout
CONFIG_FUSION_FC=m
CONFIG_SGI_TIOCX=y enable some other SGI drivers
CONFIG_SGI_MBCS=m
CONFIG_AGP_SGI_TIOCA=m
Boot tested on Altix, HP rx2600 and Intel Tiger
Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The latest assembler catches this typo. (reported by Jim Wilson).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
current->blocked will be set to the value of current->thread_info->flags if the
cmpxchg to update thread_info->flags fails. For performance reasons the store into
current->blocked was placed in the cmpxchg loop. However, the cmpxchg overwrites the
register holding the value to be stored. In the rare case of a retry the value of
thread_info->flags will be written into current->blocked.
The fix is to use another register so that the register containing the current->blocked
value is not overwritten.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There've been reports of problems with CONFIG_PREEMPT=y and the high
floating point partition. This is caused by the possibility of preemption
and rescheduling on a different processor while saving or restioirng the
high partition.
The only places where the FPU state is touched are in ptrace, in
switch_to(), and where handling a floating-point exception. In switch_to()
preemption is off. So it's only in trap.c and ptrace.c that we need to
prevent preemption.
Here is a patch that adds commentary to make the conditions clear, and adds
appropriate preempt_{en,dis}able() calls to make it so. In trap.c I use
preempt_enable_no_resched(), as we're about to return to user space where
the preemption flag will be checked anyway.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
break.b does not store the break number in cr.iim, instead it stores 0,
which makes all break.b instructions look like BUG(). Extract the
break number from the instruction itself.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Christian Hildner pointed out that the comment did not match what the
code does in cpu_init() when we set up the default control register.
Patch based on suggestions from Ken Chen.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Some bits of the kernel assume that gp always points to valid memory,
in particular PHYSICAL_MODE_ENTER() assumes that both gp and sp are
valid virtual addresses with associated physical pages. The IA64
module loader puts gp well past the end of the module, with no physical
backing. Offsets on gp are still valid, but physical mode addressing
breaks for modules. Ensure that gp always falls within the module
body. Also ensure that gp is 8 byte aligned.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Fix a bug in which shub_1_1_found is not being properly initialized or set,
resulting in the improper setting of sn_hub_info->shub_1_1_found.
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This gets rid of an unused variable `error' in sys_ia32.c:sys32_epoll_wait()
Getting rid of this one makes parsing the output of the kernecomp
autobuild easier --- searching for `Error' to find a problem kept
hitting this one, even though it's only a warning.
Signed-off-by: Tony Luck <tony.luck@intel.com>
The attached patch cleans up a compilation warning when ACPI
is turned off (i.e., when compiling for the Ski simulator).
Signed-off-by: Tony Luck <tony.luck@intel.com>
In IA64 kernel, sys_mmap calls do_mmap2 and do_mmap2 returns addr if
len=0, which means the mmap sys call succeeds.
Posix.1 says:
The mmap() function shall fail if:
[EINVAL] The value of len is zero.
Here is a patch to fix it.
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Acked-by: David Mosberger <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
I applied the penultimate version of the perfmon patch, which didn't have
the initialization of the new spinlock that was added.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Patch from Charles Spirakis
Some linux customers want to optimize their applications on the latest
hardware but are not yet willing to upgrade to the latest kernel. This
patch provides a way to plug in an alternate, basic, and GPL'ed PMU
subsystem to help with their monitoring needs or for specialty work. It
can also be used in case of serious unexpected bugs in perfmon. Mutual
exclusion between the two subsystems is guaranteed, hence no conflict
can arise from both subsystem being present.
Acked-by: Stephane Eranian <eranian@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The 2.6 kernel has CPE error thresholding.
This patch lets SAL know of this error handling feature.
The changes are SN specific.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
acpi_request_vector() is called in ia64_mca_init() to get the cpe_vector.
The problem is that acpi_request_vector() looks in platform_intr_list[] to
get the vector, but platform_intr_list[] is not initialized with a valid
vector until later (in sn_setup()). Without a valid vector the code
defaults to polling mode.
This patch moves the call to acpi_request_vector() from ia64_mca_init()
to ia64_mca_late_init(), which is after platform_intr_list[] is initialized.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
convert_to_non_syscall() has the same problem that unwind_to_user()
used to have. Fix it likewise.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
These days <linux/ioctl32.h> handles everything, no need for an asm
header on just two architectures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some time ago, GAS was fixed to bring the .spillpsp directive in line
with the Intel assembler manual (there was some disagreement as to
whether or not there is a built-in 16-byte offset). Unfortunately,
there are two places in the kernel where this directive is used in
handwritten assembly files and those of course relied on the "buggy"
behavior. As a result, when using a "fixed" assembler, the kernel
picks up the UNaT bits from the wrong place (off by 16) and randomly
sets NaT bits on the scratch registers. This can be noticed easily by
looking at a coredump and finding various scratch registers with
unexpected NaT values. The patch below fixes this by using the
.spillsp directive instead, which works correctly no matter what
assembler is in use.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
I noticed this typo when trying to compile a kernel which had
CONFIG_HOTPLUG turned off. In that case, __devinit is no longer a
no-op and the compiler then detects a section-conflict. Fix by using
__devinitdata instead of __devinit.
Same patch also submitted by Darren Williams to fix compilation error
using sim_defconfig (which has CONFIG_HOTPLUG=n).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Darren Williams <dsw@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Without this patch, the stack is placed _below_ the current task
structure, which is risky at best.
Tony, I think this patch needs to go into 2.6.12, since it fixes a
real bug. Without it, INIT may case secondary errors, which would be
most unpleasant.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
While looking at code generated by gcc4.0 I noticed some functions still
had frame pointers, even after we stopped ppc64 from defining
CONFIG_FRAME_POINTER. It turns out kernel/Makefile hardwires
-fno-omit-frame-pointer on when compiling schedule.c.
Create CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER and define it on architectures
that dont require frame pointers in sched.c code.
(akpm: blame me for the name)
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the p_nodepda and p_subnodepda pointers from the pda_s structure.
And then define a new per-cpu pointer to the nodepda and export it so
that it can be accessed by kernel modules.
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
- pfm_context_load(): change return value from EINVAL to EBUSY
when context is already loaded.
- pfm_check_task_state(): pass test if context state is MASKED.
It is safe to give access on PFM_CTX_MASKED because the PMU
state (PMD) is stable and saved in software state.
This helps multiplexing programs such as the example given
in libpfm-3.1.
Signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The pmu_active test is based on the values of PSR.up. THIS IS THE PROBLEM as
it does not take into account the lazy restore logic which is as follow (simplified):
context switch out:
save PMDs
clear psr.up
release ownership
context switch in:
if (ctx->last_cpu == smp_processor_id() && ctx->cpu_activation == cpu_activation) {
set psr.up
return
}
restore PMD
restore PMC
ctx->last_cpu = smp_processor_id();
ctx->activation = ++cpu_activation;
set psr.up
The key here is that on context switch out, we clear psr.up and on context switch in
we check if nobody else used the PMU on that processor since last time we came. In
that case, we assume the PMD/PMC are ours and we simply reactivate.
The Caliper problem is that between the moment we context switch out and the moment we
come back, nobody effectively used the PMU BUT the processor went idle. Normally this
would have no incidence but PAL_HALT does alter the PMU registers. In default_idle(),
the test on psr.up is not strong enough to cover this case and we go into PAL which
trashed the PMU resgisters. When we come back we falsely assume that this is our state
yet it is corrupted. Very nasty indeed.
To avoid the problem it is necessary to forbid going to PAL_HALT as soon as perfmon
installs some valid state in the PMU registers. This happens with an application
attaches a context to a thread or CPU. It is not enough to check the psr/dcr bits.
Hence I propose the attached patch. It adds a callback in process.c to modify the
condition to enter PAL on idle. Basically, now it is conditional to pal_halt=1 AND
perfmon saying it is okay.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Correct a bug where tioca_dma_mapped() is putting tioca dma map structs
on the wrong list.
Signed-off-by: Mark Maule <maule@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When SAL calls back into the OS, the OS code is running with preempt
disabled so it cannot call sleeping functions.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Jack Steiner uncovered some opportunities for improvement in
the MCA recovery code.
1) Set bsp to save registers on the kernel stack.
2) Disable interrupts while in the MCA recovery code.
3) Change the way the user process is killed, to avoid
a panic in schedule.
Testing shows that these changes make the recovery code much
more reliable with the 2.6.12 kernel.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>