Commit Graph

1502 Commits

Author SHA1 Message Date
Andi Kleen
ef9257668e x86: do kernel direct mapping at boot using GB pages
The AMD Fam10h CPUs support new Gigabyte page table entry for
mapping 1GB at a time. Use this for the kernel direct mapping.

Only done for 64bit because i386 does not support GB page tables.

This only applies to the data portion of the direct mapping; the
kernel text mapping stays with 2MB pages because the AMD Fam10h
microarchitecture does not support GB ITLBs and AMD recommends
against using GB mappings for code.

Can be disabled with disable_gbpages on the kernel command line

[ tglx@linutronix.de: simplify enable code ]
[ Yinghai Lu <yinghai.lu@sun.com>: boot fix on 256 GB RAM ]

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
Ingo Molnar
00d1c5e057 x86: add gbpages switches
These new controls toggle experimental support for a new CPU feature,
the straightforward extension of largepages from the pmd level to the
pud level, which allows 1GB (kernel) TLBs instead of 2MB TLBs.

Turn it off by default, as this code has not been tested well enough yet.

Use the CONFIG_DIRECT_GBPAGES=y .config option or gbpages on the
boot line can be used to enable it. If enabled in the .config then
nogbpages boot option disables it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
H. Peter Anvin
fe770bf031 x86: clean up the page table dumper and add 32-bit support
Clean up the page table dumper (fix boundary conditions, table driven
address ranges, some formatting changes since it is no longer using
the kernel log but a separate virtual file), and generalize to 32
bits.

[ mingo@elte.hu: x86: fix the pagetable dumper ]

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
Arjan van de Ven
926e5392ba x86: add code to dump the (kernel) page tables for visual inspection by kernel developers
This patch adds code to the kernel to have an (optional)
/proc/kernel_page_tables debug file that basically dumps the kernel
pagetables; this allows us kernel developers to verify that nothing fishy is
going on and that the various mappings are set up correctly. This was quite
useful in finding various change_page_attr() bugs, and is very likely to be
useful in the future as well.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: mingo@elte.hu
Cc: tglx@tglx.de
Cc: hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
H. Peter Anvin
2596e0fae0 x86: unify arch/x86/mm/Makefile
Unify arch/x86/mm/Makefile between 32 and 64 bits.

All configuration variables that are protected by Kconfig constraints
have been put in the common part of the Makefile; however, the NUMA
files are totally different between 32 and 64 bits and are handled via
an ifdef.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
Thomas Gleixner
ee7ae7a198 x86: add debug info to DEBUG_PAGEALLOC
Add debug information for DEBUG_PAGEALLOC to get some statistics about
the pool usage and split status.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 17:40:45 +02:00
Roland McGrath
5de253cc5b x86 vDSO: don't map 32-bit vdso when disabled
We map a VMA for the 32-bit vDSO even when it's disabled, which is stupid.
For the 32-bit kernel it's the vdso_enabled boot parameter/sysctl
and for the 64-bit kernel it's the vdso32 boot parameter/syscall32 sysctl.

When it's disabled, we don't pass AT_SYSINFO_EHDR so processes don't use
the vDSO for anything, but we still map it.  For the non-compat vDSO,
this means we're always putting an extra VMA somewhere, maybe lousing
up the control of the address space the user was hoping for.

Honor the setting by doing nothing in arch_setup_additional_pages.

[ also see: "x86 vDSO: don't use disabled vDSO for signal trampoline" ]

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 17:40:45 +02:00
Roland McGrath
1a3e4ca41c x86 vDSO: don't use disabled vDSO for signal trampoline
If the vDSO was not mapped, don't use it as the "restorer" for a signal
handler.  Whether we have a pointer in mm->context.vdso depends on what
happened at exec time, so we shouldn't check any global flags now.

Background:

Currently, every 32-bit exec gets the vDSO mapped even if it's disabled
(the process just doesn't get told about it).  Because it's in fact
always there, the bug that this patch fixes cannot happen now.  With
the second patch, it won't be mapped at all when it's disabled, which is
one of the things that people might really want when they disable it (so
nothing they didn't ask for goes into their address space).

The 32-bit signal handler setup when SA_RESTORER is not used refers to
current->mm->context.vdso without regard to whether the vDSO has been
disabled when the process was exec'd.  This patch fixes this not to use
it when it's null, which becomes possible after the second patch. (This
never happens in normal use, because glibc's sigaction call uses
SA_RESTORER unless glibc detected the vDSO.)

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 17:40:45 +02:00
Ingo Molnar
85eb69a16a x86: increase the kernel text limit to 512 MB
people sometimes do crazy stuff like building really large static
arrays into their kernels or building allyesconfig kernels. Give
more space to the kernel and push modules up a bit: kernel has
512 MB and modules have 1.5 GB.

Should be enough for a few years ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 17:40:45 +02:00
Ingo Molnar
b4e0409a36 x86: check vmlinux limits, 64-bit
these build-time and link-time checks would have prevented the
vmlinux size regression.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 17:40:45 +02:00
Björn Steinbrink
223ac2f42d x86, pci: fix off-by-one errors in some pirq warnings
fix bogus pirq warnings reported in:

  http://bugzilla.kernel.org/show_bug.cgi?id=10366

safe to be backported to v2.6.25 and earlier.

Cc: stable@kernel.org
Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 17:40:45 +02:00
yakui.zhao@intel.com
b87e81e5c6 acpi: unneccessary to scan the PCI bus already scanned
http://bugzilla.kernel.org/show_bug.cgi?id=10124

this change:

      commit 08f1c192c3
      Author: Muli Ben-Yehuda <muli@il.ibm.com>
      Date:   Sun Jul 22 00:23:39 2007 +0300

         x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata

         This patch introduces struct pci_sysdata to x86 and x86-64, and
         converts the existing two users (NUMA, Calgary) to use it.

         This lays the groundwork for having other users of sysdata, such as
         the PCI domains work.

         The Calgary bits are tested, the NUMA bits just look ok.

replaces pcibios_scan_root by pci_scan_bus_parented...

but in pcibios_scan_root we have a check about scanned busses.

Cc: <yakui.zhao@intel.com>
Cc: Stian Jordet <stian@jordet.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-15 19:35:41 -07:00
Roland McGrath
54a0151041 asmlinkage_protect replaces prevent_tail_call
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs.  The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.

Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.

More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function.  This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else.  It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-10 17:28:26 -07:00
Venki Pallipadi
783e391b7b x86: Simplify cpu_idle_wait
This patch also resolves hangs on boot:
	http://lkml.org/lkml/2008/2/23/263
	http://bugzilla.kernel.org/show_bug.cgi?id=10093

The bug was causing once-in-few-reboots 10-15 sec wait during boot on
certain laptops.

Earlier commit 40d6a14662 added
smp_call_function in cpu_idle_wait() to kick cpus that are in tickless
idle.  Looking at cpu_idle_wait code at that time, code seemed to be
over-engineered for a case which is rarely used (while changing idle
handler).

Below is a simplified version of cpu_idle_wait, which just makes a dummy
smp_call_function to all cpus, to make them come out of old idle handler
and start using the new idle handler.  It eliminates code in the idle
loop to handle cpu_idle_wait.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-10 15:38:29 -07:00
Steven Rostedt
f4be31ec96 pop previous section in alternative.c
gcc expects all toplevel assembly to return to the original section type.
The code in alteranative.c does not do this. This caused some strange bugs
in sched-devel where code would end up in the .rodata section and when
the kernel sets the NX bit on all .rodata, the kernel would crash when
executing this code.

This patch adds a .previous marker to return the code back to the
original section.

Credit goes to Andrew Pinski for telling me it wasn't a gcc bug but a
bug in the toplevel asm code in the kernel.  ;-)

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-09 18:38:08 -07:00
Karsten Wiese
4f41c94d5c x86: fix call to set_cyc2ns_scale() from time_cpufreq_notifier()
In time_cpufreq_notifier() the cpu id to act upon is held in freq->cpu. Use it
instead of smp_processor_id() in the call to set_cyc2ns_scale().
This makes the preempt_*able() unnecessary and lets set_cyc2ns_scale() update
the intended cpu's cyc2ns.

Related mail/thread: http://lkml.org/lkml/2007/12/7/130

Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-07 21:09:14 +02:00
Ingo Molnar
5b13d86357 revert "x86: tsc prevent time going backwards"
revert:

| commit 47001d6033
| Author: Thomas Gleixner <tglx@linutronix.de>
| Date:   Tue Apr 1 19:45:18 2008 +0200
|
|     x86: tsc prevent time going backwards

it has been identified to cause suspend regression - and the
commit fixes a longstanding bug that existed before 2.6.25 was
opened - so it can wait some more until the effects are better
understood.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-07 21:09:14 +02:00
Rusty Russell
64ba4f230d Fix booting pentium+ with dodgy TSC
We handle a broken tsc these days, so no need to panic.  We clear the
TSC bit when tsc_init decides it's unreliable (eg.  under lguest w/ bad
host TSC), leading to bogus panic.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-06 16:10:40 -07:00
Thomas Gleixner
5761d64b27 x86: revert assign IRQs to hpet timer
The commits:

commit 37a47db8d7
Author: Balaji Rao <balajirrao@gmail.com>
Date:   Wed Jan 30 13:30:03 2008 +0100

    x86: assign IRQs to HPET timers, fix

and

commit e3f37a54f6
Author: Balaji Rao <balajirrao@gmail.com>
Date:   Wed Jan 30 13:30:03 2008 +0100

    x86: assign IRQs to HPET timers

have been identified to cause a regression on some platforms due to
the assignement of legacy IRQs which makes the legacy devices
connected to those IRQs disfunctional.

Revert them.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=10382

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:49 +02:00
Thomas Gleixner
47001d6033 x86: tsc prevent time going backwards
We already catch most of the TSC problems by sanity checks, but there
is a subtle bug which has been in the code for ever. This can cause
time jumps in the range of hours.

This was reported in:
     http://lkml.org/lkml/2007/8/23/96
and
     http://lkml.org/lkml/2008/3/31/23

I was able to reproduce the problem with a gettimeofday loop test on a
dual core and a quad core machine which both have sychronized
TSCs. The TSCs seems not to be perfectly in sync though, but the
kernel is not able to detect the slight delta in the sync check. Still
there exists an extremly small window where this delta can be observed
with a real big time jump. So far I was only able to reproduce this
with the vsyscall gettimeofday implementation, but in theory this
might be observable with the syscall based version as well.

CPU 0 updates the clock source variables under xtime/vyscall lock and
CPU1, where the TSC is slighty behind CPU0, is reading the time right
after the seqlock was unlocked.

The clocksource reference data was updated with the TSC from CPU0 and
the value which is read from TSC on CPU1 is less than the reference
data. This results in a huge delta value due to the unsigned
subtraction of the TSC value and the reference value. This algorithm
can not be changed due to the support of wrapping clock sources like
pm timer.

The huge delta is converted to nanoseconds and added to xtime, which
is then observable by the caller. The next gettimeofday call on CPU1
will show the correct time again as now the TSC has advanced above the
reference value.

To prevent this TSC specific wreckage we need to compare the TSC value
against the reference value and return the latter when it is larger
than the actual TSC value.

I pondered to mark the TSC unstable when the readout is smaller than
the reference value, but this would render an otherwise good and fast
clocksource unusable without a real good reason.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:49 +02:00
Mark McLoughlin
c946c7de49 xen: Clear PG_pinned in release_{pt,pd}()
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: xen-devel@lists.xensource.com
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:48 +02:00
Mark McLoughlin
a684d69d15 xen: Do not pin/unpin PMD pages
i.e. with this simple test case:

    int fd = open("/dev/zero", O_RDONLY);
    munmap(mmap((void *)0x40000000, 0x1000_LEN, PROT_READ, MAP_PRIVATE, fd, 0), 0x1000);
    close(fd);

we currently get:

   kernel BUG at arch/x86/xen/enlighten.c:678!
   ...
   EIP is at xen_release_pt+0x79/0xa9
   ...
   Call Trace:
    [<c041da25>] ? __pmd_free_tlb+0x1a/0x75
    [<c047a192>] ? free_pgd_range+0x1d2/0x2b5
    [<c047a2f3>] ? free_pgtables+0x7e/0x93
    [<c047b272>] ? unmap_region+0xb9/0xf5
    [<c047c1bd>] ? do_munmap+0x193/0x1f5
    [<c047c24f>] ? sys_munmap+0x30/0x3f
    [<c0408cce>] ? syscall_call+0x7/0xb
    =======================

and xen complains:

  (XEN) mm.c:2241:d4 Mfn 1cc37 not pinned

Further details at:

  https://bugzilla.redhat.com/436453

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: xen-devel@lists.xensource.com
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:48 +02:00
Mark McLoughlin
f64337062c xen: refactor xen_{alloc,release}_{pt,pd}()
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: xen-devel@lists.xensource.com
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:48 +02:00
Pavel Machek
8f59610de2 x86, agpgart: scary messages are fortunately obsolete
Fix obsolete printks in aperture-64. We used not to handle missing
agpgart, but we handle it okay now.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:46 +02:00
Ingo Molnar
9c9b81f773 x86: print message if nmi_watchdog=2 cannot be enabled
right now if there's no CPU support for nmi_watchdog=2 we'll just
refuse it silently.

print a useful warning.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:45 +02:00
Ingo Molnar
4f14bdef41 x86: fix nmi_watchdog=2 on Pentium-D CPUs
implement nmi_watchdog=2 on this class of CPUs:

  cpu family      : 15
  model           : 6
  model name      : Intel(R) Pentium(R) D CPU 3.00GHz

the watchdog's ->setup() method is safe anyway, so if the CPU
cannot support it we'll bail out safely.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-04 18:36:45 +02:00
Roland McGrath
4ba51fd75c x86 ptrace: avoid unnecessary wrmsr
This avoids using wrmsr on MSR_IA32_DEBUGCTLMSR when it's not needed.
No wrmsr ever needs to be done if noone has ever used block stepping.

Without this change, using ptrace on 2.6.25 on an x86 KVM guest
will tickle KVM's missing support for the MSR and crash the guest
kernel.  Though host KVM is the buggy one, this makes for a regression
in the guest behavior from 2.6.24->2.6.25 that we can easily avoid.

I also corrected some bad whitespace.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-03 15:42:43 -07:00
Ken'ichi Ohmichi
629c8b4cdb vmcoreinfo: add the symbol "phys_base"
Fix the problem that makedumpfile sometimes fails on x86_64 machine.

This patch adds the symbol "phys_base" to a vmcoreinfo data.  The
vmcoreinfo data has the minimum debugging information only for dump
filtering.  makedumpfile (dump filtering command) gets it to distinguish
unnecessary pages, and makedumpfile creates a small dumpfile.

On x86_64 kernel which compiled with CONFIG_PHYSICAL_START=0x0 and
CONFIG_RELOCATABLE=y, makedumpfile fails like the following:

 # makedumpfile -d31 /proc/vmcore dumpfile
 The kernel version is not supported.
 The created dumpfile may be incomplete.
 _exclude_free_page: Can't get next online node.

 makedumpfile Failed.
 #

The cause is the lack of the symbol "phys_base" in a vmcoreinfo data.
If the symbol "phys_base" does not exist, makedumpfile considers an
x86_64 kernel as non relocatable.  As the result, makedumpfile
misunderstands the physical address where the kernel is loaded, and it
cannot translate a kernel virtual address to physical address correctly.

To fix this problem, this patch adds the symbol "phys_base" to a
vmcoreinfo data.

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-02 15:28:19 -07:00
Andrew Morton
9c312058b2 Avoid false positive warnings in kmap_atomic_prot() with DEBUG_HIGHMEM
I believe http://bugzilla.kernel.org/show_bug.cgi?id=10318 is a false
positive.  There's no way in which networking will be using highmem pages
here, so it won't be taking the KM_USER0 kmap slot, so there's no point in
performing these checks.

Cc: Pawel Staszewski <pstaszewski@artcom.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 [ Really sad.  We lose almost all real-life coverage of the debug tests
   with this patch. Now it will only report problems for the cases where
   people actually end up using a HIGHMEM page, not when they just _might_
   use one.    - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 13:08:14 -07:00
Rusty Russell
a6bd8e1303 lguest: comment documentation update.
Took some cycles to re-read the Lguest Journey end-to-end, fix some
rot and tighten some phrases.

Only comments change.  No new jokes, but a couple of recycled old jokes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-03-28 11:05:54 +11:00
Ingo Molnar
3085354de6 x86: prefetch fix #2
Linus noticed a second bug and an uncleanliness:

 - we'd return on any instruction fetch fault

 - we'd use both the value of 16 and the PF_INSTR symbol which are
   the same and make no sense

the cleanup nicely unifies this piece of logic.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 22:00:16 +01:00
Jeremy Fitzhardinge
2e8fe719b5 xen: fix UP setup of shared_info
We need to set up the shared_info pointer once we've mapped the real
shared_info into its fixmap slot.  That needs to happen once the general
pagetable setup has been done.  Previously, the UP shared_info was set
up one in xen_start_kernel, but that was left pointing to the dummy
shared info.  Unfortunately there's no really good place to do a later
setup of the shared_info in UP, so just do it once the pagetable setup
has been done.

[ Stable: needed in 2.6.24.x ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 16:08:45 +01:00
Jeremy Fitzhardinge
04c44a080d xen: fix RMW when unmasking events
xen_irq_enable_direct and xen_sysexit were using "andw $0x00ff,
XEN_vcpu_info_pending(vcpu)" to unmask events and test for pending ones
in one instuction.

Unfortunately, the pending flag must be modified with a locked operation
since it can be set by another CPU, and the unlocked form of this
operation was causing the pending flag to get lost, allowing the processor
to return to usermode with pending events and ultimately deadlock.

The simple fix would be to make it a locked operation, but that's rather
costly and unnecessary.  The fix here is to split the mask-clearing and
pending-testing into two instructions; the interrupt window between
them is of no concern because either way pending or new events will
be processed.

This should fix lingering bugs in using direct vcpu structure access too.

[ Stable: needed in 2.6.24.x ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 16:08:45 +01:00
Christoph Lameter
25e59881f1 x86: stricter check in follow_huge_addr()
The first page of the compound page is determined in follow_huge_addr()
but then PageCompound() only checks if the page is part of a compound page.
PageHead() allows checking if this is indeed the first page of the
compound.

Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 16:08:45 +01:00
Florian Fainelli
b2ef749720 rdc321x: GPIO routines bugfixes
This patch fixes the use of GPIO routines which are in the PCI
configuration space of the RDC321x, therefore reading/writing
to this space without spinlock protection can be problematic.

We also now request and free GPIOs and support the MGB100
board, previous code was very AR525W-centric.

Signed-off-by: Volker Weiss <volker@tintuc.de>
Signed-off-by: Florian Fainelli <florian.fainelli@telecomint.eu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 16:08:45 +01:00
Andrew Morton
d8d4f157b8 x86: ptrace.c: fix defined-but-unused warnings
arch/x86/kernel/ptrace.c:548: warning: 'ptrace_bts_get_size' defined but not used
arch/x86/kernel/ptrace.c:558: warning: 'ptrace_bts_read_record' defined but not used
arch/x86/kernel/ptrace.c:607: warning: 'ptrace_bts_clear' defined but not used
arch/x86/kernel/ptrace.c:617: warning: 'ptrace_bts_drain' defined but not used
arch/x86/kernel/ptrace.c:720: warning: 'ptrace_bts_config' defined but not used
arch/x86/kernel/ptrace.c:788: warning: 'ptrace_bts_status' defined but not used

Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-27 16:08:44 +01:00
Ingo Molnar
bc713dcf35 x86: fix prefetch workaround
some early Athlon XP's and Opterons generate bogus faults on prefetch
instructions. The workaround for this regressed over .24 - reinstate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-27 16:08:44 +01:00
Suresh Siddha
d546b67a94 x86: fix performance drop for glx
fix the 3D performance drop reported at:

   http://bugzilla.kernel.org/show_bug.cgi?id=10328

fb drivers are using ioremap()/ioremap_nocache(), followed by mtrr_add with
WC attribute. Recent changes in page attribute code made both
ioremap()/ioremap_nocache() mappings as UC (instead of previous UC-). This
breaks the graphics performance, as the effective memory type is UC instead
of expected WC.

The correct way to fix this is to add ioremap_wc() (which uses UC- in the
absence of PAT kernel support and WC with PAT) and change all the
fb drivers to use this new ioremap_wc() API.

We can take this correct and longer route for post 2.6.25. For now,
revert back to the UC- behavior for ioremap/ioremap_nocache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-26 22:23:41 +01:00
Yinghai Lu
76c324182b x86: fix trim mtrr not to setup_memory two times
we could call find_max_pfn() directly instead of setup_memory() to get
max_pfn needed for mtrr trimming.

otherwise setup_memory() is called two times... that is duplicated...

[ mingo@elte.hu: both Thomas and me simulated a double call to
  setup_bootmem_allocator() and can confirm that it is a real bug
  which can hang in certain configs. It's not been reported yet but
  that is probably due to the relatively scarce nature of
  MTRR-trimming systems. ]

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-26 22:23:41 +01:00
Andres Salomon
923a0cf82f x86: GEODE: add missing module.h include
On Wed, 26 Mar 2008 11:56:22 -0600
Jordan Crouse <jordan.crouse@amd.com> wrote:

> On 26/03/08 14:31 +0100, Stefan Pfetzing wrote:
> > Hello Jordan,
> >
> > I just tried to build your geodwdt driver for the geode watchdog. Therefore
> > I pulled your repository from http://git.infradead.org/geode.git (or more,
> > the git url).
> >
> > I tried to build the geodewdt driver as a module - which didn't work, and
> > it failed with the same problem as earlier mentioned on lkmk [1]. I also
> > checked the fix [2], but that seems to be already in your (or linus) tree -
> > and so I'm unsure what the problem is.
> >
> > [1] http://kerneltrap.org/mailarchive/linux-kernel/2008/2/17/884074
> > [2] http://kerneltrap.org/mailarchive/linux-kernel/2008/2/17/884174
> >
> > Building directly into the kernel seems to work.
> >
> > Maybe you have some idea?
>
> Hmm - that is strange.  Exporting the symbols should work.  I recommend
> starting over with a clean tree.
>
> CCing Andres - any thoughts?
>
> Jordan
>

Er, yeah.  The patch below should fix it.  This should probably go into
2.6.25.

Oops, EXPORT_SYMBOL_GPL wasn't being declared due to this header
being missing.

Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-26 22:23:40 +01:00
Stephan Diestelhorst
c6e8256a7b x86, cpufreq: fix Speedfreq-SMI call that clobbers ECX
I have found that using SMI to change the cpu's frequency on my DELL
Latitude L400 clobbers the ECX register in speedstep_set_state, causing
unneccessary retries because the "state" variable has changed silently (GCC
assumes it is still present in ECX).

play safe and avoid gcc caching any register across IO port accesses
that trigger SMIs.

Signed-off by: <Stephan.Diestelhorst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-26 22:23:40 +01:00
Yinghai Lu
475613b9e3 x86: fix memoryless node oops during boot
fix oops during boot reported in this thread:

  http://lkml.org/lkml/2008/2/6/65

enable booting on memoryless nodes.

Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-26 22:23:40 +01:00
Ingo Molnar
3c274c2909 x86: add dmi quirk for io_delay
reported by mereandor@gmail.com, in:

  http://bugzilla.kernel.org/show_bug.cgi?id=6307

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-26 22:23:40 +01:00
Randy Dunlap
1d3381ebf4 x86: convert mtrr/generic.c to kernel-doc
Convert function comment blocks to kernel-doc notation.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-26 22:23:40 +01:00
Avi Kivity
e48bb497b9 KVM: MMU: Fix memory leak on guest demand faults
While backporting 72dc67a696, a gfn_to_page()
call was duplicated instead of moved (due to an unrelated patch not being
present in mainline).  This caused a page reference leak, resulting in a
fairly massive memory leak.

Fix by removing the extraneous gfn_to_page() call.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
707a18a51d KVM: VMX: convert init_rmode_tss() to slots_lock
init_rmode_tss was forgotten during the conversion from mmap_sem to
slots_lock.

INFO: task qemu-system-x86:3748 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff8053d100>] __down_read+0x86/0x9e
 [<ffffffff8053fb43>] do_page_fault+0x346/0x78e
 [<ffffffff8053d235>] trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8053dcad>] error_exit+0x0/0xa9
 [<ffffffff8035a7a7>] copy_user_generic_string+0x17/0x40
 [<ffffffff88099a8a>] :kvm:kvm_write_guest_page+0x3e/0x5f
 [<ffffffff880b661a>] :kvm_intel:init_rmode_tss+0xa7/0xf9
 [<ffffffff880b7d7e>] :kvm_intel:vmx_vcpu_reset+0x10/0x38a
 [<ffffffff8809b9a5>] :kvm:kvm_arch_vcpu_setup+0x20/0x53
 [<ffffffff8809a1e4>] :kvm:kvm_vm_ioctl+0xad/0x1cf
 [<ffffffff80249dea>] __lock_acquire+0x4f7/0xc28
 [<ffffffff8028fad9>] vfs_ioctl+0x21/0x6b
 [<ffffffff8028fd75>] do_vfs_ioctl+0x252/0x26b
 [<ffffffff8028fdca>] sys_ioctl+0x3c/0x5e
 [<ffffffff8020b01b>] system_call_after_swapgs+0x7b/0x80

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
15aaa819e2 KVM: MMU: handle page removal with shadow mapping
Do not assume that a shadow mapping will always point to the same host
frame number.  Fixes crash with madvise(MADV_DONTNEED).

[avi: move after first printk(), add another printk()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Avi Kivity
4b1a80fa65 KVM: MMU: Fix is_rmap_pte() with io ptes
is_rmap_pte() doesn't take into account io ptes, which have the avail bit set.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Avi Kivity
5dc8326282 KVM: VMX: Restore tss even on x86_64
The vmx hardware state restore restores the tss selector and base address, but
not its length.  Usually, this does not matter since most of the tss contents
is within the default length of 0x67.  However, if a process is using ioperm()
to grant itself I/O port permissions, an additional bitmap within the tss,
but outside the default length is consulted.  The effect is that the process
will receive a SIGSEGV instead of transparently accessing the port.

Fix by restoring the tss length.  Note that i386 had this working already.

Closes bugzilla 10246.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Linus Torvalds
b9e76a0074 x86-32: Pass the full resource data to ioremap()
It appears that 64-bit PCI resources cannot possibly ever have worked on
x86-32 even when the RESOURCES_64BIT config option was set, because any
driver that tried to [pci_]ioremap() the resource would have been unable
to do so because the high 32 bits would have been silently dropped on
the floor by the ioremap() routines that only used "unsigned long".

Change them to use "resource_size_t" instead, which properly encodes the
whole 64-bit resource data if RESOURCES_64BIT is enabled.

Acked-by: H. Peter Anvin <hpa@kernel.org>
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 11:22:39 -07:00