startup_ipi_hook depends on CONFIG_X86_LOCAL_APIC, so move it to the
right part of the paravirt_ops initialization.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Fix section mismatch warnings in mtrr code.
Fix line length on one source line.
WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.data: from .text.get_mtrr_state after 'get_mtrr_state' (at offset 0x103)
WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text: from .text.get_mtrr_state after 'get_mtrr_state' (at offset 0x180)
WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text: from .text.get_mtrr_state after 'get_mtrr_state' (at offset 0x199)
WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text: from .text.get_mtrr_state after 'get_mtrr_state' (at offset 0x1c1)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:18 +02:00
Fernando Luis [** ISO-8859-1 charset **] VzquezCao
Use safe_apic_wait_icr_idle to check ICR idle bit if the vector is
NMI_VECTOR to avoid potential hangups in the event of crash when kdump
tries to stop the other CPUs.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:18 +02:00
Fernando Luis [** ISO-8859-1 charset **] VzquezCao
Implement __send_IPI_dest_field which can be used to send IPIs when the
"destination shorthand" field of the ICR is set to 00 (destination
field). Use it whenever possible.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:18 +02:00
Fernando Luis [** ISO-8859-1 charset **] VzquezCao
Implement __send_IPI_dest_field which can be used to send IPIs when the
"destination shorthand" field of the ICR is set to 00 (destination
field). Use it whenever possible.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
inquire_remote_apic is used for APIC debugging, so use
safe_apic_wait_icr_idle instead of apic_wait_icr_idle to avoid possible
lockups when APIC delivery fails.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
__inquire_remote_apic is used for APIC debugging, so use
safe_apic_wait_icr_idle instead of apic_wait_icr_idle to avoid possible
lockups when APIC delivery fails.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
The functionality provided by the new safe_apic_wait_icr_idle is being
open-coded all over "kernel/smpboot.c". Use safe_apic_wait_icr_idle
instead to consolidate code and ease maintenance.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
The functionality provided by the new safe_apic_wait_icr_idle is being
open-coded all over "kernel/smpboot.c". Use safe_apic_wait_icr_idle
instead to consolidate code and ease maintenance.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
apic_wait_icr_idle looks like this:
static __inline__ void apic_wait_icr_idle(void)
{
while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
cpu_relax();
}
The busy loop in this function would not be problematic if the
corresponding status bit in the ICR were always updated, but that does
not seem to be the case under certain crash scenarios. Kdump uses an IPI
to stop the other CPUs in the event of a crash, but when any of the
other CPUs are locked-up inside the NMI handler the CPU that sends the
IPI will end up looping forever in the ICR check, effectively
hard-locking the whole system.
Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:
"A local APIC unit indicates successful dispatch of an IPI by
resetting the Delivery Status bit in the Interrupt Command
Register (ICR). The operating system polls the delivery status
bit after sending an INIT or STARTUP IPI until the command has
been dispatched.
A period of 20 microseconds should be sufficient for IPI dispatch
to complete under normal operating conditions. If the IPI is not
successfully dispatched, the operating system can abort the
command. Alternatively, the operating system can retry the IPI by
writing the lower 32-bit double word of the ICR. This “time-out”
mechanism can be implemented through an external interrupt, if
interrupts are enabled on the processor, or through execution of
an instruction or time-stamp counter spin loop."
Intel's documentation suggests the implementation of a time-out
mechanism, which, by the way, is already being open-coded in some parts
of the kernel that tinker with ICR.
Create a apic_wait_icr_idle replacement that implements the time-out
mechanism and that can be used to solve the aforementioned problem.
AK: moved both functions out of line
AK: Added improved loop from Keith Owens
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
apic_wait_icr_idle looks like this:
static __inline__ void apic_wait_icr_idle(void)
{
while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
cpu_relax();
}
The busy loop in this function would not be problematic if the
corresponding status bit in the ICR were always updated, but that does
not seem to be the case under certain crash scenarios. Kdump uses an IPI
to stop the other CPUs in the event of a crash, but when any of the
other CPUs are locked-up inside the NMI handler the CPU that sends the
IPI will end up looping forever in the ICR check, effectively
hard-locking the whole system.
Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:
"A local APIC unit indicates successful dispatch of an IPI by
resetting the Delivery Status bit in the Interrupt Command
Register (ICR). The operating system polls the delivery status
bit after sending an INIT or STARTUP IPI until the command has
been dispatched.
A period of 20 microseconds should be sufficient for IPI dispatch
to complete under normal operating conditions. If the IPI is not
successfully dispatched, the operating system can abort the
command. Alternatively, the operating system can retry the IPI by
writing the lower 32-bit double word of the ICR. This “time-out”
mechanism can be implemented through an external interrupt, if
interrupts are enabled on the processor, or through execution of
an instruction or time-stamp counter spin loop."
Intel's documentation suggests the implementation of a time-out
mechanism, which, by the way, is already being open-coded in some parts
of the kernel that tinker with ICR.
Create a apic_wait_icr_idle replacement that implements the time-out
mechanism and that can be used to solve the aforementioned problem.
AK: moved both functions out of line
AK: added improved loop from Keith Owens
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
If our copy of the MTRRs of the BSP has RdMem or WrMem set, and
we are running on an AMD64/K8 system, the boot CPU must have had
MtrrFixDramEn and MtrrFixDramModEn set (otherwise our RDMSR would
have copied these bits cleared), so we set them on this CPU as well.
This allows us to keep the AMD64/K8 RdMem and WrMem bits in sync
across the CPUs of SMP systems in order to fullfill the duty of
system software to "initialize and maintain MTRR consistency
across all processors." as written in the AMD and Intel manuals.
If an WRMSR instruction fails because MtrrFixDramModEn is not
set, I expect that also the Intel-style MTRR bits are not updated.
AK: minor cleanup, moved MSR defines around
Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Note: This patch didn'nt need an update since it's initial post.
Some BIOSes may modify fixed-range MTRRs in SMM, e.g. when they
transition the system into ACPI mode, which is entered thru an SMI,
triggered by Linux in acpi_enable().
SMIs which cause that Linux is interrupted and BIOS code is
executed (which may change e.g. fixed-range MTRRs) in SMM may
be raised by an embedded system controller which is often found
in notebooks also at other occasions.
If we would not update our copy of the fixed-range MTRRs before
suspending to RAM or to disk, restore_processor_state() would
set the fixed-range MTRRs of the BSP using old backup values
which may be outdated and this could cause the system to fail
later during resume.
This patch ensures that our copy of the fixed-range MTRRs
is updated when saving the boot processor state on suspend
to disk and suspend to RAM.
In combination with other patches this allows to fix s2ram
and s2disk on the Acer Ferrari 1000 notebook and at least
s2disk on the Acer Ferrari 5000 notebook.
Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Applied fix by Andew Morton:
http://lkml.org/lkml/2007/4/8/88 - Fix `make headers_check'.
AMD and Intel x86 CPU manuals state that it is the responsibility of
system software to initialize and maintain MTRR consistency across
all processors in Multi-Processing Environments.
Quote from page 188 of the AMD64 System Programming manual (Volume 2):
7.6.5 MTRRs in Multi-Processing Environments
"In multi-processing environments, the MTRRs located in all processors must
characterize memory in the same way. Generally, this means that identical
values are written to the MTRRs used by the processors." (short omission here)
"Failure to do so may result in coherency violations or loss of atomicity.
Processor implementations do not check the MTRR settings in other processors
to ensure consistency. It is the responsibility of system software to
initialize and maintain MTRR consistency across all processors."
Current Linux MTRR code already implements the above in the case that the
BIOS does not properly initialize MTRRs on the secondary processors,
but the case where the fixed-range MTRRs of the boot processor are changed
after Linux started to boot, before the initialsation of a secondary
processor, is not handled yet.
In this case, secondary processors are currently initialized by Linux
with MTRRs which the boot processor had very early, when mtrr_bp_init()
did run, but not with the MTRRs which the boot processor uses at the
time when that secondary processors is actually booted,
causing differing MTRR contents on the secondary processors.
Such situation happens on Acer Ferrari 1000 and 5000 notebooks where the
BIOS enables and sets AMD-specific IORR bits in the fixed-range MTRRs
of the boot processor when it transitions the system into ACPI mode.
The SMI handler of the BIOS does this in SMM, entered while Linux ACPI
code runs acpi_enable().
Other occasions where the SMI handler of the BIOS may change bits in
the MTRRs could occur as well. To initialize newly booted secodary
processors with the fixed-range MTRRs which the boot processor uses
at that time, this patch saves the fixed-range MTRRs of the boot
processor before new secondary processors are started. When the
secondary processors run their Linux initialisation code, their
fixed-range MTRRs will be updated with the saved fixed-range MTRRs.
If CONFIG_MTRR is not set, we define mtrr_save_state
as an empty statement because there is nothing to do.
Possible TODOs:
*) CPU-hotplugging outside of SMP suspend/resume is not yet tested
with this patch.
*) If, even in this case, an AP never runs i386/do_boot_cpu or x86_64/cpu_up,
then the calls to mtrr_save_state() could be replaced by calls to
mtrr_save_fixed_ranges(NULL) and mtrr_save_state() would not be
needed.
That would need either verification of the CPU-hotplug code or
at least a test on a >2 CPU machine.
*) The MTRRs of other running processors are not yet checked at this
time but it might be interesting to syncronize the MTTRs of all
processors before booting. That would be an incremental patch,
but of rather low priority since there is no machine known so
far which would require this.
AK: moved prototypes on x86-64 around to fix warnings
Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
In this current implementation which is used in other patches,
mtrr_save_fixed_ranges() accepts a dummy void pointer because
in the current implementation of one of these patches, this
function may be called from smp_call_function_single() which
requires that this function takes a void pointer argument.
This function calls get_fixed_ranges(), passing mtrr_state.fixed_ranges
which is the element of the static struct which stores our current
backup of the fixed-range MTRR values which all CPUs shall be
using.
Because mtrr_save_fixed_ranges calls get_fixed_ranges after
kernel initialisation time, __init needs to be removed from
the declaration of get_fixed_ranges().
If CONFIG_MTRR is not set, we define mtrr_save_fixed_ranges
as an empty statement because there is nothing to do.
AK: Moved prototypes for x86-64 around to fix warnings
Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Three cleanups:
1: ELF notes are never mapped, so there's no need to have any access
flags in their phdr.
2: When generating them from asm, tell the assembler to use a SHT_NOTE
section type. There doesn't seem to be a way to do this from C.
3: Use ANSI rather than traditional cpp behaviour to stringify the
macro argument.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Otherwise non GPL modules cannot even do basic operations
like disabling interrupts anymore, which would be excessive.
Longer term should split the single structure up into
internal and external symbols and not export the internal
ones at all.
Signed-off-by: Andi Kleen <ak@suse.de>
The other symbols used to delineate the alt-instructions sections have the
form __foo/__foo_end. Rename parainstructions to match.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert VMI timer to use clock events, making it properly able to use the NO_HZ
infrastructure. On UP systems, with no local APIC, we just continue to route
these events through the PIT. On systems with a local APIC, or SMP, we provide
a single source interrupt chip which creates the local timer IRQ. It actually
gets delivered by the APIC hardware, but we don't want to use the same local
APIC clocksource processing, so we create our own handler here.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Dan Hecht <dhecht@vmware.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
Implement vmi_kmap_atomic_pte in terms of the backend set_linear_mapping
operation. The conversion is rather straighforward; call kmap_atomic
and then inform the hypervisor of the page mapping.
The _flush_tlb damage is due to macros being pulled in from highmem.h.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
No, just no. You do not use goto to skip a code block. You do not
return an obvious variable from a singly-inlined function and give
the function a return value. You don't put unexplained comments
about kmalloc in code which doesn't do dynamic allocation. And
you don't leave stray warnings around for no good reason.
Also, when possible, it is better to use block scoped variables
because gcc can sometime generate better code.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Remove spurious comments, headers and keywords from x86-64 bugs.[ch].
Use identify_boot_cpu()
AK: merged with other patch
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
head.S creates the very initial pagetable for the kernel. This just
maps enough space for the kernel itself, and an allocation bitmap.
The amount of mapped memory is rounded up to 4Mbytes, and so this
typically ends up mapping 8Mbytes of memory.
When booting, pagetable_init() needs to create mappings for all
lowmem, and the pagetables for these mappings are allocated from the
free pages around the kernel in low memory. If the number of
pagetable pages + kernel size exceeds head.S's initial mapping, it
will end up faulting on an unmapped page. This will only happen with
specific combinations of kernel size and memory size.
This patch makes sure that head.S also maps enough space to fit the
kernel pagetables as well as the kernel itself. It ends up using an
additional two pages of unreclaimable memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Fixes two problems with the GDT when compiling for uniprocessor:
- There's no percpu segment, so trying to load its selector into %fs fails.
Use a null selector instead.
- The real gdt needs to be loaded at some point. Do it in cpu_init().
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Currently x86 (similar to x84-64) has a special per-cpu structure
called "i386_pda" which can be easily and efficiently referenced via
the %fs register. An ELF section is more flexible than a structure,
allowing any piece of code to use this area. Indeed, such a section
already exists: the per-cpu area.
So this patch:
(1) Removes the PDA and uses per-cpu variables for each current member.
(2) Replaces the __KERNEL_PDA segment with __KERNEL_PERCPU.
(3) Creates a per-cpu mirror of __per_cpu_offset called this_cpu_off, which
can be used to calculate addresses for this CPU's variables.
(4) Simplifies startup, because %fs doesn't need to be loaded with a
special segment at early boot; it can be deferred until the first
percpu area is allocated (or never for UP).
The result is less code and one less x86-specific concept.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Xen wants a dedicated page for the GDT. I believe VMI likes it too.
lguest, KVM and native don't care.
Simple transformation to page-aligned "struct gdt_page".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
inflate_fixed and huft_build together use around 2.7k of stack. When
using 4k stacks, I saw stack overflows from interrupts arriving while
unpacking the root initrd:
do_IRQ: stack overflow: 384
[<c0106b64>] show_trace_log_lvl+0x1a/0x30
[<c01075e6>] show_trace+0x12/0x14
[<c010763f>] dump_stack+0x16/0x18
[<c0107ca4>] do_IRQ+0x6d/0xd9
[<c010202b>] xen_evtchn_do_upcall+0x6e/0xa2
[<c0106781>] xen_hypervisor_callback+0x25/0x2c
[<c010116c>] xen_restore_fl+0x27/0x29
[<c0330f63>] _spin_unlock_irqrestore+0x4a/0x50
[<c0117aab>] change_page_attr+0x577/0x584
[<c0117b45>] kernel_map_pages+0x8d/0xb4
[<c016a314>] cache_alloc_refill+0x53f/0x632
[<c016a6c2>] __kmalloc+0xc1/0x10d
[<c0463d34>] malloc+0x10/0x12
[<c04641c1>] huft_build+0x2a7/0x5fa
[<c04645a5>] inflate_fixed+0x91/0x136
[<c04657e2>] unpack_to_rootfs+0x5f2/0x8c1
[<c0465acf>] populate_rootfs+0x1e/0xe4
(This was under Xen, but there's no reason it couldn't happen on bare
hardware.)
This patch mallocs the local variables, thereby reducing the stack
usage to sane levels.
Also, up the heap size for the kernel decompressor to deal with the
extra allocation.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Tim Yamin <plasmaroo@gentoo.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
In shadow mode hypervisors, ptep_get_and_clear achieves the desired
purpose of keeping the shadows in sync by issuing a native_get_and_clear,
followed by a call to pte_update, which indicates the PTE has been
modified.
Direct mode hypervisors (Xen) have no need for this anyway, and will trap
the update using writable pagetables.
This means no hypervisor makes use of ptep_get_and_clear; there is no
reason to have it in the paravirt-ops structure. Change confusing
terminology about raw vs. native functions into consistent use of
native_pte_xxx for operations which do not invoke paravirt-ops.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
kunmap_atomic should flush any pending lazy mmu updates, mainly to be
consistent with kmap_atomic, and to preserve its normal behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space. These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.
Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.
This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings. Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.
[ Zach - vmi.c will need some attention after this patch. It wasn't
immediately obvious to me what needs to be done. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Back out the map_pt_hook to clear the way for kmap_atomic_pte.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
This patch adds a pv_op for flush_tlb_others. Linux running on native
hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
have a particular mm's pagetable entries cached in its TLB. This is
inefficient in a paravirtualized environment, since the hypervisor
knows which real CPUs actually contain cached mappings, which may be a
small subset of a guest's VCPUs.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Implement the actual patching machinery. paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:
- if the paravirt_op function is paravirt_nop, then patch nops
- if the paravirt_op function is a jmp target, then jmp to it
- if the paravirt_op function is callable and doesn't clobber too much
for the callsite, call it directly
paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch, will remove most of the expensive indirect calls
in favour of either a direct call or a pile of nops.
Backends may implement their own patcher, however. There are several
helper functions to help with this:
paravirt_patch_nop nop out a callsite
paravirt_patch_ignore leave the callsite as-is
paravirt_patch_call patch a call if the caller and callee
have compatible clobbers
paravirt_patch_jmp patch in a jmp
paravirt_patch_insns patch some literal instructions over
the callsite, if they fit
This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Acked-by: Ingo Molnar <mingo@elte.hu>
Fix a few clobbers to include the return register. The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.
Also, make sure that callsites which are used in contexts which don't
allow clobbers actually save and restore all clobberable registers.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure. This avoids having to maintain a separate
enum for patch site types.
Also, since the identifier is derived from the offset into
paravirt_ops, the offset can be derived from the identifier. This is
used to remove replicated information in the various callsite macros,
which has been a source of bugs in the past.
This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex - specifically because
it breaks the 1:1 relationship between identifiers and offsets. If
this operation turns out to be particularly beneficial, then the right
answer is to define a new entrypoint for it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Add hooks to allow a paravirt implementation to track the lifetime of
an mm. Paravirtualization requires three hooks, but only two are
needed in common code. They are:
arch_dup_mmap, which is called when a new mmap is created at fork
arch_exit_mmap, which is called when the last process reference to an
mm is dropped, which typically happens on exit and exec.
The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures. It's called when an mm is first used.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Normally when running in PAE mode, the 4th PMD maps the kernel address space,
which can be shared among all processes (since they all need the same kernel
mappings).
Xen, however, does not allow guests to have the kernel pmd shared between page
tables, so parameterize pgtable.c to allow both modes of operation.
There are several side-effects of this. One is that vmalloc will update the
kernel address space mappings, and those updates need to be propagated into
all processes if the kernel mappings are not intrinsically shared. In the
non-PAE case, this is done by maintaining a pgd_list of all processes; this
list is used when all process pagetables must be updated. pgd_list is
threaded via otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.
Normally the PAE pgd is only 4x64 byte entries large, but Xen requires the PAE
pgd to page aligned anyway, so this patch forces the pgd to be page
aligned+sized when the kernel pmd is unshared, to accomodate both these
requirements.
Also, since there may be several distinct kernel pmds (if the user/kernel
split is below 3G), there's no point in allocating them from a slab cache;
they're just allocated with get_free_page and initialized appropriately. (Of
course the could be cached if there is just a single kernel pmd - which is the
default with a 3G user/kernel split - but it doesn't seem worthwhile to add
yet another case into this code).
[ Many thanks to wli for review comments. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.
In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory. When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.
When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the constructon of the kernel's pagetable depends on the hypervisor.
In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary. Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.
In order to make this easier, kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable. This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.
A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables. This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.
And:
+From: "Eric W. Biederman" <ebiederm@xmission.com>
If we don't set the leaf page table entries it is quite possible that
will inherit and incorrect page table entry from the initial boot
page table setup in head.S. So we need to redo the effort here,
so we pick up PSE, PGE and the like.
Hypervisors like Xen require that their page tables be read-only,
which is slightly incompatible with our low identity mappings, however
I discussed this with Jeremy he has modified the Xen early set_pte
function to avoid problems in this area.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Add a set of accessors to pack, unpack and modify page table entries
(at all levels). This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries. For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and converting back to pphys address
when an entry is read.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Add a _paravirt_nop function for use as a stub for no-op operations,
and paravirt_nop #defined void * version to make using it easier
(since all its uses are as a void *).
This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
[mingo] but only as a cleanup of the current open-coded (void *) casts.
My problem with this is that it loses the types. Not that there is much
to check for, but still, this adds some assumptions about how function
calls look like
Remove CONFIG_DEBUG_PARAVIRT. When inlining code, this option
attempts to trash registers in the patch-site's "clobber" field, on
the grounds that this should find bugs with incorrect clobbers.
Unfortunately, the clobber field really means "registers modified by
this patch site", which includes return values.
Because of this, this option has outlived its usefulness, so remove
it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
On Thu, 2007-03-29 at 13:16 +0200, Andi Kleen wrote:
> Please clean it up properly with two structs.
Not sure about this, now I've done it. Running it here.
If you like it, I can do x86-64 as well.
==
lguest defines its own TSS struct because the "struct tss_struct"
contains linux-specific additions. Andi asked me to split the struct
in processor.h.
Unfortunately it makes usage a little awkward.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
I have a 4 socket AMD Operton system. The 2.6.18 kernel I have crashes
when there is no memory in node0.
AK: changed call to _nopanic
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Setting the DEBUG_SIG flag breaks compilation due to a wrong
struct access. Aditionally, it raises two warnings. This is one
patch to fix them all.
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
The .smp_altinstructions section and its corresponding symbols are
completely unused, so remove them.
Also, remove stray #ifdef __KENREL__ in asm-i386/alternative.h
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Let's allow page-alignment in general for per-cpu data (wanted by Xen, and
Ingo suggested KVM as well).
Because larger alignments can use more room, we increase the max per-cpu
memory to 64k rather than 32k: it's getting a little tight.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As a bug workaround bank 0 on K7s is normally disabled, but no need
to do that on other AMD CPUs.
Cc: davej@redhat.com
Signed-off-by: Andi Kleen <ak@suse.de>
Update documentation for i386 smp_call_function* functions.
As reported by Randy Dunlap <rdunlap@xenotime.net>
[ I've posted this before but it seems to have been lost along the way. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <rdunlap@xenotime.net>
(I hope Andi is the right one to Cc, otherwise please add, thanks!)
Use menuconfigs instead of menus, so the whole menu can be disabled at
once instead of going through all options.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andi Kleen <ak@suse.de>
It doesn't put the CPU into deeper sleep states, so it's better to use the standard
idle loop to save power. But allow to reenable it anyways for benchmarking.
I also removed the obsolete idle=halt on i386
Cc: andreas.herrmann@amd.com
Signed-off-by: Andi Kleen <ak@suse.de>
Most of asm-x86_64/bugs.h is code which should be in a C file, so put it there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Now that relocation of the VDSO for COMPAT_VDSO users is done at
runtime rather than compile time, it is possible to enable/disable
compat mode at runtime.
This patch allows you to enable COMPAT_VDSO mode with "vdso=2" on the
kernel command line, or via sysctl. (Switching on a running system
shouldn't be done lightly; any process which was relying on the compat
VDSO will be upset if it goes away.)
The COMPAT_VDSO config option still exists, but if enabled it just
makes vdso_enabled default to VDSO_COMPAT.
+From: Hugh Dickins <hugh@veritas.com>
Fix oops from i386-make-compat_vdso-runtime-selectable.patch.
Even mingetty at system startup finds it easy to trigger an oops
while reading /proc/PID/maps: though it has a good hold on the mm
itself, that cannot stop exit_mm() from resetting tsk->mm to NULL.
(It is usually show_map()'s call to get_gate_vma() which oopses,
and I expect we could change that to check priv->tail_vma instead;
but no matter, even m_start()'s call just after get_task_mm() is racy.)
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Some versions of libc can't deal with a VDSO which doesn't have its
ELF headers matching its mapped address. COMPAT_VDSO maps the VDSO at
a specific system-wide fixed address. Previously this was all done at
build time, on the grounds that the fixed VDSO address is always at
the top of the address space. However, a hypervisor may reserve some
of that address space, pushing the fixmap address down.
This patch does the adjustment dynamically at runtime, depending on
the runtime location of the VDSO fixmap.
[ Patch has been through several hands: Jan Beulich wrote the orignal
version; Zach reworked it, and Jeremy converted it to relocate phdrs
as well as sections. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
identify_cpu() is used to identify both the boot CPU and secondary
CPUs, but it performs some actions which only apply to the boot CPU.
Those functions are therefore really __init functions, but because
they're called by identify_cpu(), they must be marked __cpuinit.
This patch splits identify_cpu() into identify_boot_cpu() and
identify_secondary_cpu(), and calls the appropriate init functions
from each. Also, identify_boot_cpu() and all the functions it
dominates are marked __init.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Most of asm-i386/bugs.h is code which should be in a C file, so put it there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Under CONFIG_DISCONTIGMEM, assuming that a !pfn_valid() implies all
subsequent pfn-s are also invalid is wrong. Thus replace this by
explicitly checking against the E820 map.
AK: make e820 on x86-64 not initdata
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
machine_ops is an interface for the machine_* functions defined in
<linux/reboot.h>. This is intended to allow hypervisors to intercept
the reboot process, but it could be used to implement other x86
subarchtecture reboots.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Add a smp_ops interface. This abstracts the API defined by
<linux/smp.h> for use within arch/i386. The primary intent is that it
be used by a paravirtualizing hypervisor to implement SMP, but it
could also be used by non-APIC-using sub-architectures.
This is related to CONFIG_PARAVIRT, but is implemented unconditionally
since it is simpler that way and not a highly performance-sensitive
interface.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Now we have an explicit per-cpu GDT variable, we don't need to keep the
descriptors around to use them to find the GDT: expose cpu_gdt directly.
We could go further and make load_gdt() pack the descriptor for us, or even
assume it means "load the current cpu's GDT" which is what it always does.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 5e518d7672 did the same change to
i386's variant.
With this change, i386's and x86-64's versions are identical, raising
the question whether the x86-64 one should go (just like there's only
one instance of edd.S).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch touches the NMI watchdog every MAX_ORDER_NR_PAGES
to inhibit the machine from triggering an NMI while the CPUs
are locked. This situation is happening on boxes with more
than 64CPUs and 128GB of RAM when Alt-SysRq-m is performed.
It has been succesfully tested for regression on uni, 2, 4, 8
32, and 64 CPU boxes with various memory configuration.
Signed-off-by: Andi Kleen <ak@suse.de>
Current vsyscall_gtod_data is large (3 or 4 cache lines dirtied at timer
interrupt). We can shrink it to exactly 64 bytes (1 cache line on AMD64)
Instead of copying a whole struct clocksource, we copy only needed fields.
I deleted an unused field : offset_base
This patch fixes one oddity in vgettimeofday(): It can returns a timeval with
tv_usec = 1000000. Maybe not a bug, but why not doing the right thing ?
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
There is a tiny probability that the return value from vtime(time_t *t) is
Signed-off-by: Andi Kleen <ak@suse.de>
different than the value stored in *t
Using a temporary variable solves the problem and gives a faster code.
17: 48 85 ff test %rdi,%rdi
1a: 48 8b 05 00 00 00 00 mov 0(%rip),%rax #
__vsyscall_gtod_data.wall_time_tv.tv_sec
21: 74 03 je 26
23: 48 89 07 mov %rax,(%rdi)
26: c9 leaveq
27: c3 retq
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Many years ago, UNEXPECTED_IO_APIC() contained printk()'s (but nothing more).
Now that it's completely empty for years, we can as well remove it.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
- there's no reason for duplicating the prototype from
include/linux/syscalls.h in include/asm-x86_64/unistd.h
- every file should #include the headers containing the prototypes for
it's global functions
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
x86_64 currently simulates a list using the index and private fields of the
page struct. Seems that the code was inherited from i386. But x86_64 does
not use the slab to allocate pgds and pmds etc. So the lru field is not
used by the slab and therefore available.
This patch uses standard list operations on page->lru to realize pgd
tracking.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On x86-64, kernel memory freed after init can be entirely unmapped instead
of just getting 'poisoned' by overwriting with a debug pattern.
On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
can also be write-protected.
Compared to the first version, this one prevents re-creating deleted
mappings in the kernel image range on x86-64, if those got removed
previously. This, together with the original changes, prevents temporarily
having inconsistent mappings when cacheability attributes are being
changed on such pages (e.g. from AGP code). While on i386 such duplicate
mappings don't exist, the same change is done there, too, both for
consistency and because checking pte_present() before using various other
pte_XXX functions is a requirement anyway. At once, i386 code gets
adjusted to use pte_huge() instead of open coding this.
AK: split out cpa() changes
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Fix various broken corner cases in i386 and x86-64 change_page_attr.
AK: split off from tighten kernel image access rights
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
paravirt.c used to implement native versions of all low-level
functions. Far cleaner is to have the native versions exposed in the
headers and as inline native_XXX, and if !CONFIG_PARAVIRT, then simply
#define XXX native_XXX.
There are several nice side effects:
1) write_dt_entry() now takes the correct "struct Xgt_desc_struct *"
not "void *".
2) load_TLS is reintroduced to the for loop, not manually unrolled
with a #error in case the bounds ever change.
3) Macros become inlines, with type checking.
4) Access to the native versions is trivial for KVM, lguest, Xen and
others who might want it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We now have cpu_init() and secondary_cpu_init() doing nothing but calling
_cpu_init() with the same arguments. Rename _cpu_init() to cpu_init() and use
it as a replcement for secondary_cpu_init().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now we are no longer dynamically allocating the GDT, we don't need the
"cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" to the
per-cpu GDT. This means initializing the cpu_gdt array in C.
The boot CPU uses the per-cpu var directly, then in smp_prepare_cpus() it
switches to the per-cpu copy just allocated. For secondary CPUs, the
early_gdt_descr is set to point directly to their per-cpu copy.
For UP the code is very simple: it keeps using the "per-cpu" GDT as per SMP,
but we never have to move.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allocating PDA and GDT at boot is a pain. Using simple per-cpu variables adds
happiness (although we need the GDT page-aligned for Xen, which we do in a
followup patch).
[akpm@linux-foundation.org: build fix]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Because the command line is increased to 2048 characters after 2.6.21, it's
not possible for boot loaders and userspace tools to determine the length
of the command line the kernel can understand. The benefit of knowing the
length is that users can be warned if the command line size is too long
which prevents surprise if things don't work after bootup.
This patch updates the boot protocol to contain a field called
"cmdline_size" that contain the length of the command line (excluding the
terminating zero).
The patch also adds missing fields (of protocol version 2.05) to the x86_64
setup code.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Alon Bar-Lev <alon.barlev@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The lguest patches somehow managed to trigger this:
In file included from arch/i386/lguest/lguest.c:38:
include/asm/asm-offsets.h:67:1: warning: "VDSO_PRELINK" redefined
In file included from include/linux/elf.h:7,
from include/linux/module.h:15,
from include/linux/device.h:21,
from include/linux/interrupt.h:15,
from arch/i386/lguest/lguest.c:27:
include/asm/elf.h:140:1: warning: this is the location of the previous definition
I assume that using the same identifier twice was a bad idea..
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
remove the reporting of the constant_tsc flag from the "power management"
field in /proc/cpuinfo. The NULL value there was replaced by "" because
the former would result in a printout of [8] if the flag is set.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Extends the numa=fake x86_64 command-line option to split the remaining system
memory into nodes of fixed size. Any leftover memory is allocated to a final
node unless the command-line ends with a comma.
For example:
numa=fake=2*512,*128 gives two 512M nodes and the remaining system
memory is split into nodes of 128M each.
This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the size of the remaining nodes to be allocated is
known based on their capacity for resource management.
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extends the numa=fake x86_64 command-line option to split the remaining
system memory into equal-sized nodes.
For example:
numa=fake=2*512,4* gives two 512M nodes and the remaining system
memory is split into four approximately equal
chunks.
This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the granularity with which nodes shall be allocated
is known.
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes. These nodes can be used in conjunction with cpusets for coarse
memory resource management.
The old command-line option is still supported:
numa=fake=32 gives 32 fake NUMA nodes, ignoring the NUMA setup of the
actual machine.
But now you may configure your system for the node sizes of your choice:
numa=fake=2*512,1024,2*256
gives two 512M nodes, one 1024M node, two 256M nodes, and
the rest of system memory to a sixth node.
The existing hash function is maintained to support the various node sizes
that are possible with this implementation.
Each node of the same size receives roughly the same amount of available
pages, regardless of any reserved memory with its address range. The total
available pages on the system is calculated and divided by the number of equal
nodes to allocate. These nodes are then dynamically allocated and their
borders extended until such time as their number of available pages reaches
the required size.
Configurable node sizes are recommended when used in conjunction with cpusets
for memory control because it eliminates the overhead associated with scanning
the zonelists of many smaller full nodes on page_alloc().
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix comments to represent the true number of quadwords in GDT.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch makes the needlessly global vmi_pmd_clear() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change mark_tsc_unstable() so it takes a string argument, which holds the
reason the TSC was marked unstable.
This is then displayed the first time mark_tsc_unstable is called.
This should help us better debug why the TSC was marked unstable on certain
systems and allow us to make sure we're not being overly paranoid when
throwing out this troublesome clocksource.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Work around a warning with -Wmissing-prototypes in
arch/i386/kernel/asm-offsets.c
The warning isn't gcc's fault - asm-offsets.c is simply a special file.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
clean up unneeded type cast by properly declare data type.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
o Modpost generates warnings for i386 if compiled with CONFIG_RELOCATABLE=y
WARNING: vmlinux - Section mismatch: reference to .init.text:find_unisys_acpi_oem_table from .text between 'acpi_madt_oem_check' (at offset 0xc0101eda) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:acpi_get_table_header_early from .text between 'acpi_madt_oem_check' (at offset 0xc0101ef0) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'acpi_madt_oem_check' (at offset 0xc0101f2e) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:setup_unisys from .text between 'acpi_madt_oem_check' (at offset 0xc0101f37) and 'enable_apic_mode'WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'mps_oem_check' (at offset 0xc0101ec7) and 'acpi_madt_oem_check'
WARNING: vmlinux - Section mismatch: reference to .init.text:es7000_sw_apic from .text between 'enable_apic_mode' (at offset 0xc0101f48) and 'check_apicid_present'
o Some functions which are inline (acpi_madt_oem_check) are not inlined by
compiler as these functions are accessed using function pointer. These
functions are put in .text section and they in-turn access __init type
functions hence modpost generates warnings.
o Do not iniline acpi_madt_oem_check, instead make it __init.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When compiling with -Os (which is default) the compiler defaults to it
anyways. And with -O2 it probably generates somewhat better (although
also larger) code.
Signed-off-by: Andi Kleen <ak@suse.de>
o This patch moves the code to verify long mode and SSE to a common file.
This code is now shared by trampoline.S, wakeup.S, boot/setup.S and
boot/compressed/head.S
o So far we used to do very limited check in trampoline.S, wakeup.S and
in 32bit entry point. Now all the entry paths are forced to do the
exhaustive check, including SSE because verify_cpu is shared.
o I am keeping this patch as last in the x86 relocatable series because
previous patches have got quite some amount of testing done and don't want
to distrub that. So that if there is problem introduced by this patch, at
least it can be easily isolated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Extend the bzImage protocol (same as i386) to allow bzImage loaders to
load the protected mode kernel at non-1MB address. Now protected mode
component is relocatable and can be loaded at non-1MB addresses.
o As of today kdump uses it to run a second kernel from a reserved memory
area.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o X86_64 kernel should run from 2MB aligned address for two reasons.
- Performance.
- For relocatable kernels, page tables are updated based on difference
between compile time address and load time physical address.
This difference should be multiple of 2MB as kernel text and data
is mapped using 2MB pages and PMD should be pointing to a 2MB
aligned address. Life is simpler if both compile time and load time
kernel addresses are 2MB aligned.
o Flag the error at compile time if one is trying to build a kernel which
does not meet alignment restrictions.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch modifies the x86_64 kernel so that it can be loaded and run
at any 2M aligned address, below 512G. The technique used is to
compile the decompressor with -fPIC and modify it so the decompressor
is fully relocatable. For the main kernel the page tables are
modified so the kernel remains at the same virtual address. In
addition a variable phys_base is kept that holds the physical address
the kernel is loaded at. __pa_symbol is modified to add that when
we take the address of a kernel symbol.
When loaded with a normal bootloader the decompressor will decompress
the kernel to 2M and it will run there. This both ensures the
relocation code is always working, and makes it easier to use 2M
pages for the kernel and the cpu.
AK: changed to not make RELOCATABLE default in Kconfig
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Currently __pa_symbol is for use with symbols in the kernel address
map and __pa is for use with pointers into the physical memory map.
But the code is implemented so you can usually interchange the two.
__pa which is much more common can be implemented much more cheaply
if it is it doesn't have to worry about any other kernel address
spaces. This is especially true with a relocatable kernel as
__pa_symbol needs to peform an extra variable read to resolve
the address.
There is a third macro that is added for the vsyscall data
__pa_vsymbol for finding the physical addesses of vsyscall pages.
Most of this patch is simply sorting through the references to
__pa or __pa_symbol and using the proper one. A little of
it is continuing to use a physical address when we have it
instead of recalculating it several times.
swapper_pgd is now NULL. leave_mm now uses init_mm.pgd
and init_mm.pgd is initialized at boot (instead of compile time)
to the physmem virtual mapping of init_level4_pgd. The
physical address changed.
Except for the for EMPTY_ZERO page all of the remaining references
to __pa_symbol appear to be during kernel initialization. So this
should reduce the cost of __pa in the common case, even on a relocated
kernel.
As this is technically a semantic change we need to be on the lookout
for anything I missed. But it works for me (tm).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o __pa() should be used only on kernel linearly mapped virtual addresses
and not on kernel text and data addresses.
o Hibernation code needs to determine the physical address associated
with kernel symbol to mark a section boundary which contains pages which
don't have to be saved and restored during hibernate/resume operation.
o Move this piece of code in arch dependent section. So that architectures
which don't have kernel text/data mapped into kernel linearly mapped
region can come up with their own ways of determining physical addresses
associated with a kernel text.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
With the rewrite of the SMP trampoline and the early page
allocator there is nothing that needs identity mapped pages,
once we start executing C code.
So add zap_identity_mappings into head64.c and remove
zap_low_mappings() from much later in the code. The functions
are subtly different thus the name change.
This also kills boot_level4_pgt which was from an earlier
attempt to move the identity mappings as early as possible,
and is now no longer needed. Essentially I have replaced
boot_level4_pgt with trampoline_level4_pgt in trampoline.S
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Moved wakeup_level4_pgt into the wakeup routine so we can
run the kernel above 4G.
o Now we first go to 64bit mode and continue to run from trampoline and
then then start accessing kernel symbols and restore processor context.
This enables us to resume even in relocatable kernel context when
kernel might not be loaded at physical addr it has been compiled for.
o Removed the need for modifying any existing kernel page table.
o Increased the size of the wakeup routine to 8K. This is required as
wake page tables are on trampoline itself and they got to be at 4K
boundary, hence one page is not sufficient.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Various cleanups. One of the main purpose of cleanups is that make
wakeup.S as close as possible to trampoline.S.
o Following are the changes
- Indentations for comments.
- Changed the gdt table to compact form and to resemble the
one in trampoline.S
- Take the jump to 32bit from real mode using ljmpl. Makes code
more readable.
- After enabling long mode, directly take a long jump for 64bit
mode. No need to take an extra jump to "reach_comaptibility_mode"
- Stack is not used after real mode. So don't load stack in
32 bit mode.
- No need to enable PGE here.
- No need to do extra EFER read, anyway we trash the read contents.
- No need to enable system call (EFER_SCE). Anyway it will be
enabled when original EFER is restored.
- No need to set MP, ET, NE, WP, AM bits in cr0. Very soon we will
reload the original cr0 while restroing the processor state.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Use appropriate names for 64bit regsiters.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Get rid of dead code in wakeup.S
o We never restore from saved_gdt, saved_idt, saved_ltd, saved_tss, saved_cr3,
saved_cr4, saved_cr0, real_save_gdt, saved_efer, saved_efer2. Get rid
of of associated code.
o Get rid of bogus_magic, bogus_31_magic and bogus_magic2. No longer being
used.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This modifies the SMP trampoline and all of the associated code so
it can jump to a 64bit kernel loaded at an arbitrary address.
The dependencies on having an idenetity mapped page in the kernel
page tables for SMP bootup have all been removed.
In addition the trampoline has been modified to verify
that long mode is supported. Asking if long mode is implemented is
down right silly but we have traditionally had some of these checks,
and they can't hurt anything. So when the totally ludicrous happens
we just might handle it correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
EFER varies like %cr4 depending on the cpu capabilities, and which cpu
capabilities we want to make use of. So save/restore it make certain
we have the same EFER value when we are done.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Move __KERNEL32_CS up into the unused gdt entry. __KERNEL32_CS is
used when entering the kernel so putting it first is useful when
trying to keep boot gdt sizes to a minimum.
Set the accessed bit on all gdt entries. We don't care
so there is no need for the cpu to burn the extra cycles,
and it potentially allows the pages to be immutable. Plus
it is confusing when debugging and your gdt entries mysteriously
change.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Use virtual addresses instead of physical addresses
in copy bootdata. In addition fix the implementation
of the old bootloader convention. Everything is
at real_mode_data always. It is just that sometimes
real_mode_data was relocated by setup.S to not sit at
0x90000.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
- Merge physmem_pgt and ident_pgt, removing physmem_pgt. The merge
is broken as soon as mm/init.c:init_memory_mapping is run.
- As physmem_pgt is gone don't export it in pgtable.h.
- Use defines from pgtable.h for page permissions.
- Fix the physical memory identity mapping so it is at the correct
address.
- Remove the physical memory mapping from wakeup_level4_pgt it
is at the wrong address so we can't possibly be usinging it.
- Simply NEXT_PAGE the work to calculate the phys_ alias
of the labels was very cool. Unfortuantely it was a brittle
special purpose hack that makes maitenance more difficult.
Instead just use label - __START_KERNEL_map like we do
everywhere else in assembly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Early in the boot process we need the ability to set
up temporary mappings, before our normal mechanisms are
initialized. Currently this is used to map pages that
are part of the page tables we are building and pages
during the dmi scan.
The core problem is that we are using the user portion of
the page tables to implement this. Which means that while
this mechanism is active we cannot catch NULL pointer dereferences
and we deviate from the normal ways of handling things.
In this patch I modify early_ioremap to map pages into
the kernel portion of address space, roughly where
we will later put modules, and I make the discovery of
which addresses we can use dynamic which removes all
kinds of static limits and remove the dependencies
on implementation details between different parts of the code.
Now alloc_low_page() and unmap_low_page() use
early_iomap() and early_iounmap() to allocate/map and
unmap a page.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
The dma_ops structure can be const since it never changes
after boot.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
smp_call_function and smp_call_function_single are almost complete
duplicates of the same logic. This patch combines them by
implementing them in terms of the more general
smp_call_function_mask().
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Stephane Eranian <eranian@hpl.hp.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@elte.hu>
Hi!
I sent this simple patch to lkml about two weeks ago and also cc'ed
to Linus, but seems that the patch got ignored. I decided to write to
you, because you have modified the relevant file most recently.
Below is a copy of the mail that is also available at
<http://lkml.org/lkml/2007/2/28/230>.
Signed-off-by: Andi Kleen <ak@suse.de>
The reboot_fixups stuff seems to be a bit of a mess, specifically the
header is in linux/ when its a purely i386-specific piece of code. I'm
not sure why it has its config option; its only currently needed for
"geode-gx1/cs5530a", so perhaps whatever config option controls that
hardware should enable this?
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
The kernel only supports gcc 3.2+ now so it doesn't make sense
anymore to explicitely check for options this compiler version
already has.
This actually fixes a bug. The -mprefered-stack-boundary check
never worked because gcc rightly complains
CC arch/i386/kernel/asm-offsets.s
cc1: -mpreferred-stack-boundary=2 is not between 4 and 12
We just never saw the error because of cc-options.
I changed it to 4 to actually work.
Tested by compiling i386 and x86-64 defconfig with gcc 3.2.
Should speed up the build time a tiny bit and improve
stack usage on i386 slightly.
Signed-off-by: Andi Kleen <ak@suse.de>
Change sysenter_setup to __cpuinit.
Change __INIT & __INITDATA to be cpu hotplug aware.
Resolve MODPOST warnings similar to:
WARNING: vmlinux - Section mismatch: reference to .init.text:sysenter_setup from
.text between 'identify_cpu' (at offset 0xc040a380) and 'detect_ht'
and
WARNING: vmlinux - Section mismatch: reference to .init.data:vsyscall_int80_end
from .text between 'sysenter_setup' (at offset 0xc041a269) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_int80_start from .text between 'sysenter_setup' (at offset
0xc041a26e) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_sysenter_end from .text between 'sysenter_setup' (at offset
0xc041a275) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_sysenter_start from .text between 'sysenter_setup' (at
offset 0xc041a27a) and 'enable_sep_cpu'
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Add __init to probe_bigsmp. All callers are __init and data being examined
is __initdata.
Resolves MODPOST warning similar to:
WARNING: vmlinux - Section mismatch: reference to .init.data: from .text between 'probe_bigsmp' (at offset 0xc0401e56) and 'init_apic_ldr'
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Hello,
This patch against 2.6.20-git14 makes the NMI watchdog use PERFSEL1/PERFCTR1
instead of PERFSEL0/PERFCTR0 on processors supporting Intel architectural
perfmon, such as Intel Core 2. Although all PMU events can work on
both counters, the Precise Event-Based Sampling (PEBS) requires that the
event be in PERFCTR0 to work correctly (see section 18.14.4.1 in the
IA32 SDM Vol 3b). This versions has 3 chunks compared to previous where
we had missed on check.
Changelog:
- make the x86-64 NMI watchdog use PERFSEL1/PERFCTR1 instead of PERFSEL0/PERFCTR0
on processors supporting the Intel architectural perfmon (e.g. Core 2 Duo).
This allows PEBS to work when the NMI watchdog is active.
signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Hello,
This patch against 2.6.20-git14 makes the NMI watchdog use PERFSEL1/PERFCTR1
instead of PERFSEL0/PERFCTR0 on processors supporting Intel architectural
perfmon, such as Intel Core 2. Although all PMU events can work on
both counters, the Precise Event-Based Sampling (PEBS) requires that the
event be in PERFCTR0 to work correctly (see section 18.14.4.1 in the
IA32 SDM Vol 3b).
A similar patch for x86-64 is to follow.
Changelog:
- make the i386 NMI watchdog use PERFSEL1/PERFCTR1 instead of PERFSEL0/PERFCTR0
on processors supporting the Intel architectural perfmon (e.g. Core 2 Duo).
This allows PEBS to work when the NMI watchdog is active.
signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
a userspace fault or a kernelspace fault which will result in the
immediate death of the process. They should not be filled in as a
result of a kernelspace fault which can be fixed up.
Otherwise, if the process is handling SIGSEGV and examining the fault
information, this can result in the kernel space fault trashing the
previously stored fault information if it arrives between the
userspace fault happening and the SIGSEGV being delivered to the process.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Jan Beulich <jbeulich@novell.com>
--
arch/i386/kernel/traps.c | 24 ++++++++++++++++++------
arch/x86_64/kernel/traps.c | 30 +++++++++++++++++++++++-------
2 files changed, 41 insertions(+), 13 deletions(-)
Remove the assumption that if the first page of a legacy ROM is mapped,
it'll all be mapped. This'll also stop people reading this code from
wondering if they're looking at a bug...
Signed-off-by: Rene Herman <rene.herman@gmail.com>
Signed-off-by: Martin Murray <murrayma@citi.umich.edu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The VIA C7 is a 686 (with TSC) that supports MMX, SSE and SSE2, it also has
a cache line length of 64 according to
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html. This patch sets
gcc to -march=686 and select s the correct cache shift.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eliminated the arch/i386/kernel/timers in 2.6.18, use clocksoures instead.
pit_latch_buggy was referred in timers/timer_tsc.c, and currently removed.
Therefore nobody refer it.
Until 2.6.17, MediaGX's TSC works correctly. after 2.6.18, warned "TSC
appears to be running slowly. Marking it as unstable". So marked unstable
TSC when CS55x0.
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Whether a region is below 1Mb is determined by its start rather than
its end.
This hunk got erroneously dropped from a previous patch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No need to use -traditional for processing asm in i386/kernel/
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Synchronize i386's smp_send_stop() with x86-64's in only try-locking
the call lock to prevent deadlocks when called from panic().
In both version, disable interrupts before clearing the CPU off the
online map to eliminate races with IRQ handlers inspecting this map.
Also in both versions, save/restore interrupts rather than disabling/
enabling them.
On x86-64, eliminate one function used here by folding it into its
single caller, convert to static, and rename for consistency with i386
(lkcd may like this).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Avoid including asm/vsyscall32.h in virtually every source file.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Move inclusion of asm/fixmap.h to where it is really used rather than
where it may have been used long ago (requires a few other adjustments
to includes due to previous implicit dependencies).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
- make the page table contents printing PAE capable
- make sure the address stored in current->thread.cr2 is unmodified
from what was read from CR2
- don't call oops_may_print() multiple times, when one time suffices
- print pte even in highpte case, as long as the pte page isn't in
actually in high memory (which is specifically the case for all page
tables covering kernel space)
(Changes to v3: Use sizeof()*2 rather than the suggested sizeof()*4 for
printing width, use fixed 16-nibble width for PAE, and also apply the
max_low_pfn range check to the middle level lookup on PAE.)
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Remove clustered APIC mode. There's little point in the use of clustered APIC
mode, broadcasting is limited to within the cluster only, and chipsets have
bugs in this area as well. So default to physical APIC mode when the CPU
count is large, and default to logical APIC mode when the CPU count is 8 or
smaller.
(this patch only removes the use of genapic_cluster and cleans up the
resulting genapic.c file - removal of all remaining traces of clustered
mode will be done by another patch.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Fix a couple of inconsistencies/problems I found while reviewing the x86_64
genapic code (when I was chasing mysterious eth0 timeouts that would only
trigger if CPU_HOTPLUG is enabled):
- AMD systems defaulted to the slower flat-physical mode instead
of the flat-logical mode. The only restriction on AMD systems
is that they should not use clustered APIC mode.
- removed the CPU hotplug hacks, switching the default for small
systems back from phys-flat to logical-flat. The switching to logical
flat mode on small systems fixed sporadic ethernet driver timeouts i
was getting on a dual-core Athlon64 system:
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timeout, status 0c 0005 c07f media 80.
eth0: Tx queue start entry 32 dirty entry 28.
eth0: Tx descriptor 0 is 0008a04a. (queue head)
eth0: Tx descriptor 1 is 0008a04a.
eth0: Tx descriptor 2 is 0008a04a.
eth0: Tx descriptor 3 is 0008a04a.
eth0: link up, 100Mbps, full-duplex, lpa 0xC5E1
- The use of '<= 8' was a bug by itself (the valid APIC ids
for logical flat mode go from 0 to 7, not 0 to 8). The new logic
is to use logical flat mode on both AMD and Intel systems, and
to only switch to physical mode when logical mode cannot be used.
If CPU hotplug is racy wrt. APIC shutdown then CPU hotplug needs
fixing, not the whole IRQ system be made inconsistent and slowed
down.
- minor cleanups: simplified some code constructs
build & booted on a couple of AMD and Intel SMP systems.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>