Commit Graph

35594 Commits

Author SHA1 Message Date
Izik Eidus
448353caea KVM: MMU: mark pages that were inserted to the shadow pages table as accessed
Mark guest pages as accessed when removed from the shadow page tables for
better lru processing.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:15 +02:00
Avi Kivity
eb9774f0d6 KVM: Remove misleading check for mmio during event injection
mmio was already handled in kvm_arch_vcpu_ioctl_run(), so no need to check
again.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:15 +02:00
Avi Kivity
f21b8bf4cc KVM: x86 emulator: address size and operand size overrides are sticky
Current implementation is to toggle, which is incorrect.  Patch ported from
corresponding Xen code.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:14 +02:00
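The toggle-versus-sticky difference described in f21b8bf4cc can be sketched as follows. This is an illustration only, not the emulator's decode loop: the function names are hypothetical, and only the 0x66 operand-size prefix is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not the emulator's real code. */

/* Toggle semantics: every 0x66 prefix flips the size between 2 and 4 bytes,
 * so two prefixes cancel each other out. */
unsigned op_bytes_toggle(const uint8_t *insn, size_t len, unsigned def_bytes)
{
    unsigned bytes = def_bytes;
    for (size_t i = 0; i < len && insn[i] == 0x66; i++)
        bytes ^= 6;                 /* 2 <-> 4 */
    return bytes;
}

/* Sticky semantics: any number of 0x66 prefixes selects the non-default
 * size, because the result is always computed from the default. */
unsigned op_bytes_sticky(const uint8_t *insn, size_t len, unsigned def_bytes)
{
    unsigned bytes = def_bytes;
    for (size_t i = 0; i < len && insn[i] == 0x66; i++)
        bytes = def_bytes ^ 6;
    return bytes;
}
```

With a single prefix both variants agree; they differ only when the same override prefix is repeated.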
Guillaume Thouvenin
90e0a28f6b KVM: x86 emulator: Make a distinction between repeat prefixes F3 and F2
cmps and scas instructions accept repeat prefixes F3 and F2. So in
order to emulate those prefixed instructions we need to be able to know
if prefixes are REP/REPE/REPZ or REPNE/REPNZ. Currently kvm doesn't make
this distinction. This patch introduces this distinction.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:14 +02:00
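Why the F3/F2 distinction in 90e0a28f6b matters for cmps and scas can be sketched with a small termination predicate. The enum and function names are hypothetical and this is not the emulator's code; it only encodes the architectural rule that REPE/REPZ stops when ZF is clear and REPNE/REPNZ stops when ZF is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the values are the prefix bytes themselves. */
enum demo_rep {
    DEMO_REP_NONE = 0,
    DEMO_REPE     = 0xF3,   /* REP / REPE / REPZ */
    DEMO_REPNE    = 0xF2    /* REPNE / REPNZ */
};

/* Returns true if a repeated cmps/scas should run another iteration.
 * count is the remaining (e)cx value after the decrement, zf is ZF as
 * left by the comparison that just executed. */
bool demo_rep_continues(enum demo_rep prefix, uint64_t count, bool zf)
{
    if (prefix == DEMO_REP_NONE || count == 0)
        return false;
    if (prefix == DEMO_REPE)
        return zf;          /* keep going while the operands are equal */
    return !zf;             /* REPNE: keep going while they differ */
}
```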
Zhang Xiantao
e9f85cde99 KVM: Portability: Move unalias_gfn to arch dependent file
Non-x86 archs don't need this mechanism. Move it to arch, and
keep its interface in common.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:14 +02:00
Sheng Yang
83ff3b9d4a KVM: VMX: Remove the secondary execute control dependency on irqchip
The state of SECONDARY_VM_EXEC_CONTROL shouldn't depend on the in-kernel
IRQ chip; this patch fixes that.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:14 +02:00
Dan Kenigsberg
0771671749 KVM: Enhance guest cpuid management
The current cpuid management suffers from several problems, which inhibit
passing through the host feature set to the guest:

 - No way to tell which features the host supports

  While some features can be supported with no changes to kvm, others
  need explicit support.  That means kvm needs to vet the feature set
  before it is passed to the guest.

 - No support for indexed or stateful cpuid entries

  Some cpuid entries depend on ecx as well as on eax, or on internal
  state in the processor (running cpuid multiple times with the same
  input returns different output).  The current cpuid machinery only
  supports keying on eax.

 - No support for save/restore/migrate

  The internal state above needs to be exposed to userspace so it can
  be saved or migrated.

This patch adds extended cpuid support by means of three new ioctls:

 - KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm)
   supports

 - KVM_SET_CPUID2: sets the vcpu's cpuid table

 - KVM_GET_CPUID2: gets the vcpu's cpuid table, including hidden state

[avi: fix original KVM_SET_CPUID not removing nx on non-nx hosts as it did
      before]

Signed-off-by: Dan Kenigsberg <danken@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:13 +02:00
Avi Kivity
6d4e4c4fca KVM: Disallow fork() and similar games when using a VM
We don't want the meaning of guest userspace changing under our feet.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:13 +02:00
Avi Kivity
76c35c6e99 KVM: MMU: Rename 'release_page'
Rename the awkwardly named variable.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:12 +02:00
Avi Kivity
4db3531487 KVM: MMU: Rename variables of type 'struct kvm_mmu_page *'
These are traditionally named 'page', but even more traditionally, that name
is reserved for variables that point to a 'struct page'.  Rename them to 'sp'
(for "shadow page").

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:12 +02:00
Avi Kivity
1d28f5f4a4 KVM: Remove gpa_to_hpa()
Converting last uses along the way.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:12 +02:00
Avi Kivity
0d81f2966a KVM: MMU: Remove gva_to_hpa()
No longer used.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
3f3e7124f6 KVM: MMU: Simplify nonpaging_map()
Instead of passing an hpa, pass a regular struct page.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
1755fbcc66 KVM: MMU: Introduce gfn_to_gpa()
Converting a frame number to an address is tricky since the data type changes
size.  Introduce a function to do it.  This fixes an actual bug when
accessing guest ptes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
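The kind of bug 1755fbcc66 refers to can be sketched as below. The types are assumptions chosen to make the truncation visible (a 32-bit frame number and a 64-bit address); the kernel's actual typedefs may differ, and the demo_ names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12

typedef uint32_t demo_gfn_t;   /* assumed narrow frame-number type */
typedef uint64_t demo_gpa_t;   /* assumed wide physical-address type */

/* Open-coded conversion: the shift happens in 32 bits, so frame numbers
 * at or above 4 GiB / 4 KiB lose their high bits before widening. */
demo_gpa_t demo_gfn_to_gpa_buggy(demo_gfn_t gfn)
{
    return (uint32_t)(gfn << DEMO_PAGE_SHIFT);
}

/* Helper-style conversion: widen first, then shift. */
demo_gpa_t demo_gfn_to_gpa(demo_gfn_t gfn)
{
    return (demo_gpa_t)gfn << DEMO_PAGE_SHIFT;
}
```

Putting the cast in one helper means call sites cannot get the order of widening and shifting wrong.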
Avi Kivity
38c335f1f5 KVM: MMU: Adjust page_header_update_slot() to accept a gfn instead of a gpa
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
230c9a8f23 KVM: MMU: Merge set_pte() and set_pte_common()
Since set_pte() is now the only caller of set_pte_common(), merge the two
functions.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
050e64992f KVM: MMU: Remove set_pde()
It is now identical to set_pte().

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
4e542370c7 KVM: MMU: Remove extra gaddr parameter from set_pte_common()
Similar information is available in the gfn parameter, so use that.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:11 +02:00
Avi Kivity
da928521b7 KVM: MMU: Move pse36 handling to the guest walker
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
Avi Kivity
5fb07ddb18 KVM: MMU: Introduce and use gpte_to_gfn()
Instead of repetitively open-coding this.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
Izik Eidus
b238f7bc2d KVM: MMU: Code cleanup
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
Avi Kivity
d835dfecd0 KVM: Don't bother the mmu if cr3 load doesn't change cr3
If the guest requests just a tlb flush, don't take the vm lock and
drop the mmu context pointlessly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
Avi Kivity
79539cec0c KVM: MMU: Avoid unnecessary remote tlb flushes when guest updates a pte
If all we're doing is increasing permissions on a pte (typical for demand
paging), then there's no need to flush remote tlbs.  Worst case they'll
get a spurious page fault.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
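The rule in 79539cec0c (no remote flush when a pte update only increases permissions) can be sketched as a predicate over the old and new pte values. The bit layout and all demo_ names are assumptions for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 style pte layout, for illustration only. */
#define DEMO_PTE_PRESENT  (1ULL << 0)
#define DEMO_PTE_WRITABLE (1ULL << 1)
#define DEMO_PTE_USER     (1ULL << 2)
#define DEMO_PTE_NX       (1ULL << 63)
#define DEMO_PTE_ADDR     0x000ffffffffff000ULL
#define DEMO_PTE_PERM     (DEMO_PTE_PRESENT | DEMO_PTE_WRITABLE | \
                           DEMO_PTE_USER | DEMO_PTE_NX)

/* Returns true if other cpus may hold a stale translation that grants
 * more than the new pte does, so their tlbs must be flushed. */
bool demo_need_remote_flush(uint64_t old_pte, uint64_t new_pte)
{
    if (!(old_pte & DEMO_PTE_PRESENT))
        return false;               /* nothing could have been cached */
    if (!(new_pte & DEMO_PTE_PRESENT))
        return true;                /* mapping removed */
    if ((old_pte ^ new_pte) & DEMO_PTE_ADDR)
        return true;                /* now points at a different frame */

    /* NX is a restriction; invert it so that a set bit always means
     * "permission granted", then look for permissions that were dropped. */
    old_pte ^= DEMO_PTE_NX;
    new_pte ^= DEMO_PTE_NX;
    return (old_pte & ~new_pte & DEMO_PTE_PERM) != 0;
}
```

When the predicate returns false, a remote cpu with a stale, more restrictive entry at worst takes the spurious page fault the commit message mentions.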
Avi Kivity
0f74a24c59 KVM: Add statistic for remote tlb flushes
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:10 +02:00
Avi Kivity
e5a4c8cad9 KVM: MMU: Implement guest page fault bypass for nonpae
I spent an hour worrying why I see so many guest page faults on FC6 i386.
Turns out bypass wasn't implemented for nonpae.  Implement it so it doesn't
happen again.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Avi Kivity
26e5215fdc KVM: Split vcpu creation to avoid vcpu_load() before preemption setup
Split kvm_arch_vcpu_create() into kvm_arch_vcpu_create() and
kvm_arch_vcpu_setup(), enabling preemption notification between the two.
This means that we can now do vcpu_load() within kvm_arch_vcpu_setup().

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Zhang Xiantao
0de10343b3 KVM: Portability: Split kvm_set_memory_region() to have an arch callout
Move the !user_alloc case to kvm_arch to avoid unnecessary
code logic on non-x86 platforms.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Zhang Xiantao
3ad82a7e87 KVM: Recalculate mmu pages needed for every memory region change
Instead of incrementally changing the mmu cache size for every memory slot
operation, recalculate it from scratch.  This is simpler and safer.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Avi Kivity
6226686954 KVM: x86 emulator: prefetch up to 15 bytes of the instruction executed
Instead of fetching one byte at a time, prefetch 15 bytes (or until the next
page boundary) to avoid guest page table walks.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
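The "15 bytes or until the next page boundary" rule from 6226686954 amounts to a small length computation, sketched here. The function name and constants are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE    4096u   /* assumed page size */
#define DEMO_MAX_INSN_LEN 15u     /* longest x86 instruction */

/* Number of instruction bytes that can be fetched in one go starting at
 * linear_ip without crossing into the next page. */
unsigned demo_prefetch_len(uint64_t linear_ip)
{
    unsigned to_page_end =
        DEMO_PAGE_SIZE - (unsigned)(linear_ip & (DEMO_PAGE_SIZE - 1));

    return to_page_end < DEMO_MAX_INSN_LEN ? to_page_end : DEMO_MAX_INSN_LEN;
}
```

Stopping at the page boundary keeps the fetch within a single guest page, so one translation covers the whole prefetch.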
Avi Kivity
93a0039c8d KVM: x86 emulator: retire ->write_std()
Theoretically used to access memory known to be ordinary RAM, it was
never implemented.  It is questionable whether it is possible to implement
it correctly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Izik Eidus
b4231d6180 KVM: MMU: Selectively set PageDirty when releasing guest memory
Improve dirty bit setting for pages that kvm releases.  Until now, every
page we released was marked dirty; from now on, only pages that could
have been dirtied are marked dirty.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:09 +02:00
Izik Eidus
2065b3727e KVM: MMU: Fix potential memory leak with smp real-mode
When we map a page, we check whether some other vcpu mapped it for us and if
so, bail out.  But we should decrease the refcount on the page as we do so.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:08 +02:00
Hollis Blanchard
7faa8f6fcc KVM: Move misplaced comment
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:07 +02:00
Hollis Blanchard
d40ccc6246 KVM: Correct consistent typo: "destory" -> "destroy"
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:07 +02:00
Hollis Blanchard
00fc9f5ae5 KVM: Remove unused "rmap_overflow" variable
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:07 +02:00
Avi Kivity
971535ff65 KVM: MMU: Remove unused variable
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
Izik Eidus
3e021bf505 KVM: Simplify kvm_clear_guest_page()
Use kvm_write_guest_page() with empty_zero_page, instead of doing
kmap and memset.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
Izik Eidus
ec8d4eaefc KVM: MMU: Change guest pte access to kvm_{read,write}_guest()
Things are simpler and more regular this way.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
Jan Kiszka
15b00f32d5 KVM: VMX: Force seg.base == (seg.sel << 4) in real mode
Ensure that segment.base == segment.selector << 4 when entering the real
mode on Intel so that the CPU will not bark at us.  This fixes some old
protected mode demo from http://www.x86.org/articles/pmbasics/tspec_a1_doc.htm.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
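The invariant enforced by 15b00f32d5 is plain arithmetic and can be sketched as follows. The helper names are hypothetical; this is not the VMX code that fixes up the segment state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Real-mode segment base implied by a 16-bit selector. */
uint32_t demo_real_mode_base(uint16_t selector)
{
    return (uint32_t)selector << 4;
}

/* True if a (selector, base) pair satisfies base == selector << 4. */
bool demo_real_mode_seg_consistent(uint16_t selector, uint32_t base)
{
    return base == demo_real_mode_base(selector);
}
```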
Zhang Xiantao
54f1585a8d KVM: Portability: Move some function declarations to x86.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
Zhang Xiantao
ec6d273deb KVM: Move some static inline functions out from kvm.h into x86.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:06 +02:00
Zhang Xiantao
2b3ccfa0c5 KVM: Portability: Move vcpu regs enumeration definition to x86.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Zhang Xiantao
ea4a5ff80c KVM: Portability: Move struct kvm_x86_ops definition to x86.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Zhang Xiantao
cd6e8f87ef KVM: Portability: Move some macro definitions from kvm.h to x86.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Zhang Xiantao
56c6d28a9a KVM: Portability: MMU initialization and teardown split
Move out kvm_mmu init and exit functionality from kvm_main.c

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Zhang Xiantao
5bb064dcde KVM: Portability: Move kvm_vcpu_ioctl_get_dirty_log to arch-specific file
Meanwhile keep the interface in common, and leave as much logic in common
as possible.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Amit Shah
9327fd1195 KVM: Make unloading of FPU state when putting vcpu arch-independent
Instead of having each architecture do it individually, we
do this in the arch-independent code (just x86 as of now).

[avi: add svm to the mix, which was added to mainline during the
 2.6.24-rc process]

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:05 +02:00
Avi Kivity
4cee576493 KVM: MMU: Add some mmu statistics
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Avi Kivity
ba1389b7a0 KVM: Extend stats support for VM stats
This is in addition to the current virtual cpu statistics.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Avi Kivity
f2b5756bb3 KVM: Add instruction emulation statistics
2008-01-30 17:53:04 +02:00
Avi Kivity
f096ed8588 KVM: Add fpu_reload counter
Measure the number of times we switch the fpu state.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Avi Kivity
e1beb1d37c KVM: Replace 'light_exits' stat with 'host_state_reload'
This is a little more accurate (since it counts actual reloads, not potential
reloads), and reverses the sense of the statistic to measure a bad event like
most of the other stats (e.g. we want to minimize all counters).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Zhang Xiantao
d19a9cd275 KVM: Portability: Add two hooks to handle kvm_create and destroy vm
Add two arch hooks to handle kvm_create_vm and kvm_destroy_vm. Now, just
put io_bus init and destroy in common.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Zhang Xiantao
a16b043cc9 KVM: Remove __init attributes for kvm_init_debug and kvm_init_msr_list
Since their callers are not declared with __init.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:04 +02:00
Joe Perches
56919c5c97 KVM: Remove ptr comparisons to 0
Fix sparse warnings "Using plain integer as NULL pointer"

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Zhang Xiantao
8b0067913d KVM: Portability: Make kvm_vcpu_ioctl_translate arch dependent
Move kvm_vcpu_ioctl_translate to arch, since mmu would be put under arch.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Avi Kivity
e08aa78ae5 KVM: VMX: Consolidate register usage in vmx_vcpu_run()
We pass vcpu, vmx->fail, and vmx->launched to assembly code, but all three
are fields within vmx.  Consolidate by only passing in vmx and offsets for
the rest.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Zhang Xiantao
018d00d2fe KVM: Portability: move KVM_CHECK_EXTENSION
Make the KVM_CHECK_EXTENSION code into a function, so that each arch can
define its capabilities independently.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Sheng Yang
a7e6c88a78 KVM: x86 emulator: modify 'lods', and 'stos' not to depend on CR2
The current 'lods' and 'stos' emulation depends on the incoming CR2 rather
than decoding the memory address from registers.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Zhang Xiantao
f8c16bbaa9 KVM: Portability: Move x86 specific code from kvm_init() to kvm_arch()
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:03 +02:00
Zhang Xiantao
cb498ea2ce KVM: Portability: Combine kvm_init and kvm_init_x86
Will be called once arch module registers itself.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:02 +02:00
Zhang Xiantao
e9b11c1755 KVM: Portability: Add vcpu and hardware management arch hooks
Add the following hooks:

  void decache_vcpus_on_cpu(int cpu);
  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
  void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
  void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
  struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
  void kvm_arch_vcpu_destory(struct kvm_vcpu *vcpu);
  int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
  void kvm_arch_hardware_enable(void *garbage);
  void kvm_arch_hardware_disable(void *garbage);
  int kvm_arch_hardware_setup(void);
  void kvm_arch_hardware_unsetup(void);
  void kvm_arch_check_processor_compat(void *rtn);

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:02 +02:00
Zhang Xiantao
97896d04a1 KVM: Portability: Move kvm_x86_ops to x86.c
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:02 +02:00
Zhang Xiantao
d825ed0a97 KVM: Portability: Move some includes to x86.c
Move some includes to x86.c from kvm_main.c, since the related functions
have been moved to x86.c

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:02 +02:00
Izik Eidus
e0506bcba5 KVM: Change kvm_{read,write}_guest() to use copy_{from,to}_user()
This changes kvm_write_guest_page/kvm_read_guest_page to use
copy_to_user/copy_from_user.  As a result we get better speed
and better dirty bit tracking.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:02 +02:00
Izik Eidus
539cb6608c KVM: introduce gfn_to_hva()
Convert a guest frame number to the corresponding host virtual address.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Izik Eidus
f9d46eb0e4 KVM: add kvm_is_error_hva()
Check for the "error hva", an address outside the user address space that
signals a bad gfn.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
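How gfn_to_hva() (539cb6608c) and kvm_is_error_hva() (f9d46eb0e4) fit together can be sketched with a lookup that returns a sentinel for unmapped frames. Everything here is an assumption for illustration: the struct layout, the demo_ names, and the choice of UINT64_MAX as the sentinel are not taken from the kernel, which uses its own out-of-range address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_BAD_HVA    UINT64_MAX   /* assumed sentinel, never a valid hva */

/* Hypothetical memory slot: a run of guest frames backed by user memory. */
struct demo_memslot {
    uint64_t base_gfn;
    uint64_t npages;
    uint64_t userspace_addr;
};

bool demo_is_error_hva(uint64_t hva)
{
    return hva == DEMO_BAD_HVA;
}

/* Translate a guest frame number to a host virtual address, or return
 * the sentinel if no slot contains the frame. */
uint64_t demo_gfn_to_hva(const struct demo_memslot *slots, size_t nslots,
                         uint64_t gfn)
{
    for (size_t i = 0; i < nslots; i++) {
        const struct demo_memslot *s = &slots[i];

        if (gfn >= s->base_gfn && gfn - s->base_gfn < s->npages)
            return s->userspace_addr +
                   ((gfn - s->base_gfn) << DEMO_PAGE_SHIFT);
    }
    return DEMO_BAD_HVA;
}
```

Callers check the result with demo_is_error_hva() before touching the address, which keeps the error path out of the return type.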
Avi Kivity
1a6f4d7fbd KVM: Simplify CPU_TASKS_FROZEN cpu notifier handling
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Izik Eidus
906e608b05 KVM: x86 emulator: remove 8 bytes operands emulator for call near instruction
It is removed because it isn't supported on a real host.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Eddie Dong
e5edaa01c4 KVM: VMX: wbinvd exiting
Add wbinvd VM Exit support to prepare for pass-through
device cache emulation and also enhance real time
responsiveness.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Eddie Dong
8a70cc3d0f KVM: VMX: Comment VMX primary/secondary exec ctl definitions
Add comments for secondary/primary Processor-Based VM-execution controls.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Avi Kivity
9c8cba3761 KVM: Fix faults during injection of real-mode interrupts
If vmx fails to inject a real-mode interrupt while fetching the interrupt
redirection table, it fails to record this in the vectoring information
field.  So we detect this condition and do it ourselves.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:01 +02:00
Avi Kivity
1155f76a81 KVM: VMX: Read & store IDT_VECTORING_INFO_FIELD
We'll want to write to it in order to fix real-mode irq injection problems,
but it is a read-only field.  Storing it in a variable solves that issue.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Avi Kivity
9c5623e3e4 KVM: VMX: Use vmx to inject real-mode interrupts
Instead of injecting real-mode interrupts by writing the interrupt frame into
guest memory, abuse vmx by injecting a software interrupt.  We need to
pretend the software interrupt instruction had a length > 0, so we have to
adjust rip backward.

This lets us avoid writing guest memory, which is complex and also
sleeps.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Dor Laor
12264760e4 KVM: Add make_page_dirty() to kvm_clear_guest_page()
Every write access to guest pages should be tracked.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Hollis Blanchard
b6c7a5dccf KVM: Portability: Move x86 vcpu ioctl handlers to x86.c
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Hollis Blanchard
d075206073 KVM: Portability: Move x86 FPU handling to x86.c
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Hollis Blanchard
8776e5194f KVM: Portability: Move x86 instruction emulation code to x86.c
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:00 +02:00
Hollis Blanchard
417bc3041f KVM: Portability: Make exported debugfs data architecture-specific
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Avi Kivity
1c73ef6650 KVM: x86 emulator: Hoist modrm and abs decoding into separate functions
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Uri Lublin
3b6fff198c KVM: Make mark_page_dirty() work for aliased pages too.
Recommended by Izik Eidus.

Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Avi Kivity
9f1ef3f8f5 KVM: Simplify decode_register_operand() calling convention
Now that rex_prefix is part of the decode cache, there is no need to pass
it along.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Avi Kivity
33615aa956 KVM: x86 emulator: centralize decoding of one-byte register access insns
Instructions like 'inc reg' that have the register operand encoded
in the opcode are currently specially decoded.  Extend
decode_register_operand() to handle that case, indicated by having
DstReg or SrcReg without ModRM.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Avi Kivity
3c118e24af KVM: x86 emulator: Extract the common code of SrcReg and DstReg
Share the common parts of SrcReg and DstReg decoding.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Carsten Otte
de7d789acd KVM: Portability: Move pio emulation functions to x86.c
This patch moves implementation of the following functions from
kvm_main.c to x86.c:
free_pio_guest_pages, vcpu_find_pio_dev, pio_copy_data, complete_pio,
kernel_pio, pio_string_write, kvm_emulate_pio, kvm_emulate_pio_string

The function inject_gp, which was duplicated by yesterday's patch
series, is removed from kvm_main.c now because it is not needed anymore.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:59 +02:00
Carsten Otte
bbd9b64e37 KVM: Portability: Move x86 emulation and mmio device hook to x86.c
This patch moves the following functions from kvm_main.c to x86.c:
emulator_read/write_std, vcpu_find_pervcpu_dev, vcpu_find_mmio_dev,
emulator_read/write_emulated, emulator_write_phys,
emulator_write_emulated_onepage, emulator_cmpxchg_emulated,
get_segment_base, emulate_invlpg, emulate_clts, emulator_get/set_dr,
kvm_report_emulation_failure, emulate_instruction

The following data type is moved to x86.c:
struct x86_emulate_ops emulate_ops

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Carsten Otte
15c4a6406f KVM: Portability: Move kvm_get/set_msr[_common] to x86.c
This patch moves the implementation of the functions of kvm_get/set_msr,
kvm_get/set_msr_common, and set_efer from kvm_main.c to x86.c. The
definition of EFER_RESERVED_BITS is moved too.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Anthony Liguori
aab61cc0d2 KVM: Fix gfn_to_page() acquiring mmap_sem twice
KVM's nopage handler calls gfn_to_page() which acquires the mmap_sem when
calling out to get_user_pages().  nopage handlers are already invoked with the
mmap_sem held though.  Introduce a __gfn_to_page() for use by the nopage
handler which requires the lock to already be held.

This was noticed by tglx.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Sheng Yang
f78e0e2ee4 KVM: VMX: Enable memory mapped TPR shadow (FlexPriority)
This patch is based on the CR8/TPR patch and enables the TPR shadow
(FlexPriority) for 32bit Windows.  Since TPR is accessed very frequently by
32bit Windows, especially by SMP guests, we saw a significant performance
gain with FlexPriority enabled.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Carsten Otte
a03490ed29 KVM: Portability: Move control register helper functions to x86.c
This patch moves the definitions of CR0_RESERVED_BITS,
CR4_RESERVED_BITS, and CR8_RESERVED_BITS along with the following
functions from kvm_main.c to x86.c:
set_cr0(), set_cr3(), set_cr4(), set_cr8(), get_cr8(), lmsw(),
load_pdptrs()
The static function wrapper inject_gp is duplicated in kvm_main.c and
x86.c for now, the version in kvm_main.c should disappear once the last
user of it is gone too.
The function load_pdptrs is no longer static, and now defined in x86.h
for the time being, until the last user of it is gone from kvm_main.c.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Carsten Otte
6866b83ed7 KVM: Portability: move get/set_apic_base to x86.c
This patch moves the implementation of get_apic_base and set_apic_base
from kvm_main.c to x86.c

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:58 +02:00
Carsten Otte
5fb76f9be1 KVM: Portability: Move memory segmentation to x86.c
This patch moves the definition of segment_descriptor_64 for AMD64 and
EM64T from kvm_main.c to segment_descriptor.h. It also adds a proper
#ifndef...#define...#endif around that header file.
The implementation of segment_base is moved from kvm_main.c to x86.c.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Carsten Otte
1fe779f8ec KVM: Portability: Split kvm_vm_ioctl v3
This patch splits kvm_vm_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
The patch is unchanged since last submission.

Common ioctls for all architectures are:
KVM_CREATE_VCPU, KVM_GET_DIRTY_LOG, KVM_SET_USER_MEMORY_REGION

x86 specific ioctls are:
KVM_SET_MEMORY_REGION,
KVM_GET/SET_NR_MMU_PAGES, KVM_SET_MEMORY_ALIAS, KVM_CREATE_IRQCHIP,
KVM_CREATE_IRQ_LINE, KVM_GET/SET_IRQCHIP
KVM_SET_TSS_ADDR

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Avi Kivity
b733bfb524 KVM: MMU: Topup the mmu memory preallocation caches before emulating an insn
Emulation may cause a shadow pte to be instantiated, which requires
memory resources.  Make sure the caches are filled to avoid an oops.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Avi Kivity
3067714cf5 KVM: Move page fault processing to common code
The code that dispatches the page fault and emulates if we failed to map
is duplicated across vmx and svm.  Merge it to simplify further bugfixing.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Avi Kivity
c7e75a3db4 KVM: x86 emulator: don't depend on cr2 for mov abs emulation
The 'mov abs' instruction family (opcodes 0xa0 - 0xa3) still depends on cr2
provided by the page fault handler.  This is wrong for several reasons:

- if an instruction accessed misaligned data that crosses a page boundary,
  and if the fault happened on the second page, cr2 will point at the
  second page, not the data itself.

- if we're emulating in real mode, or due to a FlexPriority exit, there
  is no cr2 generated.

So, this change adds decoding for this instruction form and drops reliance
on cr2.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Laurent Vivier
fe7935d49f KVM: SVM: Let gcc to choose which registers to save (i386)
This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of AMD i386.

* Original code saves following registers:

    ebx, ecx, edx, esi, edi, ebp

* Patched code:

  - informs GCC that we modify following registers
    using the clobber description:

    ebx, ecx, edx, esi, edi

  - rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:57 +02:00
Laurent Vivier
54a08c0449 KVM: SVM: Let gcc to choose which registers to save (x86_64)
This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of AMD x86_64.

* Original code saves following registers:

    rbx, rcx, rdx, rsi, rdi, rbp,
    r8, r9, r10, r11, r12, r13, r14, r15

* Patched code:

  - informs GCC that we modify following registers
    using the clobber description:

    rbx, rcx, rdx, rsi, rdi
    r8, r9, r10, r11, r12, r13, r14, r15

  - rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Laurent Vivier
ff593e5abe KVM: VMX: Let gcc to choose which registers to save (i386)
This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of intel i386.

* Original code saves following registers:

    eax, ebx, ecx, edx, edi, esi, ebp (using popa)

* Patched code:

  - informs GCC that we modify following registers
    using the clobber description:

    ebx, edi, esi

  - doesn't save eax because it is an output operand (vmx->fail)

  - cannot put ecx in clobber description because it is an input operand,
    but as we modify it and we want to keep its value (vcpu), we must
    save it (pop/push)

  - ebp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

  - edx is saved (pop/push) because it is reserved by GCC (REGPARM) and
    cannot be put in the clobber description.

  - line "mov (%%esp), %3 \n\t" has been removed because %3
    is ecx and ecx is restored just after.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Laurent Vivier
c20363006a KVM: VMX: Let gcc to choose which registers to save (x86_64)
This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of Intel x86_64.

* Original code saves following registers:

    rax, rbx, rcx, rdx, rsi, rdi, rbp,
    r8, r9, r10, r11, r12, r13, r14, r15

* Patched code:

  - informs GCC that we modify following registers
    using the clobber description:

    rbx, rdi, rsi,
    r8, r9, r10, r11, r12, r13, r14, r15

  - doesn't save rax because it is an output operand (vmx->fail)

  - cannot put rcx in clobber description because it is an input operand,
    but as we modify it and we want to keep its value (vcpu), we must
    save it (pop/push)

  - rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

  - rdx is saved (pop/push) because it is reserved by GCC (REGPARM) and
    cannot be put in the clobber description.

  - line "mov (%%rsp), %3 \n\t" has been removed because %3
    is rcx and rcx is restored just after.

  - line ASM_VMX_VMWRITE_RSP_RDX() is moved out of the ifdef/else/endif

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Izik Eidus
cbc9402297 KVM: Add ioctl to set tss address from userspace
Currently kvm has a wart in that it requires three extra pages for use
as a tss when emulating real mode on Intel.  This patch moves the allocation
internally, only requiring userspace to tell us where in the physical address
space we can place the tss.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Izik Eidus
e0d62c7f48 KVM: Add kernel-internal memory slots
Reserve a few memory slots for kernel internal use.  This is useful when you
have to register a memory region and want to be sure it was not registered
from userspace, and when you want to register a memory region that won't be
visible to userspace.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Izik Eidus
210c7c4d7f KVM: Export memory slot allocation mechanism
Move the kvm memory slot allocation mechanism out of the ioctl
and into an exported function.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:56 +02:00
Izik Eidus
80b14b5b32 KVM: Unmap kernel-allocated memory on slot destruction
kvm_vm_ioctl_set_memory_region() is able to remove memory in addition to
adding it.  Therefore when using kernel swapping support for old userspaces,
we need to munmap the memory if the user requests to remove it.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:55 +02:00
Eddie Dong
8c392696e7 KVM: Split IOAPIC reset function and export for kernel RESET
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:55 +02:00
Eddie Dong
2fcceae145 KVM: Export PIC reset for kernel device reset
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:55 +02:00
Avi Kivity
60395224d9 KVM: Add a might_sleep() annotation to gfn_to_page()
This will help trap accesses to guest memory in atomic context.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:55 +02:00
Avi Kivity
e00c8cf29b KVM: Move vmx_vcpu_reset() out of vmx_vcpu_setup()
Split guest reset code out of vmx_vcpu_setup().  Besides being cleaner, this
moves the realmode tss setup (which can sleep) outside vmx_vcpu_setup()
(which is executed with preemption enabled).

[izik: remove unused variable]

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:55 +02:00
Zhang Xiantao
34c16eecf7 KVM: Portability: Split kvm_vcpu into arch dependent and independent parts (part 1)
First step to split kvm_vcpu.  Currently, we just use a macro to define
the common fields in kvm_vcpu for all archs, and each arch needs to define
its own kvm_vcpu struct.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Anthony Liguori
8d4e1288eb KVM: Allocate userspace memory for older userspace
Allocate a userspace buffer for older userspaces.  Also eliminate phys_mem
buffer.  The memset() in kvmctl really kills initial memory usage but swapping
works even with old userspaces.

A side effect is that the maximum guest size is reduced for older userspace on
i386.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Christian Borntraeger
e56a7a28e2 KVM: Use virtual cpu accounting if available for guest times.
ppc and s390 offer the possibility to track process times precisely
by looking at cpu timer on every context switch, irq, softirq etc.
We can use that infrastructure as well for guest time accounting.
We need to account the used time before we change the state.
This patch adds a call to account_system_vtime to kvm_guest_enter
and kvm_guest_exit. If CONFIG_VIRT_CPU_ACCOUNTING is not set,
account_system_vtime is defined in hardirq.h as an empty function,
which means this patch does not change the behaviour on other
platforms.

I compile tested this patch on x86 and function tested the patch on
s390.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Izik Eidus
8a7ae055f3 KVM: MMU: Partial swapping of guest memory
This allows guest memory to be swapped.  Pages which are currently mapped
via shadow page tables are pinned into memory, but all other pages can
be freely swapped.

The patch makes gfn_to_page() elevate the page's reference count, and
introduces kvm_release_page() that pairs with it.
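
The get/release pairing can be pictured with a toy model. This is only a
sketch under assumptions: struct toy_page and both helpers are hypothetical
stand-ins, not the kernel's struct page or the real functions, and the caller
is assumed to pass a valid gfn.

```c
#include <stddef.h>

/* Toy stand-in for struct page; only the reference count matters here. */
struct toy_page {
	int refcount;
};

/* Models gfn_to_page(): look the page up and take a reference on it. */
struct toy_page *toy_gfn_to_page(struct toy_page *pages, size_t gfn)
{
	pages[gfn].refcount++;
	return &pages[gfn];
}

/* Models kvm_release_page(): drop the reference taken by the lookup. */
void toy_release_page(struct toy_page *page)
{
	page->refcount--;
}
```

Every successful lookup must be balanced by exactly one release, otherwise
the page stays pinned and can never be swapped out.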

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Izik Eidus
cea7bb2128 KVM: MMU: Make gfn_to_page() always safe
In case the page is not present in the guest memory map, return a dummy
page the guest can scribble on.

This simplifies error checking in its users.
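
A minimal sketch of the idea, assuming a flat array as the guest memory map:
the lookup never returns NULL, and callers that care can compare against the
shared dummy page. All names here (demo_page, demo_bad_page and the two
functions) are hypothetical.

```c
#include <stddef.h>

struct demo_page {
	unsigned char data[16];
};

/* Shared dummy page handed out for frames outside the memory map. */
static struct demo_page demo_bad_page;

/* Always returns a usable pointer, so callers need no NULL check. */
struct demo_page *demo_gfn_to_page(struct demo_page *pages, size_t npages,
				   size_t gfn)
{
	if (gfn >= npages)
		return &demo_bad_page;
	return &pages[gfn];
}

/* Callers that must distinguish the dummy page can still do so. */
int demo_is_error_page(const struct demo_page *page)
{
	return page == &demo_bad_page;
}
```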

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Izik Eidus
9647c14c98 KVM: MMU: Keep a reverse mapping of non-writable translations
The current kvm mmu only reverse maps writable translations.  This is used
to write-protect a page in case it becomes a pagetable.

But with swapping support, we need a reverse mapping of read-only pages as
well:  when we evict a page, we need to remove any mapping to it, whether
writable or not.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Izik Eidus
98348e9507 KVM: MMU: Add rmap_next(), a helper for walking kvm rmaps
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:54 +02:00
Nitin A Kamble
b284be5764 KVM: x86 emulator: cmc, clc, cli, sti
Instruction: cmc, clc, cli, sti
opcodes: 0xf5, 0xf8, 0xfa, 0xfb respectively.
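
For orientation, the effect of these four opcodes on the flags register can
be sketched as follows. This is not the emulator code from the patch: the
function name and the FLAG_* macros are made up for the example, while the
bit positions (CF is bit 0, IF is bit 9) are the architectural ones.

```c
#include <stdint.h>

#define FLAG_CF (1u << 0)	/* carry flag */
#define FLAG_IF (1u << 9)	/* interrupt enable flag */

/* Returns 0 on success, -1 if the opcode is not one of the four. */
int emulate_flag_insn(uint8_t opcode, uint32_t *eflags)
{
	switch (opcode) {
	case 0xf5:			/* cmc: complement carry */
		*eflags ^= FLAG_CF;
		break;
	case 0xf8:			/* clc: clear carry */
		*eflags &= ~FLAG_CF;
		break;
	case 0xfa:			/* cli: clear interrupt flag */
		*eflags &= ~FLAG_IF;
		break;
	case 0xfb:			/* sti: set interrupt flag */
		*eflags |= FLAG_IF;
		break;
	default:
		return -1;
	}
	return 0;
}
```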

[avi: fix reference to EFLG_IF which is not defined anywhere]

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Avi Kivity
42bf3f0a1f KVM: MMU: Simplify page table walker
Simplify the walker level loop not to carry so much information from one
loop to the next.  In addition to being complex, this made kmap_atomic()
critical sections difficult to manage.

As a result of this change, kmap_atomic() sections are limited to actually
touching the guest pte, which allows the other functions called from the
walker to do sleepy operations.  This will happen when we enable swapping.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Nitin A Kamble
d77a25074a KVM: x86 emulator: Implement emulation of instruction: inc & dec
Instructions:
	inc r16/r32 (opcode 0x40-0x47)
	dec r16/r32 (opcode 0x48-0x4f)
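
In these one-byte forms the register is encoded in the low three bits of the
opcode and bit 3 selects inc versus dec. A small decoding sketch follows; the
struct and function names are hypothetical, and flag updates are left out.

```c
#include <stdint.h>

struct incdec_op {
	int is_dec;		/* 0 = inc, 1 = dec */
	unsigned int reg;	/* 0=eax 1=ecx 2=edx 3=ebx 4=esp 5=ebp 6=esi 7=edi */
};

/* Returns 0 and fills *op for opcodes 0x40-0x4f, -1 otherwise. */
int decode_incdec(uint8_t opcode, struct incdec_op *op)
{
	if (opcode < 0x40 || opcode > 0x4f)
		return -1;
	op->is_dec = (opcode & 0x08) != 0;
	op->reg = opcode & 0x07;
	return 0;
}
```

Note that in 64-bit mode the same byte range is taken by the REX prefixes, so
this decoding only applies outside long mode.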

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Avi Kivity
3176bc3e59 KVM: Rename KVM_TLB_FLUSH to KVM_REQ_TLB_FLUSH
We now have a new namespace, KVM_REQ_*, for bits in vcpu->requests.
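
A sketch of how such request bits are typically raised and consumed. The
names and the bit value are illustrative only, and this toy version is not
atomic, whereas kernel code would be expected to use atomic bit operations.

```c
#include <stdint.h>

enum { TOY_REQ_TLB_FLUSH = 0 };	/* bit number; the value is illustrative */

struct toy_vcpu {
	uint32_t requests;
};

/* Raise a request for the vcpu to act on before it next enters the guest. */
void toy_make_request(struct toy_vcpu *v, int req)
{
	v->requests |= 1u << req;
}

/* Test and clear: returns 1 once per raised request, 0 otherwise. */
int toy_check_request(struct toy_vcpu *v, int req)
{
	uint32_t mask = 1u << req;

	if (!(v->requests & mask))
		return 0;
	v->requests &= ~mask;
	return 1;
}
```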

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Avi Kivity
ab6ef34b90 KVM: Move apic timer interrupt backlog processing to common code
Beside the obvious goodness of making code more common, this prevents
a livelock with the next patch which moves interrupt injection out of the
critical section.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Laurent Vivier
e25e3ed56f KVM: Add some \n in ioapic_debug()
Add new-line at end of debug strings.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:53 +02:00
Qing He
e4d47f404b KVM: apic round robin cleanup
If no apic is enabled in the bitmap of an interrupt delivery with delivery
mode of lowest priority, a warning should be reported rather than selecting
a fallback vcpu.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie (Yaozu) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Carsten Otte
313a3dc75d KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.

Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.

x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS

An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Avi Kivity
c4fcc27246 KVM: MMU: When updating the dirty bit, inform the mmu about it
Since the mmu uses different shadow pages for dirty large pages and clean
large pages, this allows the mmu to drop ptes that are now invalid.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Avi Kivity
5df34a86f9 KVM: MMU: Move dirty bit updates to a separate function
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Avi Kivity
6bfccdc9ae KVM: MMU: Instantiate real-mode shadows as user writable shadows
This is consistent with real-mode permissions.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Avi Kivity
cc70e7374d KVM: MMU: Disable write access on clean large pages
By forcing clean huge pages to be read-only, we have separate roles
for the shadow of a clean large page and the shadow of a dirty large
page.  This is necessary because different ptes will be instantiated
for the two cases, even for read faults.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:52 +02:00
Avi Kivity
c22e3514fc KVM: MMU: Fix nx access bit for huge pages
We must set the bit before the shift, otherwise the wrong bit gets set.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Avi Kivity
e3c5e7ec9e KVM: Move guest pte dirty bit management to the guest pagetable walker
This is more consistent with the accessed bit management, and makes the dirty
bit available earlier for other purposes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Anthony Liguori
4a4c992487 KVM: MMU: More struct kvm_vcpu -> struct kvm cleanups
This time, the biggest change is gpa_to_hpa. The translation of GPA to HPA does
not depend on the VCPU state unlike GVA to GPA so there's no need to pass in
the kvm_vcpu.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Anthony Liguori
f67a46f4aa KVM: MMU: Clean up MMU functions to take struct kvm when appropriate
Some of the MMU functions take a struct kvm_vcpu even though they affect all
VCPUs.  This patch cleans up some of them to instead take a struct kvm.  This
makes things a bit more clear.

The main thing that was confusing me was whether certain functions need to be
called on all VCPUs.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Carsten Otte
043405e100 KVM: Move x86 msr handling to new files x86.[ch]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Izik Eidus
6fc138d227 KVM: Support assigning userspace memory to the guest
Instead of having the kernel allocate memory to the guest, let userspace
allocate it and pass the address to the kernel.

This is required for s390 support, but also enables features like memory
sharing and using hugetlbfs backed memory.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:51 +02:00
Mike Day
d77c26fce9 KVM: CodingStyle cleanup
Signed-off-by: Mike D. Day <ncmike@ncultra.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Rusty Russell
7e620d16b8 KVM: Remove gratuitous casts from lapic.c
Since vcpu->apic is of the correct type, there's no need to cast.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Rusty Russell
76fafa5e22 KVM: Hoist kvm_create_lapic() into kvm_vcpu_init()
Move kvm_create_lapic() into kvm_vcpu_init(), rather than having svm
and vmx do it.  And make it return the error rather than a fairly
random -ENOMEM.

This also solves the problem that neither svm.c nor vmx.c actually
handles the error path properly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Rusty Russell
d589444e92 KVM: Add kvm_free_lapic() to pair with kvm_create_lapic()
Instead of the asymmetry of kvm_free_apic, implement kvm_free_lapic().
And guess what?  I found a minor bug: we don't need to hrtimer_cancel()
from kvm_main.c, because we do that in kvm_free_apic().

Also:
1) kvm_vcpu_uninit should be the reverse order from kvm_vcpu_init.
2) Don't set apic->regs_page to zero before freeing apic.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Izik Eidus
82ce2c9683 KVM: Allow dynamic allocation of the mmu shadow cache size
The user is now able to set how many mmu pages will be allocated to the guest.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Izik Eidus
195aefde9c KVM: Add general accessors to read and write guest memory
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Izik Eidus
290fc38da8 KVM: Remove the usage of page->private field by rmap
When kvm uses user-allocated pages in the future for the guest, we won't
be able to use page->private for rmap, since page->private is reserved for
the filesystem.  So we move the rmap base pointers to the memory slot.

A side effect of this is that we need to store the gfn of each gpte in
the shadow pages, since the memory slot is addressed by gfn, instead of
hfn like struct page.
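
The gfn-indexed lookup this implies can be sketched as below. The struct
layout and names are assumptions made for the example, not the actual
kvm_memory_slot definition.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_gfn_t;

struct toy_memslot {
	toy_gfn_t base_gfn;	/* first guest frame covered by the slot */
	size_t npages;		/* number of frames in the slot */
	unsigned long *rmap;	/* one rmap head per guest frame */
};

/* Returns the rmap head for gfn, or NULL if the slot does not cover it. */
unsigned long *toy_gfn_to_rmap(struct toy_memslot *slot, toy_gfn_t gfn)
{
	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return NULL;
	return &slot->rmap[gfn - slot->base_gfn];
}
```

Since the index is derived from the gfn, code that starts from a shadow pte
has to know the gfn it maps, which is why the gfn is stored alongside it.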

Signed-off-by: Izik Eidus <izik@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Avi Kivity
f566e09fc2 KVM: VMX: Simplify vcpu_clear()
Now that smp_call_function_single() knows how to call a function on the
current cpu, there's no need to check explicitly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Avi Kivity
eae5ecb5b9 KVM: VMX: Don't clear the vmcs if the vcpu is not loaded on any processor
Noted by Eddie Dong.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Laurent Vivier
b4c6abfef4 KVM: x86 emulator: Any legacy prefix after a REX prefix nullifies its effect
This patch modifies the management of the REX prefix according to the
behavior I saw in Xen 3.1.  In Xen, this modification was introduced by
Jan Beulich.

http://lists.xensource.com/archives/html/xen-changelog/2007-01/msg00081.html
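
The rule in the subject line can be sketched as a prefix scan for a 64-bit
mode instruction stream: a REX byte only counts if no legacy prefix follows
it. This is an illustration with hypothetical function names, not the
emulator's decode loop.

```c
#include <stddef.h>
#include <stdint.h>

static int is_legacy_prefix(uint8_t b)
{
	switch (b) {
	case 0x26: case 0x2e: case 0x36: case 0x3e:	/* es, cs, ss, ds */
	case 0x64: case 0x65:				/* fs, gs */
	case 0x66: case 0x67:				/* operand/address size */
	case 0xf0: case 0xf2: case 0xf3:		/* lock, repne, rep */
		return 1;
	default:
		return 0;
	}
}

/*
 * Returns the effective REX byte (0 if none) and stores the offset of the
 * first non-prefix byte in *opcode_off.
 */
uint8_t scan_prefixes(const uint8_t *insn, size_t len, size_t *opcode_off)
{
	uint8_t rex = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (is_legacy_prefix(insn[i]))
			rex = 0;	/* a later legacy prefix cancels an earlier REX */
		else if ((insn[i] & 0xf0) == 0x40)
			rex = insn[i];	/* remember the candidate REX prefix */
		else
			break;		/* first non-prefix byte: the opcode */
	}
	*opcode_off = i;
	return rex;
}
```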

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Laurent Vivier
a22436b7b8 KVM: Purify x86_decode_insn() error case management
The only valid case is on protected page access, other cases are errors.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Qing He
e4f8e03956 KVM: x86_emulator: no writeback for bt
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Laurent Vivier
a01af5ec51 KVM: x86 emulator: Remove no_wb, use dst.type = OP_NONE instead
Remove no_wb, use dst.type = OP_NONE instead, idea stolen from xen-3.1

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:49 +02:00
Laurent Vivier
05f086f87e KVM: x86 emulator: remove _eflags and use directly ctxt->eflags.
Remove _eflags and use ctxt->eflags directly. Caching eflags is not needed as
it is restored to vcpu by kvm_main.c:emulate_instruction() from ctxt->eflags
only if emulation doesn't fail.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Laurent Vivier
8cdbd2c9bf KVM: x86 emulator: split some decoding into functions for readability
To improve readability, move push, writeback, and grp 1a/2/3/4/5/9 emulation
parts into functions.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Ryan Harper
217648638c KVM: MMU: Ignore reserved bits in cr3 in non-pae mode
This patch removes the fault injected when the guest attempts to set reserved
bits in cr3.  X86 hardware doesn't generate a fault when setting reserved bits.
The result of this patch is that vmware-server, running within a kvm guest,
boots and runs memtest from an iso.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Avi Kivity
12b7d28fc1 KVM: MMU: Make flooding detection work when guest page faults are bypassed
When we allow guest page faults to reach the guests directly, we lose
the fault tracking which allows us to detect demand paging.  So we provide
an alternate mechanism by clearing the accessed bit when we set a pte, and
checking it later to see if the guest actually used it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Avi Kivity
c7addb9020 KVM: Allow not-present guest page faults to bypass kvm
There are two classes of page faults trapped by kvm:
 - host page faults, where the fault is needed to allow kvm to install
   the shadow pte or update the guest accessed and dirty bits
 - guest page faults, where the guest has faulted and kvm simply injects
   the fault back into the guest to handle

The second class, guest page faults, is pure overhead.  We can eliminate
some of it on vmx using the following evil trick:
 - when we set up a shadow page table entry, if the corresponding guest pte
   is not present, set up the shadow pte as not present
 - if the guest pte _is_ present, mark the shadow pte as present but also
   set one of the reserved bits in the shadow pte
 - tell the vmx hardware not to trap faults which have the present bit clear

With this, normal page-not-present faults go directly to the guest,
bypassing kvm entirely.

Unfortunately, this trick only works on Intel hardware, as AMD lacks a
way to discriminate among page faults based on error code.  It is also
a little risky since it uses reserved bits which might become unreserved
in the future, so a module parameter is provided to disable it.
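
The trick can be pictured with a toy model of the initial shadow pte. This is
a sketch under assumptions: the message does not say which reserved bit is
used, so TOY_RESERVED_BIT is an arbitrary placeholder, and the function names
are hypothetical.

```c
#include <stdint.h>

#define PTE_PRESENT		(1ull << 0)
/* Placeholder: the message does not say which reserved bit is used. */
#define TOY_RESERVED_BIT	(1ull << 51)

/* Shadow pte to install before kvm has built the real translation. */
uint64_t toy_initial_shadow_pte(uint64_t guest_pte)
{
	if (!(guest_pte & PTE_PRESENT))
		return 0;	/* not present: the fault goes straight to the guest */
	/* Present plus a reserved bit: the access still faults, with P set. */
	return PTE_PRESENT | TOY_RESERVED_BIT;
}

/*
 * With the hardware told to ignore faults whose error code has the present
 * bit clear, only faults through "present" shadow ptes reach kvm.
 */
int toy_fault_traps_to_kvm(uint64_t shadow_pte)
{
	return (shadow_pte & PTE_PRESENT) != 0;
}
```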

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Avi Kivity
51c6cf662b KVM: VMX: Further reduce efer reloads
KVM avoids reloading the efer msr when the difference between the guest
and host values consists of the long mode bits (which are switched by
hardware) and the NX bit (which is emulated by the KVM MMU).

This patch also allows KVM to ignore SCE (syscall enable) when the guest
is running in 32-bit mode.  This is because the syscall instruction is
not available in 32-bit mode on Intel processors, so the SCE bit is
effectively meaningless.
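
The decision described above amounts to comparing the two values under a
mask. A minimal sketch, assuming the architectural EFER bit positions and
treating "32-bit mode" as LMA being clear in the guest value; the function
name is hypothetical and this is not the vmx.c code.

```c
#include <stdint.h>

#define EFER_SCE (1ull << 0)	/* syscall enable */
#define EFER_LME (1ull << 8)	/* long mode enable */
#define EFER_LMA (1ull << 10)	/* long mode active */
#define EFER_NX  (1ull << 11)	/* no-execute enable */

/* Returns 1 if guest and host differ in a bit that actually matters. */
int toy_need_efer_reload(uint64_t guest_efer, uint64_t host_efer)
{
	uint64_t ignore = EFER_LME | EFER_LMA | EFER_NX;

	/* Guest not in long mode: SCE has no effect there, ignore it too. */
	if (!(guest_efer & EFER_LMA))
		ignore |= EFER_SCE;

	return ((guest_efer ^ host_efer) & ~ignore) != 0;
}
```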

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:47 +02:00
Laurent Vivier
3427318fd2 KVM: Call x86_decode_insn() only when needed
Move emulate_ctxt to kvm_vcpu to keep emulate context when we exit from kvm
module. Call x86_decode_insn() only when needed. Modify x86_emulate_insn() to
not modify the context if it must be re-entered.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:47 +02:00
Laurent Vivier
1be3aa4718 KVM: emulate_instruction() calls now x86_decode_insn() and x86_emulate_insn()
emulate_instruction() now calls x86_decode_insn() and x86_emulate_insn().
x86_emulate_insn() is x86_emulate_memop() without the decoding part.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:47 +02:00
Laurent Vivier
8b4caf6650 KVM: x86 emulator: move all decoding process to function x86_decode_insn()
Split the decoding process into a new function x86_decode_insn().

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:47 +02:00
Laurent Vivier
e4e03deda8 KVM: x86 emulator: move all x86_emulate_memop() to a structure
Move all x86_emulate_memop() common variables between decode and execute to a
structure decode_cache.  This will help in later separating decode and
emulate.

            struct decode_cache {
                u8 twobyte;
                u8 b;
                u8 lock_prefix;
                u8 rep_prefix;
                u8 op_bytes;
                u8 ad_bytes;
                struct operand src;
                struct operand dst;
                unsigned long *override_base;
                unsigned int d;
                unsigned long regs[NR_VCPU_REGS];
                unsigned long eip;
                /* modrm */
                u8 modrm;
                u8 modrm_mod;
                u8 modrm_reg;
                u8 modrm_rm;
                u8 use_modrm_ea;
                unsigned long modrm_ea;
                unsigned long modrm_val;
           };

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:47 +02:00
Laurent Vivier
a7ddce3afc KVM: x86 emulator: remove unused functions
Remove #ifdef functions never used

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:46 +02:00
Anthony Liguori
7aa81cc047 KVM: Refactor hypercall infrastructure (v3)
This patch refactors the current hypercall infrastructure to better
support live migration and SMP.  It eliminates the hypercall page by
trapping the UD exception that would occur if you used the wrong hypercall
instruction for the underlying architecture and replacing it with the right
one lazily.
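
The lazy replacement can be sketched as a three-byte patch performed when the
wrong instruction raises #UD. The opcode bytes are the architectural
encodings of vmcall (Intel) and vmmcall (AMD); the function name and the
buffer-based interface are assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

static const uint8_t VMCALL_INSN[3]  = { 0x0f, 0x01, 0xc1 };	/* Intel */
static const uint8_t VMMCALL_INSN[3] = { 0x0f, 0x01, 0xd9 };	/* AMD */

/*
 * Conceptually run from the #UD handler: if the faulting bytes are the other
 * vendor's hypercall instruction, rewrite them in place.  Returns 1 if the
 * instruction was patched, 0 if it was something else.
 */
int toy_fix_hypercall(uint8_t *ip, int host_is_intel)
{
	const uint8_t *right = host_is_intel ? VMCALL_INSN : VMMCALL_INSN;
	const uint8_t *wrong = host_is_intel ? VMMCALL_INSN : VMCALL_INSN;

	if (memcmp(ip, wrong, 3) != 0)
		return 0;
	memcpy(ip, right, 3);
	return 1;
}
```

After the patch the guest re-executes the same address and the hypercall
succeeds, so the cost of the #UD is paid only once per call site.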

A fall-out of this patch is that the unhandled hypercalls no longer trap to
userspace.  There is very little reason though to use a hypercall to
communicate with userspace as PIO or MMIO can be used.  There is no code
in tree that uses userspace hypercalls.

[avi: fix #ud injection on vmx]

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:46 +02:00
Anthony Liguori
aca7f96600 KVM: x86 emulator: Add vmmcall/vmcall to x86_emulate (v3)
Add vmmcall/vmcall to x86_emulate.  Future patch will implement functionality
for these instructions.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:46 +02:00
Linus Torvalds
dd430ca20c Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: (890 commits)
  x86: fix nodemap_size according to nodeid bits
  x86: fix overlap between pagetable with bss section
  x86: add PCI IDs to k8topology_64.c
  x86: fix early_ioremap pagetable ops
  x86: use the same pgd_list for PAE and 64-bit
  x86: defer cr3 reload when doing pud_clear()
  x86: early boot debugging via FireWire (ohci1394_dma=early)
  x86: don't special-case pmd allocations as much
  x86: shrink some ifdefs in fault.c
  x86: ignore spurious faults
  x86: remove nx_enabled from fault.c
  x86: unify fault_32|64.c
  x86: unify fault_32|64.c with ifdefs
  x86: unify fault_32|64.c by ifdef'd function bodies
  x86: arch/x86/mm/init_32.c printk fixes
  x86: arch/x86/mm/init_32.c cleanup
  x86: arch/x86/mm/init_64.c printk fixes
  x86: unify ioremap
  x86: fixes some bugs about EFI memory map handling
  x86: use reboot_type on EFI 32
  ...
2008-01-31 00:40:09 +11:00
Linus Torvalds
60e233172e [net] Gracefully handle shared e1000/1000e driver PCI ID's
Both the old e1000 driver and the new e1000e driver can drive some
PCI-Express e1000 cards, and we should avoid ambiguity about which
driver will pick up the support for those cards when both drivers are
enabled.

This solves the problem by having the old driver support those cards if
the new driver isn't configured, but otherwise ceding support for PCI
Express versions of the e1000 chipset to the newer driver.  Thus
allowing both legacy configurations where only the old driver is active
(and handles all chips it knows about) and the new configuration with
the new driver handling the more modern PCIE variants.

Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-31 00:30:15 +11:00
Bernhard Kaindl
f212ec4b7b x86: early boot debugging via FireWire (ohci1394_dma=early)
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues with ACPI or other subsystems which are executed very early.

If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.

With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.

In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.

A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.

A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt

It also has links to all the tools which are available to make use of it.
Another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff

Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:34:11 +01:00
Ingo Molnar
5398f9854f x86: remove flush_agp_mappings()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:34:07 +01:00
Thomas Gleixner
d7c8f21a8c x86: cpa: move flush to cpa
The set_memory_* and set_pages_* family of API's currently requires the
callers to do a global tlb flush after the function call; forgetting this is
a very nasty deathtrap. This patch moves the global tlb flush into
each of the callers

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:34:07 +01:00
Arjan van de Ven
6d238cc4dc x86: convert CPA users to the new set_page_ API
This patch converts various users of change_page_attr() to the new,
more intent driven set_page_*/set_memory_* API set.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-30 13:34:06 +01:00
Yi Yang
53391fa20c cpufreq: fix obvious condition statement error
The function __cpufreq_set_policy in file drivers/cpufreq/cpufreq.c
has a very obvious error:

        if (policy->min > data->min && policy->min > policy->max) {
                ret = -EINVAL;
                goto error_out;
        }

This condition statement is wrong because it returns -EINVAL only if
policy->min is greater than policy->max (in this case,
"policy->min > data->min" is always true). In fact, it should
return -EINVAL as well if policy->max is less than data->min.

The correct condition should be:

	if (policy->min > data->max || policy->max < data->min) {
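
To make the difference concrete, both predicates can be written as small
functions and fed the same limits. The struct and function names are
hypothetical; only the two conditions are taken from the text above.

```c
/* Hypothetical container for a min/max frequency pair, in kHz. */
struct toy_limits {
	unsigned int min;
	unsigned int max;
};

/* Old condition: nonzero means the request is rejected with -EINVAL. */
int old_check_rejects(const struct toy_limits *policy,
		      const struct toy_limits *data)
{
	return policy->min > data->min && policy->min > policy->max;
}

/* Corrected condition: reject whenever the two ranges cannot overlap. */
int new_check_rejects(const struct toy_limits *policy,
		      const struct toy_limits *data)
{
	return policy->min > data->max || policy->max < data->min;
}
```

With data = {1596000, 2394000} and a request of {1596000, 0}, the old check
lets the request through while the corrected one rejects it.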

The following test results confirm the above conclusion:

Before applying this patch:

[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2394000 1596000
[root@yangyi-dev /]# echo 1596000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1596000
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
1596000
[root@yangyi-dev /]# echo "2000000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
1596000
[root@yangyi-dev /]# echo "0" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1596000
[root@yangyi-dev /]# echo "1595000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1596000
[root@yangyi-dev /]#

After applying this patch:

[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2394000 1596000
[root@yangyi-dev /]# echo 1596000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1596000
[root@yangyi-dev /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
1596000
[root@localhost /]# echo "2000000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
-bash: echo: write error: Invalid argument
[root@localhost /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
1596000
[root@localhost /]# echo "0" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
-bash: echo: write error: Invalid argument
[root@localhost /]# echo "1595000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
-bash: echo: write error: Invalid argument
[root@localhost /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1596000
[root@localhost /]# echo "1596000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@localhost /]# echo "2394000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
[root@localhost /]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
2394000
[root@localhost /]

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:34 +01:00
Yinghai Lu
3212bff370 x86: left over fix for leak of early_ioremap in dmi_scan
Signed-off-by: Yinghai Lu <yinghai@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:32 +01:00
Bernhard Walle
f8f76481bc rtc: use the IRQ callback interface in (old) RTC driver
the previous patch in the old RTC driver.  It also removes the direct
rtc_interrupt() call from arch/x86/kernel/hpet.c so that there's finally no
(code) dependency on CONFIG_RTC in arch/x86/kernel/hpet.c.

Because of this, it's possible to compile the drivers/char/rtc.ko driver as
a module and still use the HPET emulation functionality.  This is also expressed
in Kconfig.

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: Andi Kleen <ak@suse.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Robert Picco <Robert.Picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:31 +01:00
Ingo Molnar
0d64484f7e x86: fix DMI ioremap leak
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:09 +01:00
Andi Kleen
ddb25f9ac1 x86: don't disable TSC in any C states on AMD Fam10h
The ACPI code currently disables TSC use in any C2 and C3
states. But the AMD Fam10h BKDG documents that the TSC
will never stop in any C states when the CONSTANT_TSC bit is
set. Make this disabling conditional on CONSTANT_TSC
not set on AMD.

I actually think this is true on Intel too for C2 states
on CPUs with p-state invariant TSC, but this needs
further discussions with Len to really confirm :-)

So far it is only enabled on AMD.
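
The decision described above can be sketched roughly as follows. This is a
user-space illustration under stated assumptions, not the ACPI code: enum
cpu_vendor, struct cpu_info and tsc_unreliable_in_cstate() are hypothetical
names, and the constant_tsc field stands in for the CONSTANT_TSC bit.

```c
#include <stdbool.h>

enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

/* Hypothetical summary of the CPU properties the decision depends on. */
struct cpu_info {
    enum cpu_vendor vendor;
    bool constant_tsc;      /* stands in for the CONSTANT_TSC bit */
};

/* Returns true when TSC use should be disabled for the given C state. */
bool tsc_unreliable_in_cstate(const struct cpu_info *cpu, int cstate)
{
    /* Only C2 and deeper states trigger the disabling. */
    if (cstate < 2)
        return false;

    /* On AMD with CONSTANT_TSC set, the TSC keeps running in every C state. */
    if (cpu->vendor == VENDOR_AMD && cpu->constant_tsc)
        return false;

    /* Everything else keeps the old, conservative behaviour. */
    return true;
}
```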

Cc: lenb@kernel.org

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:41 +01:00
Andrew Morton
39657b6546 git-x86: drivers/pnp/pnpbios/bioscalls.c build fix
drivers/pnp/pnpbios/bioscalls.c:64: warning: (near initialization for 'bad_bios_desc.<anonymous>')

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:31 +01:00
Venki Pallipadi
bde6f5f59c x86: voluntary leave_mm before entering ACPI C3
Avoid TLB flush IPIs during C3 states by voluntarily calling leave_mm()
before entering C3.

The performance impact of TLB flush on C3 should not be significant with
respect to C3 wakeup latency. Also, CPUs tend to flush TLB in hardware while in
C3 anyways.

On an 8 logical CPU system, running make -j2, the number of tlbflush IPIs goes
down from 40 per second to ~0. The total number of interrupts during the run
of this workload was ~1200 per second, which makes it a ~3% savings in wakeups.

There was no measurable performance or power impact however.

[ akpm@linux-foundation.org: symbol export fixes. ]

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:01 +01:00
Parag Warudkar
79da472111 x86: fix DMI out of memory problems
People with HP desktops (including me) encounter a couple of DMI errors
during boot: "dmi_save_oem_strings_devices: out of memory" and
"dmi_string: out of memory".

On some HP desktops the DMI data includes OEM strings (type 11), of
which only a few are meaningful and most others are empty. The DMI code
religiously creates copies of these 27 strings (65 bytes each in my
case) and goes OOM in dmi_string().

If DMI_MAX_DATA is bumped up a little then it goes and fails in
dmi_save_oem_strings while allocating dmi_devices of sizeof(struct
dmi_device) corresponding to these strings.

On x86_64 since we cannot use alloc_bootmem this early, the code uses a
static array of 2048 bytes (DMI_MAX_DATA) for allocating the memory DMI
needs. It does not survive the creation of empty strings and devices.

Fix this by detecting empty strings and, instead of allocating them anew,
using one statically defined dmi_empty_string.

Also do not create a new struct dmi_device for each empty string: use
one statically defined dmi_device with .name=dmi_empty_string and add
that to the dmi_devices list.

On x86_64 this should stop the OOM with the current size of DMI_MAX_DATA,
and on x86 this should save a good amount of bootmem
(27*65 bytes + 27*sizeof(struct dmi_device)).

Compile and boot tested on both 32-bit and 64-bit x86.
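
The idea of sharing one static object for empty strings can be sketched as
below. This is a simplified user-space illustration, not the dmi_scan code:
pool_alloc(), save_string() and pool_bytes_used() are hypothetical names, and
"empty" is assumed here to mean zero-length, which may differ from how the
kernel recognizes an empty OEM string.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 2048              /* plays the role of DMI_MAX_DATA */

static char pool[POOL_SIZE];        /* static array used instead of bootmem */
static size_t pool_used;

static const char empty_string[] = "";  /* the single shared empty string */

/* Bump allocator over the static pool; returns NULL when it is exhausted. */
static void *pool_alloc(size_t len)
{
    void *p;

    if (len > POOL_SIZE - pool_used)
        return NULL;
    p = pool + pool_used;
    pool_used += len;
    return p;
}

/* Copy s into the pool, except that every empty string shares one object. */
const char *save_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy;

    if (len == 1)
        return empty_string;        /* no allocation for empty strings */

    copy = pool_alloc(len);
    if (copy)
        memcpy(copy, s, len);
    return copy;                    /* NULL signals "out of memory" */
}

/* Number of pool bytes handed out so far. */
size_t pool_bytes_used(void)
{
    return pool_used;
}
```

Saving any number of empty strings leaves the pool untouched, so only the
meaningful strings count against the 2048-byte budget.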

Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:59 +01:00
Glauber de Oliveira Costa
053de04441 x86: get rid of _MASK flags
There's no need for the *_MASK flags (TF_MASK, IF_MASK, etc.) found in
processor.h (both _32 and _64). They have a one-to-one mapping with the
EFLAGS values. This patch removes the definitions and uses the existing
X86_EFLAGS_ versions where applicable.

[ roland@redhat.com: KVM build fixes. ]

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:27 +01:00
Ingo Molnar
41e191e85a x86: replace outb_p() with udelay(2) in drivers/input/mouse/pc110pad.c
replace outb_p() with udelay(2). This is a real ISA device so it likely
needs this particular delay.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:24 +01:00
Glauber de Oliveira Costa
6b68f01baa x86: unify struct desc_ptr
This patch unifies struct desc_ptr between i386 and x86_64.
They can be expressed in the exact same way in C code, only
having to change the name of one of them. As Xgt_desc_struct
is ugly and big, this is the one that goes away.

There's also a padding field in i386, but it is not really
needed in the C structure definition.
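
As a rough sketch of why one definition can serve both architectures: with a
packed structure and an "unsigned long" base, the layout follows the native
word size. The field names below are assumptions for illustration, and the
packed attribute is GCC/Clang syntax.

```c
#include <assert.h>
#include <stddef.h>

/* Descriptor-table pointer: a 16-bit limit followed by the base address.
 * "unsigned long" is 32 bits on i386 and 64 bits on x86_64, so the same
 * C definition yields a 6-byte or 10-byte object respectively. */
struct desc_ptr {
    unsigned short size;
    unsigned long address;
} __attribute__((packed));

/* The base must follow the limit directly, with no padding in between. */
static_assert(offsetof(struct desc_ptr, address) == sizeof(unsigned short),
              "unexpected padding in struct desc_ptr");
```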

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:12 +01:00
Ingo Molnar
5fd1fe9c58 x86: clean up drivers/char/rtc.c
tons of style cleanup in drivers/char/rtc.c - no code changed:

   text    data     bss     dec     hex filename
   6400     384      32    6816    1aa0 rtc.o.before
   6400     384      32    6816    1aa0 rtc.o.after

since we seem to have a number of open breakages in this code we might
as well start with making the code more readable and maintainable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:09 +01:00
H. Peter Anvin
faca62273b x86: use generic register name in the thread and tss structures
This changes size-specific register names (eip/rip, esp/rsp, etc.) to
generic names in the thread and tss structures.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:02 +01:00
Thomas Gleixner
02456c708e x86: nuke a ton of dead hpet code
No users, just ballast

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-30 13:30:27 +01:00
Balaji Rao
37a47db8d7 x86: assign IRQs to HPET timers, fix
Looks like IRQ 31 is assigned to timer 3, even without the patch!
I wonder who wrote the number 31. But the manual says that it is
zero by default.

I think we should check whether the timer has been allocated an IRQ before
proceeding to assign one to it.  Here is a patch that does this.
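
A minimal sketch of that check, under stated assumptions: the interrupt-route
field is taken to be bits 13:9 of the timer configuration register, a value
of zero is treated as "no IRQ assigned" (the default mentioned above), and
choosing the lowest IRQ from the capability mask is a simplification.
pick_timer_irq() is a hypothetical helper, not a function from the patch.

```c
#include <stdint.h>

/* Interrupt-route field of a timer configuration register (bits 13:9). */
#define ROUTE_SHIFT 9
#define ROUTE_MASK  (0x1fULL << ROUTE_SHIFT)

/* Returns the IRQ already routed to the timer, or else the lowest IRQ
 * allowed by the capability mask, or -1 if no IRQ is available. */
int pick_timer_irq(uint64_t config, uint32_t route_cap)
{
    int irq = (int)((config & ROUTE_MASK) >> ROUTE_SHIFT);

    if (irq)
        return irq;             /* already assigned: keep it */

    for (irq = 0; irq < 32; irq++)
        if (route_cap & (UINT32_C(1) << irq))
            return irq;         /* first IRQ the timer can use */

    return -1;
}
```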

Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Tested-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:03 +01:00
Balaji Rao
e3f37a54f6 x86: assign IRQs to HPET timers
The userspace API for the HPET (see Documentation/hpet.txt) did not work. The
HPET_IE_ON ioctl was failing as there was no IRQ assigned to the timer
device. This patch fixes it by allocating IRQs to timer blocks in the HPET.

arch/x86/kernel/hpet.c |   13 +++++--------
drivers/char/hpet.c    |   45 ++++++++++++++++++++++++++++++++++++++-------
include/linux/hpet.h   |    2 +-
3 files changed, 44 insertions(+), 16 deletions(-)

Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:03 +01:00
Glauber de Oliveira Costa
84f12e39c8 lguest: use __PAGE_KERNEL instead of _PAGE_KERNEL
x86_64 doesn't expose the intermediate representation with a single
underscore, _PAGE_KERNEL, just the double-underscored one.

Use it, to get common ground between 32-bit and 64-bit.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:19 +11:00
Glauber de Oliveira Costa
ca94f2bdd1 lguest: Use explicit includes rather than indirect
explicitly use ktime.h include
explicitly use hrtimer.h include
explicitly use sched.h include

This patch adds headers explicitly to the lguest source files,
to avoid depending on them being included somewhere else.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:19 +11:00
Glauber de Oliveira Costa
382ac6b3fb lguest: get rid of lg variable assignments
We can save some lines of code by getting rid of
*lg = cpu... lines of code spread everywhere by now.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:18 +11:00
Glauber de Oliveira Costa
934faab464 lguest: change gpte_addr header
gpte_addr() does not depend on any guest information. So we wipe out
the lg parameter from it completely.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:18 +11:00
Glauber de Oliveira Costa
ae3749dcd8 lguest: move changed bitmap to lg_cpu
events represented in the 'changed' bitmap are per-cpu, not per-guest.
move it to the lg_cpu structure

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:17 +11:00
Glauber de Oliveira Costa
f34f8c5fea lguest: move last_pages to lg_cpu
in our new model, pages are assigned to a virtual cpu, not to a guest.
We move it to the lg_cpu structure.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:16 +11:00
Glauber de Oliveira Costa
c40a9f4719 lguest: change last_guest to last_cpu
in our model, a guest does not run in a cpu anymore: a virtual cpu
does. So we change last_guest to last_cpu

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:15 +11:00
Glauber de Oliveira Costa
2092aa277b lguest: change spte_addr header
spte_addr does not depend on any guest information, so we
wipe out the lg parameter completely.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:15 +11:00
Glauber de Oliveira Costa
1713608f28 lguest: per-vcpu lguest pgdir management
this patch makes the pgdir management per-vcpu. The pgdirs pool
is still guest-wide (although it'll probably need to grow when we
are really executing more vcpus), but the pgdidx index is gone,
since it makes no sense anymore. Instead, we use a per-vcpu
index.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:14 +11:00
Glauber de Oliveira Costa
5e232f4f42 lguest: make pending notifications per-vcpu
this patch makes the pending_notify field, used to control
pending notifications, per-vcpu, instead of per-guest

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:13 +11:00
Glauber de Oliveira Costa
4665ac8e28 lguest: makes special fields be per-vcpu
The lguest struct has some fields, namely cr2, ts, esp1
and ss1, that are not really guest-wide, but rather vcpu-wide.

This patch moves them to the vcpu struct.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:13 +11:00
Glauber de Oliveira Costa
66686c2ab0 lguest: per-vcpu lguest task management
lguest uses tasks to control its running behaviour (like sending
breaks, controlling halted state, etc). In a per-vcpu environment,
each vcpu will have its own underlying task. So this patch
makes the infrastructure for that possible

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:12 +11:00
Glauber de Oliveira Costa
fc708b3e40 lguest: replace lguest_arch with lg_cpu_arch.
The fields found in lguest_arch are not really per-guest,
but per-cpu (gdt, idt, etc). So this patch turns lguest_arch
into lg_cpu_arch.

It makes sense to have a per-guest per-arch struct, but this
can be addressed later, when the need arrives.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:11 +11:00
Glauber de Oliveira Costa
a53a35a8b4 lguest: make registers per-vcpu
This is the most obvious per-vcpu field: registers.

So this patch moves it from struct lguest to struct vcpu,
and patches the places in which they are used, accordingly.
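
The shape of the change can be sketched like this. All definitions are
hypothetical and heavily reduced: struct guest_regs, MAX_VCPUS and
init_vcpu() exist only for this illustration, and the per-vcpu structure is
called lg_cpu here, following the name used by other commits in this series.

```c
#include <stddef.h>

/* Hypothetical subset of the guest register set. */
struct guest_regs {
    unsigned long eip;
    unsigned long esp;
};

struct lguest;

/* Per-vcpu state: the registers now live here. */
struct lg_cpu {
    unsigned int id;
    struct lguest *lg;          /* back-pointer to the owning guest */
    struct guest_regs regs;
};

#define MAX_VCPUS 4             /* arbitrary limit for this sketch */

/* Guest-wide state: it no longer carries a register set of its own. */
struct lguest {
    unsigned int nr_cpus;
    struct lg_cpu cpus[MAX_VCPUS];
};

/* Set up vcpu number id of guest lg; returns NULL if id is out of range. */
struct lg_cpu *init_vcpu(struct lguest *lg, unsigned int id,
                         unsigned long start_eip)
{
    struct lg_cpu *cpu;

    if (id >= MAX_VCPUS)
        return NULL;

    cpu = &lg->cpus[id];
    cpu->id = id;
    cpu->lg = lg;
    cpu->regs.eip = start_eip;
    cpu->regs.esp = 0;
    if (id >= lg->nr_cpus)
        lg->nr_cpus = id + 1;
    return cpu;
}
```

Code that used to read the registers through the guest now goes through the
vcpu, so two vcpus of one guest can hold different instruction pointers.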

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:11 +11:00
Glauber de Oliveira Costa
a3863f68b0 lguest: make emulate_insn receive a vcpu struct.
emulate_insn() needs to know about current eip, which will be,
in the future, a per-vcpu thing. So in this patch, the function
prototype is modified to receive a vcpu struct

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:10 +11:00
Glauber de Oliveira Costa
0c78441cf4 lguest: map_switcher_in_guest() per-vcpu
The switcher needs to be mapped per-vcpu, because different vcpus
will potentially have different page tables (they don't have to,
because threads will share the same).

So our first step is to make the function receive a vcpu struct

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:09 +11:00
Glauber de Oliveira Costa
177e449dc5 lguest: per-vcpu interrupt processing.
This patch adapts interrupt processing for using the vcpu struct.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:09 +11:00
Glauber de Oliveira Costa
ad8d8f3bc6 lguest: per-vcpu lguest timers
Here, I introduce per-vcpu timers. With this, we can have
local expiries, needed for accounting time in smp guests

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:08 +11:00
Glauber de Oliveira Costa
73044f05a4 lguest: make hypercalls use the vcpu struct
this patch changes do_hcall() and do_async_hcall() interfaces (and obviously their
callers) to get a vcpu struct. Again, a vcpu services the hypercall, not the whole
guest

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-30 22:50:08 +11:00