Commit Graph

15826 Commits

Author SHA1 Message Date
Al Viro
4301785c7c um/x86: merge 32 and 64bit variants of checksum.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2012-10-09 22:28:16 +02:00
Geert Uytterhoeven
9429ec96c2 um: Preinclude include/linux/kern_levels.h
The userspace part of UML uses the asm-offsets.h generator mechanism to
create definitions for UM_KERN_<LEVEL> that match the in-kernel
KERN_<LEVEL> constant definitions.

As of commit 04d2c8c83d ("printk: convert
the format for KERN_<LEVEL> to a 2 byte pattern"), KERN_<LEVEL> is no
longer expanded to the literal '"<LEVEL>"', but to '"\001" "LEVEL"', i.e.
it contains two parts.

However, the combo of DEFINE_STR() in
arch/x86/um/shared/sysdep/kernel-offsets.h and sed-y in Kbuild doesn't
support string literals consisting of multiple parts. Hence for all
UM_KERN_<LEVEL> definitions, only the SOH character is retained in the actual
definition, while the remainder ends up in the comment. E.g. in
include/generated/asm-offsets.h we get

    #define UM_KERN_INFO "\001" /* "6" KERN_INFO */

instead of

    #define UM_KERN_INFO "\001" "6" /* KERN_INFO */

This causes spurious '^A' output in some kernel messages:

    Calibrating delay loop... 4640.76 BogoMIPS (lpj=23203840)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 256
    ^AChecking that host ptys support output SIGIO...Yes
    ^AChecking that host ptys support SIGIO on close...No, enabling workaround
    ^AUsing 2.6 host AIO
    NET: Registered protocol family 16
    bio: create slab <bio-0> at 0
    Switching to clocksource itimer

To fix this:
  - Move the mapping from UM_KERN_<LEVEL> to KERN_<LEVEL> from
    arch/um/include/shared/common-offsets.h to
    arch/um/include/shared/user.h, which is preincluded for all userspace
    parts,
  - Preinclude include/linux/kern_levels.h for all userspace parts, to
    obtain the in-kernel KERN_<LEVEL> constant definitions. This doesn't
    violate the kernel/userspace separation, as include/linux/kern_levels.h
    is self-contained and doesn't expose any other kernel internals.
  - Remove the now unused STR() and DEFINE_STR() macros.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2012-09-27 20:20:09 +02:00
Richard Weinberger
bbb35efcda um: Fix IPC on um
commit c1d7e01d (ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION)
forgot UML and broke IPC on it.
Also UML has to select ARCH_WANT_IPC_PARSE_VERSION usin Kconfig.

Reported-and-tested-by: <Toralf Förster toralf.foerster@gmx.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2012-09-27 20:12:35 +02:00
Al Viro
d2ce4e92fa um: kill thread->forking
we only use that to tell copy_thread() done by syscall from that
done by kernel_thread().  However, it's easier to do simply by
checking PF_KTHREAD in thread flags.

Merge sys_clone() guts for 32bit and 64bit, while we are at it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-27 18:04:55 +02:00
Al Viro
f9a38eace4 um: let signal_delivered() do SIGTRAP on singlestepping into handler
... rather than duplicating that in sigframe setup code (and doing that
inconsistently, at that)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-27 18:04:47 +02:00
Linus Torvalds
0c59f23613 * Disable PV NUMA support as we do not do anything with it (yet)
and it can cause bootup crashes on certain AMD machines.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQYGDkAAoJEFjIrFwIi8fJrR0H/2iRt3nH3KeG6S2l2UaSvJuH
 BqtUSFFtMxKwAc9WC8gx9lc6y9HFig1SThzWuWJulNpF50QnBp38+OuMzEespoUN
 JLJtIp/jlYFZL5w2K7soXq7e0elbWTainPWvz5qJE7RifcnclAOGGrf/0LEVf/FQ
 xCjn9MLDq5WzbkwuA7TPMDb9RSD3ZJThMShj82kziwpTaniJCpl4YY0lVYUfknXo
 t2NTV6Ze6mkzU5QvxSA2ZJt89vsFXQNpsvUK0WzZfLmzugXqnqQHstaVti64ovZc
 e64PW61PIRhaZMCDuieclMXWbXFjkp4AsEQJLv3K/kojG20xJ/X08nmolnl64vw=
 =t4AN
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull a Xen fix from Konrad Rzeszutek Wilk:
 "It is a bug-fix when we run the initial PV guest on a AMD K8 machine
  and have CONFIG_AMD_NUMA enabled and detect the NUMA topology from the
  Northbridge.

  We end up in the situation where the initial domain gets too much
  information and gets confused and crashes - the fix is to restrict the
  domain to get the information - and we do it by just disabling NUMA on
  the PV guest (the hypervisor is still able to do its proper NUMA
  allocations of guests).

  It is OK to disable the PV guest from accessing NUMA data as right now
  we do not inject any NUMA node information to the PV guests.  When we
  do get to that point, then this patch will have to be reverted."

 * Disable PV NUMA support as we do not do anything with it (yet) and it
   can cause bootup crashes on certain AMD machines.

* tag 'stable/for-linus-3.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/boot: Disable NUMA for PV guests.
2012-09-24 16:14:34 -07:00
Konrad Rzeszutek Wilk
8d54db795d xen/boot: Disable NUMA for PV guests.
The hypervisor is in charge of allocating the proper "NUMA" memory
and dealing with the CPU scheduler to keep them bound to the proper
NUMA node. The PV guests (and PVHVM) have no inkling of where they
run and do not need to know that right now. In the future we will
need to inject NUMA configuration data (if a guest spans two or more
NUMA nodes) so that the kernel can make the right choices. But those
patches are not yet present.

In the meantime, disable the NUMA capability in the PV guest, which
also fixes a bootup issue. Andre says:

"we see Dom0 crashes due to the kernel detecting the NUMA topology not
by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA).

This will detect the actual NUMA config of the physical machine, but
will crash about the mismatch with Dom0's virtual memory. Variation of
the theme: Dom0 sees what it's not supposed to see.

This happens with the said config option enabled and on a machine where
this scanning is still enabled (K8 and Fam10h, not Bulldozer class)

We have this dump then:
NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
Scanning NUMA topology in Northbridge 24
Number of physical nodes 4
Node 0 MemBase 0000000000000000 Limit 0000000040000000
Node 1 MemBase 0000000040000000 Limit 0000000138000000
Node 2 MemBase 0000000138000000 Limit 00000001f8000000
Node 3 MemBase 00000001f8000000 Limit 0000000238000000
Initmem setup node 0 0000000000000000-0000000040000000
  NODE_DATA [000000003ffd9000 - 000000003fffffff]
Initmem setup node 1 0000000040000000-0000000138000000
  NODE_DATA [0000000137fd9000 - 0000000137ffffff]
Initmem setup node 2 0000000138000000-00000001f8000000
  NODE_DATA [00000001f095e000 - 00000001f0984fff]
Initmem setup node 3 00000001f8000000-0000000238000000
Cannot find 159744 bytes in node 3
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar
RIP: e030:[<ffffffff81d220e6>]  [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
.. snip..
  [<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178
  [<ffffffff81d23348>] sparse_init+0xe4/0x25a
  [<ffffffff81d16840>] paging_init+0x13/0x22
  [<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b
  [<ffffffff81683954>] ? printk+0x3c/0x3e
  [<ffffffff81d01a38>] start_kernel+0xe5/0x468
  [<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1
  [<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36
  [<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c
"

so we just disable NUMA scanning by setting numa_off=1.

CC: stable@vger.kernel.org
Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-24 08:47:20 -04:00
Linus Torvalds
56bae80268 Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild fixes from Michal Marek:
 "There are two more kbuild fixes for 3.6.

  One fixes a race between x86's archscripts target and the rule
  (re)building scripts/basic/fixdep.  The second is a fix for the
  previous attempt at fixing make firmware_install with make 3.82.
  This new solution should work with any version of GNU make"

* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  x86/kbuild: archscripts depends on scripts_basic
  firmware: fix directory creation rule matching with make 3.80
2012-09-23 15:40:58 -07:00
Linus Torvalds
9d10890792 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Small fixlets"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/init.c: Fix devmem_is_allowed() off by one
  x86/kconfig: Remove outdated reference to Intel CPUs in CONFIG_SWIOTLB
2012-09-21 14:26:23 -07:00
Linus Torvalds
18f5600ba2 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Small perf fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing: Don't call page_to_pfn() if page is NULL
  perf/x86: Fix Intel Ivy Bridge support
  perf/x86/ibs: Check syscall attribute flags
  perf/x86: Export Sandy Bridge uncore clockticks event in sysfs
2012-09-21 14:24:48 -07:00
Linus Torvalds
8ca7de9164 Bug-fixes:
* Fix M2P batching re-using the incorrect structure field.
  * Disable BIOS SMP MP table search.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQXGfdAAoJEFjIrFwIi8fJbWcH/0FI2d/VyB+ZU0ng3R0Oa7mt
 iR/x+Z+mfFdp2dXS6gs6DgJIZVA7i2K9pX4rOXjpDGGGyUeo1xoqjlQfsFWQGjZ/
 p49RrDrM93c2GdRXk3iMSWfboQI7BXBs5rnyYZQL7kMxUSR75MxbeONvhPrMSO9I
 3EBidWH08qjrn2HVF44F6xh5ONjpclo5AvGIzJ0eU4X0D0eqMnhvlAw8/UYJU2HV
 heRvuxWF9l2jNpLhKhZy1730D1X/vKA5qKAcBW8rCOpEijyPpmtKbqapeUJg/9pH
 NVquuwGutP5ozrSi7a/23+L+ezvQBmCPm5ZRG44PccBoZ/HVs8haT8UypSWSDzo=
 =TwvM
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - Fix M2P batching re-using the incorrect structure field.

   In v3.5 we added batching for M2P override (Machine Frame Number ->
   Physical Frame Number), but the original MFN was saved in an
   incorrect structure - and we would oops/restore when restoring with
   the old MFN.

 - Disable BIOS SMP MP table search.

   A bootup issue that we had ignored until we found that on DL380 G6 it
   was needed.

* tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/boot: Disable BIOS SMP MP table search.
  xen/m2p: do not reuse kmap_op->dev_bus_addr
2012-09-21 12:06:54 -07:00
Jeff Mahoney
24cc7fb69a x86/kbuild: archscripts depends on scripts_basic
While building the SUSE kernel packages, which build the scripts,
make clean, and then build everything, we have been running into spurious
build failures. We tracked them down to a simple dependency issue:

$ make mrproper
  CLEAN   arch/x86/tools
  CLEAN   scripts/basic
$ cp patches/config/x86_64/desktop .config
$ make archscripts
  HOSTCC  arch/x86/tools/relocs
/bin/sh: scripts/basic/fixdep: No such file or directory
make[3]: *** [arch/x86/tools/relocs] Error 1
make[2]: *** [archscripts] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

This was introduced by commit
6520fe55 (x86, realmode: 16-bit real-mode code support for relocs),
which added the archscripts dependency to archprepare.

This patch adds the scripts_basic dependency to the x86 archscripts.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2012-09-21 13:49:47 +02:00
Konrad Rzeszutek Wilk
bd49940a35 xen/boot: Disable BIOS SMP MP table search.
As the initial domain we are able to search/map certain regions
of memory to harvest configuration data. For all low-level we
use ACPI tables - for interrupts we use exclusively ACPI _PRT
(so DSDT) and MADT for INT_SRC_OVR.

The SMP MP table is not used at all. As a matter of fact we do
not even support machines that only have SMP MP but no ACPI tables.

Lets follow how Moorestown does it and just disable searching
for BIOS SMP tables.

This also fixes an issue on HP Proliant BL680c G5 and DL380 G6:

9f->100 for 1:1 PTE
Freeing 9f-100 pfn range: 97 pages freed
1-1 mapping on 9f->100
.. snip..
e820: BIOS-provided physical RAM map:
Xen: [mem 0x0000000000000000-0x000000000009efff] usable
Xen: [mem 0x000000000009f400-0x00000000000fffff] reserved
Xen: [mem 0x0000000000100000-0x00000000cfd1dfff] usable
.. snip..
Scan for SMP in [mem 0x00000000-0x000003ff]
Scan for SMP in [mem 0x0009fc00-0x0009ffff]
Scan for SMP in [mem 0x000f0000-0x000fffff]
found SMP MP-table at [mem 0x000f4fa0-0x000f4faf] mapped at [ffff8800000f4fa0]
(XEN) mm.c:908:d0 Error getting mfn 100 (pfn 5555555555555555) from L1 entry 0000000000100461 for l1e_owner=0, pg_owner=0
(XEN) mm.c:4995:d0 ptwr_emulate: could not get_page_from_l1e()
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81ac07e2>] xen_set_pte_init+0x66/0x71
. snip..
Pid: 0, comm: swapper Not tainted 3.6.0-rc6upstream-00188-gb6fb969-dirty #2 HP ProLiant BL680c G5
.. snip..
Call Trace:
 [<ffffffff81ad31c6>] __early_ioremap+0x18a/0x248
 [<ffffffff81624731>] ? printk+0x48/0x4a
 [<ffffffff81ad32ac>] early_ioremap+0x13/0x15
 [<ffffffff81acc140>] get_mpc_size+0x2f/0x67
 [<ffffffff81acc284>] smp_scan_config+0x10c/0x136
 [<ffffffff81acc2e4>] default_find_smp_config+0x36/0x5a
 [<ffffffff81ac3085>] setup_arch+0x5b3/0xb5b
 [<ffffffff81624731>] ? printk+0x48/0x4a
 [<ffffffff81abca7f>] start_kernel+0x90/0x390
 [<ffffffff81abc356>] x86_64_start_reservations+0x131/0x136
 [<ffffffff81abfa83>] xen_start_kernel+0x65f/0x661
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

which is that ioremap would end up mapping 0xff using _PAGE_IOMAP
(which is what early_ioremap sticks as a flag) - which meant
we would get MFN 0xFF (pte ff461, which is OK), and then it would
also map 0x100 (b/c ioremap tries to get page aligned request, and
it was trying to map 0xf4fa0 + PAGE_SIZE - so it mapped the next page)
as _PAGE_IOMAP. Since 0x100 is actually a RAM page, and the _PAGE_IOMAP
bypasses the P2M lookup we would happily set the PTE to 1000461.
Xen would deny the request since we do not have access to the
Machine Frame Number (MFN) of 0x100. The P2M[0x100] is for example
0x80140.

CC: stable@vger.kernel.org
Fixes-Oracle-Bugzilla: https://bugzilla.oracle.com/bugzilla/show_bug.cgi?id=13665
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-19 15:28:28 -04:00
Stephane Eranian
20a36e39d5 perf/x86: Fix Intel Ivy Bridge support
This patch updates the existing Intel IvyBridge (model 58)
support with proper PEBS event constraints. It cannot reuse
the same as SandyBridge because some events (0xd3) are
specific to IvyBridge.

Also there is no UOPS_DISPATCHED.THREAD on IVB, so do not
populate the PERF_COUNT_HW_STALLED_CYCLES_BACKEND mapping.

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Link: http://lkml.kernel.org/r/20120910230701.GA5898@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-19 17:28:47 +02:00
Linus Torvalds
7ef6e97380 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree includes various fixes"

Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
  perf/x86: Enable Intel Cedarview Atom suppport
  perf_event: Switch to internal refcount, fix race with close()
  oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
  perf/x86: Fix microcode revision check for SNB-PEBS
2012-09-14 17:43:45 -07:00
T Makphaibulchoke
73e8f3d7e2 x86/mm/init.c: Fix devmem_is_allowed() off by one
Fixing an off-by-one error in devmem_is_allowed(), which allows
accesses to physical addresses 0x100000-0x100fff, an extra page
past 1MB.

Signed-off-by: T Makphaibulchoke <tmac@hp.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: yinghai@kernel.org
Cc: tiwai@suse.de
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/1346210503-14276-1-git-send-email-tmac@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-13 17:35:54 +02:00
Robert Richter
bad9ac2d7f perf/x86/ibs: Check syscall attribute flags
Current implementation simply ignores attribute flags. Thus, there is
no notification to userland of unsupported features. Check syscall's
attribute flags to let userland know if a feature is supported by the
kernel. This is also needed to distinguish between future kernels what
might support a feature.

Cc: <stable@vger.kernel.org> v3.5..
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120910093018.GO8285@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-13 16:59:48 +02:00
Stephane Eranian
35534b201c perf/x86: Export Sandy Bridge uncore clockticks event in sysfs
This patch exports the clockticks event and its encoding to user level.
The clockticks event was exported for Nehalem/Westmere but not for Sandy
Bridge (client). Given that it uses a special encoding, it needs to be
exported to user tools, so users can do:

  # perf stat -a -C 0 -e uncore_cbox_0/clockticks/ sleep 1

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120829130122.GA32336@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-13 16:59:46 +02:00
Stefano Stabellini
2fc136eecd xen/m2p: do not reuse kmap_op->dev_bus_addr
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.

Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).

CC: stable@kernel.org
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-12 11:21:40 -04:00
Linus Torvalds
ffc2964918 KVM updates for 3.6-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQTgR7AAoJEI7yEDeUysxlLnsP/jsGydL5WWsujpK9jdEbNa+r
 Y2D8aPgKRe6Z6ZmWpe7jxymXA4haXh1kKO8nIVgfFOgtvN+Po8rZMOQE0TpuulFy
 ANYa2N9Mi4+tpi9pqjcVgafjNa7xxRer04jZr5a789ry0MXjKspoUv/rFPpOsxrL
 XlaPnRu1MLnCWmnLv3nvgatth3fIjo4X31FGmUwl4dRc59NIQREek7z3gAV0x4td
 +JK6qvwzYOz0LhwV/Ssk84Sm97OOfXVaKWvxfC8y0H6nt0nXqmfGI/d5tWL8F9xz
 92c0lelvbm6AMgx5yY5onRYDE6SkdlnAS563umIB58jMp44Fj+0MMPSfPQM1rBXC
 NZ7mh2BJJ8Dsfi2NVneWW089o+uxSfI9HkuXGEiTzcu2mjISjuJ5ih7efAAmr7xc
 e8dMM6fVTpNTazd6VGvsgno8f/Hr+D6qOZwZ8DsH4QLX0NJC41Dm6loHCE3saR2x
 tjGwcbJEYqnboXEd09YOGqVOExR111cBkhVmf1qvOB71ENXoMnJ1L3e+AWojc1xN
 klaMM8SQd0MsgjqTx4AOpPMKkzmONN1v44sFuQfARR1iwyY6dmLcJTqMQtDG9vJP
 YTpZdkmswQrAnE2xVyMWkUd3vFj2d3JRU+aELYfIcVEuxx3hsRZdR8ezcrTRZLST
 s4ii8XXmb9MGlggwrmk5
 =BPP1
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "A trio of KVM fixes: incorrect lookup of guest cpuid, an uninitialized
  variable fix, and error path cleanup fix."

* tag 'kvm-3.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: fix error paths for failed gfn_to_page() calls
  KVM: x86: Check INVPCID feature bit in EBX of leaf 7
  KVM: PIC: fix use of uninitialised variable.
2012-09-11 09:30:08 +08:00
Xiao Guangrong
4484141a94 KVM: fix error paths for failed gfn_to_page() calls
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361]  [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382]  [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401]  [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407]  [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411]  [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418]  [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424]  [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca

The test case:

	printf(fmt, ##args);		\
	exit(-1);} while (0)

static int create_vm(void)
{
	int sys_fd, vm_fd;

	sys_fd = open("/dev/kvm", O_RDWR);
	if (sys_fd < 0)
		die("open /dev/kvm fail.\n");

	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		die("KVM_CREATE_VM fail.\n");

	return vm_fd;
}

static int create_vcpu(int vm_fd)
{
	int vcpu_fd;

	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	if (vcpu_fd < 0)
		die("KVM_CREATE_VCPU ioctl.\n");
	printf("Create vcpu.\n");
	return vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
	int vm_fd = (int)(long)arg;

	create_vcpu(vm_fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread;
	int vm_fd;

	(void)argc;
	(void)argv;

	vm_fd = create_vm();
	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
	printf("Exit.\n");
	return 0;
}

It caused by release kvm->arch.ept_identity_map_addr which is the
error page.

The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time

Fixed by checking the page before it is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:34:11 +03:00
Ren, Yongjie
4f97704555 KVM: x86: Check INVPCID feature bit in EBX of leaf 7
Checks and operations on the INVPCID feature bit should use EBX
of CPUID leaf 7 instead of ECX.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Yongjie Ren <yongjien.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-09 17:34:01 +03:00
Linus Torvalds
bf71d0e18e Bug-fixes:
* Fix for TLB flushing introduced in v3.6
  * Fix Xen-SWIOTLB not using proper DMA mask - device had 64bit but
    in a 32-bit kernel we need to allocate for coherent pages from a
    32-bit pool.
  * When trying to re-use P2M nodes we had a one-off error and triggered
    a BUG_ON check with specific CONFIG_ option.
  * When doing FLR in Xen-PCI-backend we would first do FLR then save the
    PCI configuration space. We needed to do it the other way around.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQSS7TAAoJEFjIrFwIi8fJOBQH/03JBKBbFvewhop8T5Jww2c9
 SWmMgzDm5HkxeWj5XnM+zlLoHrYFFu7tyuRLiny4weE0LdRl4adJ1TVpStxap/b6
 MSQKe+tZevslaReBOsMpbCk3z7fEWNlAcpm6wMp1xYmLoHcr0JMpOCmzigbf7dwM
 F4UWULheih9ME3UeqDAU8qgvfv6ccZ9vempO4TDWKjxfxfWODCNMRx+Ny+C7NNRk
 QeoInHJUqcRkg0q0OIciF/YYDmn8hIH7HgfqomuMb6rEv2LOieLnWVuniEPWM75s
 lECxro7v2Z9s1Std0WWKcFp8VZpI2scnQYU8aL8TpewHcbzKHQYyipuKPu7FiEQ=
 =GfLE
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 * Fix for TLB flushing introduced in v3.6
 * Fix Xen-SWIOTLB not using proper DMA mask - device had 64bit but
   in a 32-bit kernel we need to allocate for coherent pages from a
   32-bit pool.
 * When trying to re-use P2M nodes we had a one-off error and triggered
   a BUG_ON check with specific CONFIG_ option.
 * When doing FLR in Xen-PCI-backend we would first do FLR then save the
   PCI configuration space. We needed to do it the other way around.

* tag 'stable/for-linus-3.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pciback: Fix proper FLR steps.
  xen: Use correct masking in xen_swiotlb_alloc_coherent.
  xen: fix logical error in tlb flushing
  xen/p2m: Fix one-off error in checking the P2M tree directory.
2012-09-06 17:16:42 -07:00
Alex Shi
ce7184bdbd xen: fix logical error in tlb flushing
While TLB_FLUSH_ALL gets passed as 'end' argument to
flush_tlb_others(), the Xen code was made to check its 'start'
parameter. That may give a incorrect op.cmd to MMUEXT_INVLPG_MULTI
instead of MMUEXT_TLB_FLUSH_MULTI. Then it causes some page can not
be flushed from TLB.

This patch fixed this issue.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-05 10:50:21 -04:00
Konrad Rzeszutek Wilk
593d0a3e9f Merge commit '4cb38750d49010ae72e718d46605ac9ba5a851b4' into stable/for-linus-3.6
* commit '4cb38750d49010ae72e718d46605ac9ba5a851b4': (6849 commits)
  bcma: fix invalid PMU chip control masks
  [libata] pata_cmd64x: whitespace cleanup
  libata-acpi: fix up for acpi_pm_device_sleep_state API
  sata_dwc_460ex: device tree may specify dma_channel
  ahci, trivial: fixed coding style issues related to braces
  ahci_platform: add hibernation callbacks
  libata-eh.c: local functions should not be exposed globally
  libata-transport.c: local functions should not be exposed globally
  sata_dwc_460ex: support hardreset
  ata: use module_pci_driver
  drivers/ata/pata_pcmcia.c: adjust suspicious bit operation
  pata_imx: Convert to clk_prepare_enable/clk_disable_unprepare
  ahci: Enable SB600 64bit DMA on MSI K9AGM2 (MS-7327) v2
  [libata] Prevent interface errors with Seagate FreeAgent GoFlex
  drivers/acpi/glue: revert accidental license-related 6b66d95895 bits
  libata-acpi: add missing inlines in libata.h
  i2c-omap: Add support for I2C_M_STOP message flag
  i2c: Fall back to emulated SMBus if the operation isn't supported natively
  i2c: Add SCCB support
  i2c-tiny-usb: Add support for the Robofuzz OSIF USB/I2C converter
  ...
2012-09-05 10:22:45 -04:00
Konrad Rzeszutek Wilk
50e900417b xen/p2m: Fix one-off error in checking the P2M tree directory.
We would traverse the full P2M top directory (from 0->MAX_DOMAIN_PAGES
inclusive) when trying to figure out whether we can re-use some of the
P2M middle leafs.

Which meant that if the kernel was compiled with MAX_DOMAIN_PAGES=512
we would try to use the 512th entry. Fortunately for us the p2m_top_index
has a check for this:

 BUG_ON(pfn >= MAX_P2M_PFN);

which we hit and saw this:

(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1.2-OVM  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff819cadeb>]
(XEN) RFLAGS: 0000000000000212   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff81db5000   rbx: ffffffff81db4000   rcx: 0000000000000000
(XEN) rdx: 0000000000480211   rsi: 0000000000000000   rdi: ffffffff81db4000
(XEN) rbp: ffffffff81793db8   rsp: ffffffff81793d38   r8:  0000000008000000
(XEN) r9:  4000000000000000   r10: 0000000000000000   r11: ffffffff81db7000
(XEN) r12: 0000000000000ff8   r13: ffffffff81df1ff8   r14: ffffffff81db6000
(XEN) r15: 0000000000000ff8   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000661795000   cr2: 0000000000000000

Fixes-Oracle-Bug: 14570662
CC: stable@vger.kernel.org # only for v3.5
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-05 09:47:41 -04:00
Joe Millenbach
4454d32749 x86/kconfig: Remove outdated reference to Intel CPUs in CONFIG_SWIOTLB
Deleted the no longer valid example of which x86 CPUs lack a
hardware IOMMU, and moved the "If unsure..." statement to a new
line to follow the style of surrounding options.

Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: team-fjord@googlegroups.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1346632700-29113-1-git-send-email-jmillenbach@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-05 11:05:01 +02:00
Stephane Eranian
3ec18cd8b8 perf/x86: Enable Intel Cedarview Atom suppport
This patch enables perf_events support for Intel Cedarview
Atom (model 54) processors. Support includes PEBS and LBR.
Tested on my Atom N2600 netbook.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820092421.GA11284@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 17:29:23 +02:00
Jamie Iles
749c59fd15 KVM: PIC: fix use of uninitialised variable.
Commit aea218f3cb (KVM: PIC: call ack notifiers for irqs that are
dropped form irr) used an uninitialised variable to track whether an
appropriate apic had been found.  This could result in calling the ack
notifier incorrectly.

Cc: Gleb Natapov <gleb@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-04 15:44:42 +03:00
Michael S. Tsirkin
1d92128fe9 KVM: x86: fix KVM_GET_MSR for PV EOI
KVM_GET_MSR was missing support for PV EOI,
which is needed for migration.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 18:03:05 -03:00
Stephane Eranian
e3e45c01ae perf/x86: Fix microcode revision check for SNB-PEBS
The following patch makes the microcode update code path
actually invoke the perf_check_microcode() function and
thus potentially renabling SNB PEBS.

By default, CONFIG_MICROCODE_OLD_INTERFACE is
forced to Y in arch/x86/Kconfig. There is no
way to disable this. That means that the code
path used in arch/x86/kernel/microcode_core.c
did not include the call to perf_check_microcode().

Thus, even though the microcode was updated to a
version that fixes the SNB PEBS problem, perf_event
would still return EOPNOTSUPP when enabling precise
sampling.

This patch simply adds a call to perf_check_microcode()
in the call path used when OLD_INTERFACE=y.

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: peterz@infradead.org
Cc: andi@firstfloor.org
Link: http://lkml.kernel.org/r/20120824133434.GA8014@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-27 08:48:19 +02:00
Linus Torvalds
267560874c Three bug-fixes:
- Revert the kexec fix which caused on non-kexec shutdowns a race.
  - Reuse existing P2M leafs - instead of requiring to allocate a large
    area of bootup virtual address estate.
  - Fix a one-off error when adding PFNs for balloon pages.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQNppKAAoJEFjIrFwIi8fJU/oH/jdWdRqJgC5mCnu9LwrIemEj
 gPTAcKw01A/2vbOY5rfXx7rCpgeU5ZM/XSt0byz/J5q0bmjjKVM106Smq1s7EaQx
 OjsdLglWoZYzKJjXH/FEKRPD39f/hd+KNJu3aGEJM8UZ0htvxlg6ACGzVPJa83Pf
 yrRXSycxvEevbGbuwWdNubxD5WKMMmbzi/HGGfdtL4256d0xIgxMrYgskLek96cR
 cg11llC5QLzH8mX+M5iX0lchASvMITyERXyEKK2opFN8a/766yi16agP75RKZdkP
 kWXp0vyOMrpy9UnOs2V1XLc/ufqNwHLcPVfecScXhz8xZWrZYOBdJQf7HAWxvLE=
 =MgvT
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull three xen bug-fixes from Konrad Rzeszutek Wilk:
 - Revert the kexec fix which caused on non-kexec shutdowns a race.
 - Reuse existing P2M leafs - instead of requiring to allocate a large
   area of bootup virtual address estate.
 - Fix a one-off error when adding PFNs for balloon pages.

* tag 'stable/for-linus-3.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
  xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
  Revert "xen PVonHVM: move shared_info to MMIO before kexec"
2012-08-25 17:31:59 -07:00
Linus Torvalds
6ec9776c28 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Marcelo Tosatti.

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86 emulator: use stack size attribute to mask rsp in stack ops
  KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
  ppc: e500_tlb memset clears nothing
  KVM: PPC: Add cache flush on page map
  KVM: PPC: Book3S HV: Fix incorrect branch in H_CEDE code
  KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
2012-08-25 17:27:17 -07:00
Konrad Rzeszutek Wilk
c96aae1f7f xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
When we are finished with return PFNs to the hypervisor, then
populate it back, and also mark the E820 MMIO and E820 gaps
as IDENTITY_FRAMEs, we then call P2M to set areas that can
be used for ballooning. We were off by one, and ended up
over-writting a P2M entry that most likely was an IDENTITY_FRAME.
For example:

1-1 mapping on 40000->40200
1-1 mapping on bc558->bc5ac
1-1 mapping on bc5b4->bc8c5
1-1 mapping on bc8c6->bcb7c
1-1 mapping on bcd00->100000
Released 614 pages of unused memory
Set 277889 page(s) to 1-1 mapping
Populating 40200-40466 pfn range: 614 pages added

=> here we set from 40466 up to bc559 P2M tree to be
INVALID_P2M_ENTRY. We should have done it up to bc558.

The end result is that if anybody is trying to construct
a PTE for PFN bc558 they end up with ~PAGE_PRESENT.

CC: stable@vger.kernel.org
Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-08-23 10:14:52 -04:00
Andreas Herrmann
36bf50d769 x86, microcode, AMD: Fix broken ucode patch size check
This issue was recently observed on an AMD C-50 CPU where a patch of
maximum size was applied.

Commit be62adb492 ("x86, microcode, AMD: Simplify ucode verification")
added current_size in get_matching_microcode(). This is calculated as
size of the ucode patch + 8 (ie. size of the header). Later this is
compared against the maximum possible ucode patch size for a CPU family.
And of course this fails if the patch has already maximum size.

Cc: <stable@vger.kernel.org> [3.3+]
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-08-22 16:10:41 -07:00
Avi Kivity
5ad105e569 KVM: x86 emulator: use stack size attribute to mask rsp in stack ops
The sub-register used to access the stack (sp, esp, or rsp) is not
determined by the address size attribute like other memory references,
but by the stack segment's B bit (if not in x86_64 mode).

Fix by using the existing stack_mask() to figure out the correct mask.

This long-existing bug was exposed by a combination of a27685c33a
(emulate invalid guest state by default), which causes many more
instructions to be emulated, and a seabios change (possibly a bug) which
causes the high 16 bits of esp to become polluted across calls to real
mode software interrupts.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-22 18:54:26 -03:00
Takuya Yoshikawa
35f2d16bb9 KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
Although the possible race described in

  commit 85b7059169
  KVM: MMU: fix shrinking page from the empty mmu

was correct, the real cause of that issue was a more trivial bug of
mmu_shrink() introduced by

  commit 1952639665
  KVM: MMU: do not iterate over all VMs in mmu_shrink()

Here is the bug:

	if (kvm->arch.n_used_mmu_pages > 0) {
		if (!nr_to_scan--)
			break;
		continue;
	}

We skip VMs whose n_used_mmu_pages is not zero and try to shrink others:
in other words we try to shrink empty ones by mistake.

This patch reverses the logic so that mmu_shrink() can free pages from
the first VM whose n_used_mmu_pages is not zero.  Note that we also add
comments explaining the role of nr_to_scan which is not practically
important now, hoping this will be improved in the future.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:27:13 +03:00
Avi Kivity
cb09cad44f x86/alternatives: Fix p6 nops on non-modular kernels
Probably a leftover from the early days of self-patching, p6nops
are marked __initconst_or_module, which causes them to be
discarded in a non-modular kernel.  If something later triggers
patching, it will overwrite kernel code with garbage.

Reported-by: Tomas Racek <tracek@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-22 12:09:49 +02:00
Liu, Chuansheng
2530cd4f44 x86/fixup_irq: Use cpu_online_mask instead of cpu_all_mask
When one CPU is going down and this CPU is the last one in irq
affinity, current code is setting cpu_all_mask as the new
affinity for that irq.

But for some systems (such as in Medfield Android mobile) the
firmware sends the interrupt to each CPU in the irq affinity
mask, averaged, and cpu_all_mask includes all potential CPUs,
i.e. offline ones as well.

So replace cpu_all_mask with cpu_online_mask.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-22 10:36:08 +02:00
Richard Weinberger
83be4ffa1a x86/spinlocks: Fix comment in spinlock.h
This comment is no longer true.  We support up to 2^16 CPUs
because __ticket_t is an u16 if NR_CPUS is larger than 256.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-22 09:52:47 +02:00
Michal Hocko
eb48c07146 mm: hugetlbfs: correctly populate shared pmd
Each page mapped in a process's address space must be correctly
accounted for in _mapcount.  Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.  The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

  kernel BUG at mm/filemap.c:135!
  invalid opcode: 0000 [#1] SMP
  CPU 22
  Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
  Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
  RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
  Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
  Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte.  The logic is if
the PMD page is the same, they must be shared.  This assumes that the
sharing is between the parent and child.  However, if the sharing is
with a different process entirely then this check fails as in this
diagram:

  parent
    |
    ------------>pmd
                 src_pte----------> data page
                                        ^
  other--------->pmd--------------------|
                  ^
  child-----------|
                 dst_pte

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other.  This is
possible due to the following style of race.

  PROC A                                          PROC B
  copy_hugetlb_page_range                         copy_hugetlb_page_range
    src_pte == huge_pte_offset                      src_pte == huge_pte_offset
    !src_pte so no sharing                          !src_pte so no sharing

  (time passes)

  hugetlb_fault                                   hugetlb_fault
    huge_pte_alloc                                  huge_pte_alloc
      huge_pmd_share                                 huge_pmd_share
        LOCK(i_mmap_mutex)
        find nothing, no sharing
        UNLOCK(i_mmap_mutex)
                                                      LOCK(i_mmap_mutex)
                                                      find nothing, no sharing
                                                      UNLOCK(i_mmap_mutex)
      pmd_alloc                                       pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                                  LOCK(instantiation_mutex)
                                                  fault
                                                  UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed.  When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd.  This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.

Race identified and changelog written mostly by Mel Gorman.

{akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:02 -07:00
Linus Torvalds
c71a35520f Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar.

A x32 socket ABI fix with a -stable backport tag among other fixes.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x32: Use compat shims for {g,s}etsockopt
  Revert "x86-64/efi: Use EFI to deal with platform wall clock"
  x86, apic: fix broken legacy interrupts in the logical apic mode
  x86, build: Globally set -fno-pic
  x86, avx: don't use avx instructions with "noxsave" boot param
2012-08-20 10:36:18 -07:00
Linus Torvalds
f78602ab7c Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: disable PEBS on a guest entry.
  perf/x86: Add Intel Westmere-EX uncore support
  perf/x86: Fixes for Nehalem-EX uncore driver
  perf, x86: Fix uncore_types_exit section mismatch
2012-08-20 10:34:21 -07:00
Mike Frysinger
515c7af85e x32: Use compat shims for {g,s}etsockopt
Some of the arguments to {g,s}etsockopt are passed in userland pointers.
If we try to use the 64bit entry point, we end up sometimes failing.

For example, dhcpcd doesn't run in x32:
	# dhcpcd eth0
	dhcpcd[1979]: version 5.5.6 starting
	dhcpcd[1979]: eth0: broadcasting for a lease
	dhcpcd[1979]: eth0: open_socket: Invalid argument
	dhcpcd[1979]: eth0: send_raw_packet: Bad file descriptor

The code in particular is getting back EINVAL when doing:
	struct sock_fprog pf;
	setsockopt(s, SOL_SOCKET, SO_ATTACH_FILTER, &pf, sizeof(pf));

Diving into the kernel code, we can see:
include/linux/filter.h:
	struct sock_fprog {
		unsigned short len;
		struct sock_filter __user *filter;
	};

net/core/sock.c:
	case SO_ATTACH_FILTER:
		ret = -EINVAL;
		if (optlen == sizeof(struct sock_fprog)) {
			struct sock_fprog fprog;

			ret = -EFAULT;
			if (copy_from_user(&fprog, optval, sizeof(fprog)))
				break;

			ret = sk_attach_filter(&fprog, sk);
		}
		break;

arch/x86/syscalls/syscall_64.tbl:
	54 common setsockopt sys_setsockopt
	55 common getsockopt sys_getsockopt

So for x64, sizeof(sock_fprog) is 16 bytes.  For x86/x32, it's 8 bytes.
This comes down to the pointer being 32bit for x32, which means we need
to do structure size translation.  But since x32 comes in directly to
sys_setsockopt, it doesn't get translated like x86.

After changing the syscall table and rebuilding glibc with the new kernel
headers, dhcp runs fine in an x32 userland.

Oddly, it seems like Linus noted the same thing during the initial port,
but I guess that was missed/lost along the way:
	https://lkml.org/lkml/2011/8/26/452

[ hpa: tagging for -stable since this is an ABI fix. ]

Bugzilla: https://bugs.gentoo.org/423649
Reported-by: Mads <mads@ab3.no>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Link: http://lkml.kernel.org/r/1345320697-15713-1-git-send-email-vapier@gentoo.org
Cc: H. J. Lu <hjl.tools@gmail.com>
Cc: <stable@vger.kernel.org> v3.4..v3.5
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-08-18 14:15:39 -07:00
Konrad Rzeszutek Wilk
250a41e0ec xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
If P2M leaf is completly packed with INVALID_P2M_ENTRY or with
1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf
with either a p2m_missing or p2m_identity respectively. The old
page (which was created via extend_brk or was grafted on from the
mfn_list) can be re-used for setting new PFNs.

This also means we can remove git commit:
5bc6f9888d
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back
which tried to fix this.

and make the amount that is required to be reserved much smaller.

CC: stable@vger.kernel.org # for 3.5 only.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-08-17 09:29:15 -04:00
Linus Torvalds
ad54e46113 Fix:
* On machines with large MMIO/PCI E820 spaces we fail to boot b/c
    we failed to pre-allocate large enough virtual space for extend_brk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQKlV9AAoJEFjIrFwIi8fJZh4H/0ZlRrgG+8mqwCM+pcyYY+2a
 zqnOrfYUO/aO26oqiOQUrn4quLAElhBuJK19uSj8fckMMZ+sr5rTJTaXmT6b7F7N
 pgTXsKQCYAJ2NNGHVSQ73KYjOUeEW3woDSQZo0y/GRzOjiQsxpoFc8PS94ZieUNT
 G6a8ECZBRv3fz8nAuJlhGV/suqHGOLJ0pwum1gHGOzaH3ZoZVtaQv5LhGYctJspU
 yF5bdeD0qjCbseVtJ72tyxzLxMwLpJtdy2MbSwIv5JGuszj0nRmL4oa7Vc4vYdyv
 p+FrNmbDAZ1j61z1PhBZPmgzwba2LTXtIWhR2zsGJgqlJNzMUtlNkff1kT3NeE0=
 =Gl6V
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen fix from Konrad Rzeszutek Wilk:
 "Way back in v3.5 we added a mechanism to populate back pages that were
  released (they overlapped with MMIO regions), but neglected to reserve
  the proper amount of virtual space for extend_brk to work properly.

  Coincidentally some other commit aligned the _brk space to larger area
  so I didn't trigger this until it was run on a machine with more than
  2GB of MMIO space."

 * On machines with large MMIO/PCI E820 spaces we fail to boot b/c
   we failed to pre-allocate large enough virtual space for extend_brk.

* tag 'stable/for-linus-3.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.
2012-08-16 11:31:59 -07:00
Konrad Rzeszutek Wilk
ca08649eb5 Revert "xen PVonHVM: move shared_info to MMIO before kexec"
This reverts commit 00e37bdb01.

During shutdown of PVHVM guests with more than 2VCPUs on certain
machines we can hit the race where the replaced shared_info is not
replaced fast enough and the PV time clock retries reading the same
area over and over without any any success and is stuck in an
infinite loop.

Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-08-16 13:05:25 -04:00
H. Peter Anvin
f026cfa82f Revert "x86-64/efi: Use EFI to deal with platform wall clock"
This reverts commit bacef661ac.

This commit has been found to cause serious regressions on a number of
ASUS machines at the least.  We probably need to provide a 1:1 map in
addition to the EFI virtual memory map in order for this to work.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reported-and-bisected-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu
2012-08-14 09:58:25 -07:00
Suresh Siddha
f1c6300183 x86, apic: fix broken legacy interrupts in the logical apic mode
Recent commit 332afa656e cleaned up
a workaround that updates irq_cfg domain for legacy irq's that
are handled by the IO-APIC. This was assuming that the recent
changes in assign_irq_vector() were sufficient to remove the workaround.

But this broke couple of AMD platforms. One of them seems to be
sending interrupts to the offline cpu's, resulting in spurious
"No irq handler for vector xx (irq -1)" messages when those cpu's come online.
And the other platform seems to always send the interrupt to the last logical
CPU (cpu-7). Recent changes had an unintended side effect of using only logical
cpu-0 in the IO-APIC RTE (during boot for the legacy interrupts) and this
broke the legacy interrupts not getting routed to the cpu-7 on the AMD
platform, resulting in a boot hang.

For now, reintroduce the removed workaround, (essentially not allowing the
vector to change for legacy irq's when io-apic starts to handle the irq. Which
also addressed the uninteded sife effect of just specifying cpu-0 in the
IO-APIC RTE for those irq's during boot).

Reported-and-tested-by: Robert Richter <robert.richter@amd.com>
Reported-and-tested-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1344453412.29170.5.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-08-14 09:52:20 -07:00
Gleb Natapov
26a4f3c08d perf/x86: disable PEBS on a guest entry.
If PMU counter has PEBS enabled it is not enough to disable counter
on a guest entry since PEBS memory write can overshoot guest entry
and corrupt guest memory. Disabling PEBS during guest entry solves
the problem.

Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120809085234.GI3341@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 19:01:04 +02:00