Commit Graph

262375 Commits

Author SHA1 Message Date
Huang Ying
ea8f5fb8a7 HWPoison: add memory_failure_queue()
memory_failure() is the entry point for HWPoison memory error
recovery.  It must be called in process context.  But commonly
hardware memory errors are notified via MCE or NMI, so some delayed
execution mechanism must be used.  In MCE handler, a work queue + ring
buffer mechanism is used.

In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
(Generic Hardware Error Source) can be used to report memory errors
too.  To add support to APEI GHES memory recovery, a mechanism similar
to that of MCE is implemented.  memory_failure_queue() is the new
entry point that can be called in IRQ context.  The next step is to
make MCE handler uses this interface too.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:15:58 -04:00
Huang Ying
152cef40a8 ACPI, APEI, GHES, Error records content based throttle
printk is used by GHES to report hardware errors.  Ratelimit is
enforced on the printk to avoid too many hardware error reports in
kernel log.  Because there may be thousands or even millions of
corrected hardware errors during system running.

Currently, a simple scheme is used.  That is, the total number of
hardware error reporting is ratelimited.  This may cause some issues
in practice.

For example, there are two kinds of hardware errors occurred in
system.  One is corrected memory error, because the fault memory
address is accessed frequently, there may be hundreds error report
per-second.  The other is corrected PCIe AER error, it will be
reported once per-second.  Because they share one ratelimit control
structure, it is highly possible that only memory error is reported.

To avoid the above issue, an error record content based throttle
algorithm is implemented in the patch.  Where after the first
successful reporting, all error records that are same are throttled for
some time, to let other kinds of error records have the opportunity to
be reported.

In above example, the memory errors will be throttled for some time,
after being printked.  Then the PCIe AER error will be printked
successfully.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:15:57 -04:00
Huang Ying
67eb2e9907 ACPI, APEI, GHES, printk support for recoverable error via NMI
Some APEI GHES recoverable errors are reported via NMI, but printk is
not safe in NMI context.

To solve the issue, a lock-less memory allocator is used to allocate
memory in NMI handler, save the error record into the allocated
memory, put the error record into a lock-less list.  On the other
hand, an irq_work is used to delay the operation from NMI context to
IRQ context.  The irq_work IRQ handler will remove nodes from
lock-less list, printk the error record and do some further processing
include recovery operation, then free the memory.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:15:57 -04:00
Huang Ying
7f184275aa lib, Make gen_pool memory allocator lockless
This version of the gen_pool memory allocator supports lockless
operation.

This makes it safe to use in NMI handlers and other special
unblockable contexts that could otherwise deadlock on locks.  This is
implemented by using atomic operations and retries on any conflicts.
The disadvantage is that there may be livelocks in extreme cases.  For
better scalability, one gen_pool allocator can be used for each CPU.

The lockless operation only works if there is enough memory available.
If new memory is added to the pool a lock has to be still taken.  So
any user relying on locklessness has to ensure that sufficient memory
is preallocated.

The basic atomic operation of this allocator is cmpxchg on long.  On
architectures that don't have NMI-safe cmpxchg implementation, the
allocator can NOT be used in NMI handler.  So code uses the allocator
in NMI handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:15:57 -04:00
Huang Ying
f49f23abf3 lib, Add lock-less NULL terminated single list
Cmpxchg is used to implement adding new entry to the list, deleting
all entries from the list, deleting first entry of the list and some
other operations.

Because this is a single list, so the tail can not be accessed in O(1).

If there are multiple producers and multiple consumers, llist_add can
be used in producers and llist_del_all can be used in consumers.  They
can work simultaneously without lock.  But llist_del_first can not be
used here.  Because llist_del_first depends on list->first->next does
not changed if list->first is not changed during its operation, but
llist_del_first, llist_add, llist_add (or llist_del_all, llist_add,
llist_add) sequence in another consumer may violate that.

If there are multiple producers and one consumer, llist_add can be
used in producers and llist_del_all or llist_del_first can be used in
the consumer.

This can be summarized as follow:

           |   add    | del_first |  del_all
 add       |    -     |     -     |     -
 del_first |          |     L     |     L
 del_all   |          |           |     -

Where "-" stands for no lock is needed, while "L" stands for lock is
needed.

The list entries deleted via llist_del_all can be traversed with
traversing function such as llist_for_each etc.  But the list entries
can not be traversed safely before deleted from the list.  The order
of deleted entries is from the newest to the oldest added one.  If you
want to traverse from the oldest to the newest, you must reverse the
order by yourself before traversing.

The basic atomic operation of this list is cmpxchg on long.  On
architectures that don't have NMI-safe cmpxchg implementation, the
list can NOT be used in NMI handler.  So code uses the list in NMI
handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:15:56 -04:00
Huang Ying
df013ffb81 Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
cmpxchg() is widely used by lockless code, including NMI-safe lockless
code.  But on some architectures, the cmpxchg() implementation is not
NMI-safe, on these architectures the lockless code may need a
spin_trylock_irqsave() based implementation.

This patch adds a Kconfig option: ARCH_HAVE_NMI_SAFE_CMPXCHG, so that
NMI-safe lockless code can depend on it or provide different
implementation according to it.

On many architectures, cmpxchg is only NMI-safe for several specific
operand sizes. So, ARCH_HAVE_NMI_SAFE_CMPXCHG define in this patch
only guarantees cmpxchg is NMI-safe for sizeof(unsigned long).

Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Richard Henderson <rth@twiddle.net>
CC: Mikael Starvik <starvik@axis.com>
Acked-by: David Howells <dhowells@redhat.com>
CC: Yoshinori Sato <ysato@users.sourceforge.jp>
CC: Tony Luck <tony.luck@intel.com>
CC: Hirokazu Takata <takata@linux-m32r.org>
CC: Geert Uytterhoeven <geert@linux-m68k.org>
CC: Michal Simek <monstr@monstr.eu>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
CC: Kyle McMartin <kyle@mcmartin.ca>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Chen Liqin <liqin.chen@sunplusct.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Ingo Molnar <mingo@redhat.com>
CC: Chris Zankel <chris@zankel.net>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03 11:12:37 -04:00
Shawn Guo
750f463a74 dt: add of_alias_scan and of_alias_get_id
The patch adds function of_alias_scan to populate a global lookup
table with the properties of 'aliases' node and function
of_alias_get_id for drivers to find alias id from the lookup table.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
[grant.likely: add locking and rework parse loop]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-08-03 13:02:31 +01:00
Kiran Patil
dcd998ccdb tcm_fc: Handle DDP/SW fc_frame_payload_get failures in ft_recv_write_data
Problem: HW DDP context was not invalidated in case of ABORTS, etc...
This leads to the problem where memory pages which are used for DDP
as user descriptor could get reused for some other purpose (such as to
satisfy new memory allocation request either by kernel or user mode threads)
and since HW DDP context was not invalidated, HW continue to write to
those pages, hence causing memory corruption.

Fix: Either on incoming ABORTS or due to exchange time out, allowed the
target to cleanup HW DDP context if it was setup for respective ft_cmd.
Added new function to perform this cleanup, furthur it can be enhanced
for other cleanup activity.  Fix ft_recv_write_data() to properly handle
fc_frame_payload_get to return pointer to payload if it exist. If there is
no payload which is most common case (+ve case in case if DDP is working
as expected, it will return NULL. Yes, scope of buf is limited to printk.
Invalidation of HW context (which is done inside ft_invl_hw_context() is
necessary in SUCCESS and FAILURE case of DDP. Hence invalidation is DONE
as long as there was DDP setup (whether it worked correctly or not,

NOTE: For some reason, if there is any error w.r.t DDP such as out of
order packet reception, HW simply post the full packet in rx queue.

Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Cc: Robert W Love <robert.w.love@intel.com>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
2011-08-03 09:38:21 +00:00
Linus Torvalds
f48d1915b8 Merge branch 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  efivars: fix warnings when CONFIG_PSTORE=n
2011-08-02 21:18:39 -10:00
Linus Torvalds
f673b7c2c5 Merge branch 'tools-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'tools-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  tools/power turbostat: fit output into 80 columns on snb-ep
  tools/power x86_energy_perf_policy: fix print of uninitialized string
2011-08-02 21:17:39 -10:00
Linus Torvalds
c299eba3c5 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (28 commits)
  ACPI:  delete stale reference in kernel-parameters.txt
  ACPI: add missing _OSI strings
  ACPI: remove NID_INVAL
  thermal: make THERMAL_HWMON implementation fully internal
  thermal: split hwmon lookup to a separate function
  thermal: hide CONFIG_THERMAL_HWMON
  ACPI print OSI(Linux) warning only once
  ACPI: DMI workaround for Asus A8N-SLI Premium and Asus A8N-SLI DELUX
  ACPI / Battery: propagate sysfs error in acpi_battery_add()
  ACPI / Battery: avoid acpi_battery_add() use-after-free
  ACPI: introduce "acpi_rsdp=" parameter for kdump
  ACPI: constify ops structs
  ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS
  ACPI: fix 80 char overflow
  ACPI / Battery: Resolve the race condition in the sysfs_remove_battery()
  ACPI / Battery: Add the check before refresh sysfs in the battery_notify()
  ACPI / Battery: Add the hibernation process in the battery_notify()
  ACPI / Battery: Rename acpi_battery_quirks2 with acpi_battery_quirks
  ACPI / Battery: Change 16-bit signed negative battery current into correct value
  ACPI / Battery: Add the power unit macro
  ...
2011-08-02 21:17:02 -10:00
Linus Torvalds
1850536b93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile/mm/init.c: trivial: use BUG_ON
  arch/tile: remove useless set_fixmap_nocache() macro
  arch/tile: add hypervisor-based character driver for SPI flash ROM
  ioctl-number.txt: add the tile hardwall ioctl range
  tile: use generic-y format for one-line asm-generic headers
  clocksource: tile: convert to use clocksource_register_hz
2011-08-02 21:16:11 -10:00
Linus Torvalds
ed8f37370d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
  Btrfs: don't call writepages from within write_full_page
  Btrfs: Remove unused variable 'last_index' in file.c
  Btrfs: clean up for find_first_extent_bit()
  Btrfs: clean up for wait_extent_bit()
  Btrfs: clean up for insert_state()
  Btrfs: remove unused members from struct extent_state
  Btrfs: clean up code for merging extent maps
  Btrfs: clean up code for extent_map lookup
  Btrfs: clean up search_extent_mapping()
  Btrfs: remove redundant code for dir item lookup
  Btrfs: make acl functions really no-op if acl is not enabled
  Btrfs: remove remaining ref-cache code
  Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
  Btrfs: use wait_event()
  Btrfs: check the nodatasum flag when writing compressed files
  Btrfs: copy string correctly in INO_LOOKUP ioctl
  Btrfs: don't print the leaf if we had an error
  btrfs: make btrfs_set_root_node void
  Btrfs: fix oops while writing data to SSD partitions
  Btrfs: Protect the readonly flag of block group
  ...

Fix up trivial conflicts (due to acl and writeback cleanups) in
 - fs/btrfs/acl.c
 - fs/btrfs/ctree.h
 - fs/btrfs/extent_io.c
2011-08-02 21:14:05 -10:00
Linus Torvalds
a6b11f5338 Merge branch 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6
* 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6:
  MAINTAINERS: Add keyword match for of_match_table to device tree section
  of: constify property name parameters for helper functions
  input: xilinx_ps2: Add missing of_address.h header
  of: address: use resource_size helper
2011-08-02 20:49:57 -10:00
Linus Torvalds
73a9fe86fa Merge branch 'spi/next' of git://git.secretlab.ca/git/linux-2.6
* 'spi/next' of git://git.secretlab.ca/git/linux-2.6:
  spi/pl022: remove function cannot exit
2011-08-02 20:49:38 -10:00
Linus Torvalds
f3406816bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (34 commits)
  dm table: set flush capability based on underlying devices
  dm crypt: optionally support discard requests
  dm raid: add md raid1 support
  dm raid: support metadata devices
  dm raid: add write_mostly parameter
  dm raid: add region_size parameter
  dm raid: improve table parameters documentation
  dm ioctl: forbid multiple device specifiers
  dm ioctl: introduce __get_dev_cell
  dm ioctl: fill in device parameters in more ioctls
  dm flakey: add corrupt_bio_byte feature
  dm flakey: add drop_writes
  dm flakey: support feature args
  dm flakey: use dm_target_offset and support discards
  dm table: share target argument parsing functions
  dm snapshot: skip reading origin when overwriting complete chunk
  dm: ignore merge_bvec for snapshots when safe
  dm table: clean dm_get_device and move exports
  dm raid: tidy includes
  dm ioctl: prevent empty message
  ...
2011-08-02 20:49:21 -10:00
Linus Torvalds
4400478ba3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-watchdog
* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-watchdog:
  watchdog: Cleanup WATCHDOG_CORE help text
  watchdog: Fix POST failure on ASUS P5N32-E SLI and similar boards
  watchdog: shwdt: fix usage of mod_timer
2011-08-02 20:48:47 -10:00
Linus Torvalds
0d7e92da50 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: asihpi - Clarify adapter index validity check
  ALSA: asihpi - Don't leak firmware if mem alloc fails
  ALSA: rtctimer.c needs module.h
  ASoC: Fix txx9aclc.c build
  ALSA: hdspm - Add firmware revision 0xcc for RME MADI
  ALSA: hdspm - Fix reported external sample rate on RME MADI and MADIface
  ALSA: hdspm - Provide MADI speed mode selector on RME MADI and MADIface
  ALSA: sound/core/pcm_compat.c: adjust array index
2011-08-02 20:48:22 -10:00
Len Brown
d30c4b7a87 tools/power turbostat: fit output into 80 columns on snb-ep
Reduce columns for package number to 1.
If you can afford more than 9 packages,
you can also afford a terminal with more than 80 columns:-)

Also shave a column also off the package C-states

Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 18:33:31 -04:00
Tony Luck
b728a5c806 efivars: fix warnings when CONFIG_PSTORE=n
drivers/firmware/efivars.c:161: warning: ‘utf16_strlen’ defined but not used
utf16_strlen() is only used inside CONFIG_PSTORE - make this "static inline"
to shut the compiler up [thanks to hpa for the suggestion].

drivers/firmware/efivars.c:602: warning: initialization from incompatible pointer type
Between v1 and v2 of this patch series we decided to make the "part" number
unsigned - but missed fixing the stub version of efi_pstore_write()

Acked-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Mike Waychison <mikew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-08-02 15:08:30 -07:00
Len Brown
4a8f5058bd Merge branches 'acpica', 'battery', 'boot-irqs', 'bz-24492', 'bz-9528', 'from-akpm', 'kexec-param' and 'misc' into release
Conflicts:
	Documentation/kernel-parameters.txt

Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 17:22:09 -04:00
Len Brown
6c8111e9a0 ACPI: delete stale reference in kernel-parameters.txt
Says for acpi=
                        See also Documentation/power/pm.txt, pci=noacpi

but this file does not exist

Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 17:15:33 -04:00
Julia Lawall
d1afa65ca5 arch/tile/mm/init.c: trivial: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);

@@ identifier x; @@
-if (!x) BUG();
+BUG_ON(!x);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2011-08-02 16:26:59 -04:00
Chris Metcalf
3d071cd313 Merge tag 'v3.0' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus 2011-08-02 16:14:02 -04:00
Shaohua Li
aa165971c2 ACPI: add missing _OSI strings
Linux supports some optional features, but it should notify the BIOS about
them via the _OSI method.  Currently Linux doesn't notify any, which might
make such features not work because the BIOS doesn't know about them.

Jarosz has a system which needs this to make ACPI processor aggregator
device work.

Reported-by: "Jarosz, Sebastian" <sebastian.jarosz@intel.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 14:52:30 -04:00
David Rientjes
f52e00c668 ACPI: remove NID_INVAL
b552a8c56d ("ACPI: remove NID_INVAL") removed the left over uses of
NID_INVAL, but didn't actually remove the definition.  Remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 14:52:13 -04:00
Jean Delvare
31f5396ad3 thermal: make THERMAL_HWMON implementation fully internal
THERMAL_HWMON is implemented inside the thermal_sys driver and has no
effect on drivers implementing thermal zones, so they shouldn't see
anything related to it in <linux/thermal.h>.  Making the THERMAL_HWMON
implementation fully internal has two advantages beyond the cleaner
design:

* This avoids rebuilding all thermal drivers if the THERMAL_HWMON
  implementation changes, or if CONFIG_THERMAL_HWMON gets enabled or
  disabled.

* This avoids breaking the thermal kABI in these cases too, which should
  make distributions happy.

The only drawback I can see is slightly higher memory fragmentation, as
the number of kzalloc() calls will increase by one per thermal zone.  But
I doubt it will be a problem in practice, as I've never seen a system with
more than two thermal zones.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Rene Herman <rene.herman@gmail.com>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 14:51:57 -04:00
Jean Delvare
0d97d7a494 thermal: split hwmon lookup to a separate function
We'll soon need to reuse it.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Rene Herman <rene.herman@gmail.com>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 14:51:41 -04:00
Jean Delvare
ab92402af0 thermal: hide CONFIG_THERMAL_HWMON
It's about time to revert 16d7523973 ("thermal: Create
CONFIG_THERMAL_HWMON=n").  Anybody running a kernel >= 2.6.40 would also
be running a recent enough version of lm-sensors.

Actually having CONFIG_THERMAL_HWMON is pretty convenient so instead of
dropping it, we keep it but hide it.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Rene Herman <rene.herman@gmail.com>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-02 14:51:25 -04:00
Mark Brown
d945fa0da7 MAINTAINERS: Add keyword match for of_match_table to device tree section
If a patch is working with an of_match_table it's probably adding an OF
binding to a driver in which case the binding ought to be reviewed by the
device tree folks to make sure that things like the device matches are set
up correctly.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-08-02 17:03:41 +01:00
Jamie Iles
aac285c6cb of: constify property name parameters for helper functions
The helper functions for reading u32 integers, u32 arrays and strings
should have the property name as a const pointer.

Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-08-02 16:50:47 +01:00
Linus Walleij
50658b6602 spi/pl022: remove function cannot exit
The remove function in the PL022 driver cannot abort the remove
function any way, so restructure the code so as not to make that
assumption. Remove will now proceed no matter whether it can
stop the transfer queue or not.

Reported-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-08-02 14:54:11 +01:00
Mike Snitzer
ed8b752bcc dm table: set flush capability based on underlying devices
DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
regardless of whether or not a given DM device's underlying devices
also advertised a need for them.

Block's flush-merge changes from 2.6.39 have proven to be more costly
for DM devices.  Performance regressions have been reported even when
DM's underlying devices do not advertise that they have a write cache.

Fix the performance regressions by configuring a DM device's flushing
capabilities based on those of the underlying devices' capabilities.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Milan Broz
772ae5f54d dm crypt: optionally support discard requests
Add optional parameter field to dmcrypt table and support
"allow_discards" option.

Discard requests bypass crypt queue processing. Bio is simple remapped
to underlying device.

Note that discard will be never enabled by default because of security
consequences.  It is up to the administrator to enable it for encrypted
devices.

(Note that userspace cryptsetup does not understand new optional
parameters yet.  Support for this will come later.  Until then, you
should use 'dmsetup' to enable and disable this.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Jonathan Brassow
327372797c dm raid: add md raid1 support
Support the MD RAID1 personality through dm-raid.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
b12d437b73 dm raid: support metadata devices
Add the ability to parse and use metadata devices to dm-raid.  Although
not strictly required, without the metadata devices, many features of
RAID are unavailable.  They are used to store a superblock and bitmap.

The role, or position in the array, of each device must be recorded in
its superblock.  This is to help with fault handling, array reshaping,
and sanity checks.  RAID 4/5/6 devices must be loaded in a specific order:
in this way, the 'array_position' field helps validate the correctness
of the mapping when it is loaded.  It can be used during reshaping to
identify which devices are added/removed.  Fault handling is impossible
without this field.  For example, when a device fails it is recorded in
the superblock.  If this is a RAID1 device and the offending device is
removed from the array, there must be a way during subsequent array
assembly to determine that the failed device was the one removed.  This
is done by correlating the 'array_position' field and the bit-field
variable 'failed_devices'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
46bed2b5c1 dm raid: add write_mostly parameter
Add the write_mostly parameter to RAID1 dm-raid tables.

This allows the user to set the WriteMostly flag on a RAID1 device that
should normally be avoided for read I/O.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
c1084561bb dm raid: add region_size parameter
Allow the user to specify the region_size.

Ensures that the supplied value meets md's constraints, viz. the number of
regions does not exceed 2^21.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
c0a2fa1ef1 dm raid: improve table parameters documentation
Add more information about some dm-raid table parameters and clarify how
parameters are printed when 'dmsetup table' is issued.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka
759dea204c dm ioctl: forbid multiple device specifiers
Exactly one of name, uuid or device must be specified when referencing
an existing device.  This removes the ambiguity (risking the wrong
device being updated) if two conflicting parameters were specified.
Previously one parameter got used and any others were ignored silently.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka
ba2e19b0f4 dm ioctl: introduce __get_dev_cell
Move logic to find device based on major/minor number to a separate
function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell).
This makes the function __find_device_hash_cell more straightforward.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka
0ddf9644cc dm ioctl: fill in device parameters in more ioctls
Move parameter filling from find_device to __find_device_hash_cell.

This patch causes ioctls using __find_device_hash_cell
(DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD)
to return device parameters, bringing them into line with the other
ioctls.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer
a3998799fb dm flakey: add corrupt_bio_byte feature
Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a
specified position with a specified value during intervals when the device is
"down".

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer
b26f5e3d71 dm flakey: add drop_writes
Add 'drop_writes' option to drop writes silently while the
device is 'down'.  Reads are not touched.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
dfd068b01f dm flakey: support feature args
Add the ability to specify arbitrary feature flags when creating a
flakey target.  This code uses the same target argument helpers that
the multipath target does.

Also remove the superfluous 'dm-flakey' prefixes from the error messages,
as they already contain the prefix 'flakey'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
30e4171bfe dm flakey: use dm_target_offset and support discards
Use dm_target_offset() and support discards.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
498f0103ea dm table: share target argument parsing functions
Move multipath target argument parsing code into dm-table so other
targets can share it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka
a6e50b409d dm snapshot: skip reading origin when overwriting complete chunk
If we write a full chunk in the snapshot, skip reading the origin device
because the whole chunk will be overwritten anyway.

This patch changes the snapshot write logic when a full chunk is written.
In this case:
  1. allocate the exception
  2. dispatch the bio (but don't report the bio completion to device mapper)
  3. write the exception record
  4. report bio completed

Callbacks must be done through the kcopyd thread, because callbacks must not
race with each other.  So we create two new functions:

  dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback.
  (This function must not be called from interrupt context.)

  dm_kcopyd_do_callback: submit callback.
  (This function may be called from interrupt context.)

Performance test (on snapshots with 4k chunk size):
  without the patch:
    non-direct-io sequential write (dd):    17.7MB/s
    direct-io sequential write (dd):        20.9MB/s
    non-direct-io random write (mkfs.ext2): 0.44s

  with the patch:
    non-direct-io sequential write (dd):    26.5MB/s
    direct-io sequential write (dd):        33.2MB/s
    non-direct-io random write (mkfs.ext2): 0.27s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka
d5b9dd04bd dm: ignore merge_bvec for snapshots when safe
Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
whether the device can accept bios larger than the size its merge
function returns.  When set, use this to send large bios to snapshots
which can split them if necessary.  Snapshot I/O may be significantly
fragmented and this approach seems to improve peformance.

Before the patch, dm_set_device_limits restricted bio size to page size
if the underlying device had a merge function and the target didn't
provide a merge function.  After the patch, dm_set_device_limits
restricts bio size to page size if the underlying device has a merge
function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
provide a merge function.

The snapshot target can't provide a merge function because when the merge
function is called, it is impossible to determine where the bio will be
remapped.  Previously this led us to impose a 4k limit, which we can
now remove if the snapshot store is located on a device without a merge
function.  Together with another patch for optimizing full chunk writes,
it improves performance from 29MB/s to 40MB/s when writing to the
filesystem on snapshot store.

If the snapshot store is placed on a non-dm device with a merge function
(such as md-raid), device mapper still limits all bios to page size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mike Snitzer
0864901254 dm table: clean dm_get_device and move exports
There is no need for __table_get_device to be factored out.
Also move the exports to the end of their respective functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00