kernel-ark

Author	SHA1	Message	Date
Paul Jimenez	afb24528f9	dm: table use list_for_each This patch is some minor janitorish cleanup, using some macros from linux/list.h (already #included via dm.h) to improve readability. Signed-off-by: Paul Jimenez <pj@place.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:59 +00:00
Milan Broz	76c072b48e	dm ioctl: move compat code Move compat_ioctl handling into dm-ioctl.c. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:56 +00:00
Alasdair G Kergon	27238b2bea	dm ioctl: remove lock_kernel Remove lock_kernel() from the device-mapper ioctls - there should be sufficient internal locking already where required. Also remove some superfluous casts. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:53 +00:00
Alasdair G Kergon	b9249e5568	dm: mark function lists static Add a couple of statics. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:51 +00:00
Milan Broz	7e5c1e830b	dm: add missing memory barrier to dm_suspend Add memory barrier to fix atomic_read of pending value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:49 +00:00
NeilBrown	6ed3003c19	md: fix an occasional deadlock in raid5 raid5's 'make_request' function calls generic_make_request on underlying devices and if we run out of stripe heads, it could end up waiting for one of those requests to complete. This is bad as recursive calls to generic_make_request go on a queue and are not even attempted until make_request completes. So: don't make any generic_make_request calls in raid5 make_request until all waiting has been done. We do this by simply setting STRIPE_HANDLE instead of calling handle_stripe(). If we need more stripe_heads, raid5d will get called to process the pending stripe_heads which will call generic_make_request from a This change by itself causes a performance hit. So add a change so that raid5_activate_delayed is only called at unplug time, never in raid5. This seems to bring back the performance numbers. Calling it in raid5d was sometimes too soon... Neil said: How about we queue it for 2.6.25-rc1 and then about when -rc2 comes out, we queue it for 2.6.24.y? Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Tested-by: dean gaudet <dean@arctic.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	73c34431c7	md: change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING. Finish ITERATE_ to for_each conversion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	d089c6af10	md: change ITERATE_RDEV to rdev_for_each As this is more in line with common practice in the kernel. Also swap the args around to be more like list_for_each. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	29ac4aa3fc	md: change INTERATE_MDDEV to for_each_mddev As this is more consistent with kernel style. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	20a49ff679	md: change a few 'int' to 'size_t' in md As suggested by Andrew Morton. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	177a99b23e	md: fix use-after-free bug when dropping an rdev from an md array Due to possible deadlock issues we need to use a schedule work to kobject_del an 'rdev' object from a different thread. A recent change means that kobject_add no longer gets a refernce, and kobject_del doesn't put a reference. Consequently, we need to explicitly hold a reference to ensure that the last reference isn't dropped before the scheduled work get a chance to call kobject_del. Also, rename delayed_delete to md_delayed_delete to that it is more obvious in a stack trace which code is to blame. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	a17184a911	md: allow an md array to appear with 0 drives if it has external metadata Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	ca38805945	md: lock address when changing attributes of component devices Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c5d79adba7	md: allow devices to be shared between md arrays Currently, a given device is "claimed" by a particular array so that it cannot be used by other arrays. This is not ideal for DDF and other metadata schemes which have their own partitioning concept. So for externally managed metadata, just claim the device for md in general, require that "offset" and "size" are set properly for each device, and make sure that if a device is included in different arrays then the active sections do not overlap. This involves adding another flag to the rdev which makes it awkward to set "->flags = 0" to clear certain flags. So now clear flags explicitly by name when we want to clear things. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	1ec4a9398d	md: set and test the ->persistent flag for md devices more consistently If you try to start an array for which the number of raid disks is listed as zero, md will currently try to read metadata off any devices that have been given. This was done because the value of raid_disks is used to signal whether array details have been provided by userspace (raid_disks > 0) or must be read from the devices (raid_disks == 0). However for an array without persistent metadata (or with externally managed metadata) this is the wrong thing to do. So we add a test in do_md_run to give an error if raid_disks is zero for non-persistent arrays. This requires that mddev->persistent is set corrently at this point, which it currently isn't for in-kernel autodetected arrays. So set ->persistent for autodetect arrays, and remove the settign in super_*_validate which is now redundant. Also clear ->persistent when stopping an array so it is consistently zero when starting an array. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c620727779	md: allow a maximum extent to be set for resyncing This allows userspace to control resync/reshape progress and synchronise it with other activities, such as shared access in a SAN, or backing up critical sections during a tricky reshape. Writing a number of sectors (which must be a multiple of the chunk size if such is meaningful) causes a resync to pause when it gets to that point. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c303da6d71	md: give userspace control over removing failed devices when external metdata in use When a device fails, we must not allow an further writes to the array until the device failure has been recorded in array metadata. When metadata is managed externally, this requires some synchronisation... Allow/require userspace to explicitly remove failed devices from active service in the array by writing 'none' to the 'slot' attribute. If this reduces the number of failed devices to 0, the write block will automatically be lowered. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	e691063a61	md: support 'external' metadata for md arrays - Add a state flag 'external' to indicate that the metadata is managed externally (by user-space) so important changes need to be left of user-space to handle. Alternates are non-persistant ('none') where there is no stable metadata - after the array is stopped there is no record of it's status - and internal which can be version 0.90 or version 1.x These are selected by writing to the 'metadata' attribute. - move the updating of superblocks (sync_sbs) to after we have checked if there are any superblocks or not. - New array state 'write_pending'. This means that the metadata records the array as 'clean', but a write has been requested, so the metadata has to be updated to record a 'dirty' array before the write can continue. This change is reported to md by writing 'active' to the array_state attribute. - tidy up marking of sb_dirty: - don't set sb_dirty when resync finishes as md_check_recovery calls md_update_sb when the sync thread finishes anyway. - Don't set sb_dirty in multipath_run as the array might not be dirty. - don't mark superblock dirty when switching to 'clean' if there is no internal superblock (if external, userspace can choose to update the superblock whenever it chooses to). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	b47490c9bc	md: Update md bitmap during resync. Currently an md array with a write-intent bitmap does not updated that bitmap to reflect successful partial resync. Rather the entire bitmap is updated when the resync completes. This is because there is no guarentee that resync requests will complete in order, and tracking each request individually is unnecessarily burdensome. However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not notciably affect resync performance. [snitzer@gmail.com: export bitmap_cond_end_sync] Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Mike Snitzer" <snitzer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
H. Peter Anvin	66c811e993	md: raid6: clean up the style of raid6test/test.c Clean up the coding style in raid6test/test.c. Break it apart into subfunctions to make the code more readable. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
H. Peter Anvin	98ec302be5	md: raid6: Fix mktable.c Make both mktables.c and its output CodingStyle compliant. Update the copyright notice. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
Oliver Pinter	54212cf405	coding style cleanups for drivers/md/mktables.c Signed-off-by: Oliver Pinter <oliver.pntr@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
Greg Kroah-Hartman	c10997f657	Kobject: convert drivers/* from kobject_unregister() to kobject_put() There is no need for kobject_unregister() anymore, thanks to Kay's kobject cleanup changes, so replace all instances of it with kobject_put(). Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:40 -08:00
Greg Kroah-Hartman	f9cb074bff	Kobject: rename kobject_init_ng() to kobject_init() Now that the old kobject_init() function is gone, rename kobject_init_ng() to kobject_init() to clean up the namespace. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman	b2d6db5878	Kobject: rename kobject_add_ng() to kobject_add() Now that the old kobject_add() function is gone, rename kobject_add_ng() to kobject_add() to clean up the namespace. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman	649316b25b	Kobject: convert drivers/md/md.c to use kobject_init/add_ng() This converts the code to use the new kobject functions, cleaning up the logic in doing so. Cc: Neil Brown <neilb@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:37 -08:00
Kay Sievers	edfaa7c365	Driver core: convert block from raw kobjects to core devices This moves the block devices to /sys/class/block. It will create a flat list of all block devices, with the disks and partitions in one directory. For compatibility /sys/block is created and contains symlinks to the disks. /sys/class/block \|-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda \|-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1 \|-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10 \|-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5 \|-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6 \|-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7 \|-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8 \|-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9 `-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0 /sys/block/ \|-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda `-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0 Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman	3830c62fef	Kobject: change drivers/md/md.c to use kobject_init_and_add Stop using kobject_register, as this way we can control the sending of the uevent properly, after everything is properly initialized. Cc: Neil Brown <neilb@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:29 -08:00
Dan Williams	0f94e87cde	md: fix data corruption when a degraded raid5 array is reshaped We currently do not wait for the block from the missing device to be computed from parity before copying data to the new stripe layout. The change in the raid6 code is not techincally needed as we don't delay data block recovery in the same way for raid6 yet. But making the change now is safer long-term. This bug exists in 2.6.23 and 2.6.24-rc Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:35 -08:00
Milan Broz	91e1062592	dm crypt: use bio_add_page Fix possible max_phys_segments violation in cloned dm-crypt bio. In write operation dm-crypt needs to allocate new bio request and run crypto operation on this clone. Cloned request has always the same size, but number of physical segments can be increased and violate max_phys_segments restriction. This can lead to data corruption and serious hardware malfunction. This was observed when using XFS over dm-crypt and at least two HBA controller drivers (arcmsr, cciss) recently. Fix it by using bio_add_page() call (which tests for other restrictions too) instead of constructing own biovec. All versions of dm-crypt are affected by this bug. Cc: stable@kernel.org Cc: dm-crypt@saout.de Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:13 +00:00
Neil Brown	91212507f9	dm: merge max_hw_sector Make sure dm honours max_hw_sectors of underlying devices We still have no firm testing evidence in support of this patch but believe it may help to resolve some bug reports. - agk Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:12 +00:00
Alasdair G Kergon	69267a30be	dm: trigger change uevent on rename Insert a missing KOBJ_CHANGE notification when a device is renamed. Cc: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:11 +00:00
Milan Broz	adfe47702c	dm crypt: fix write endio Fix BIO_UPTODATE test for write io. Cc: stable@kernel.org Cc: dm-crypt@saout.de Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:10 +00:00
Paul Mundt	d1622e8909	dm mpath: hp requires scsi With CONFIG_SCSI=n __scsi_print_sense() is never linked in. drivers/built-in.o: In function `hp_sw_end_io': dm-mpath-hp-sw.c:(.text+0x914f8): undefined reference to `__scsi_print_sense' Caught with a randconfig on current git. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:09 +00:00
Jun'ichi Nomura	512875bd96	dm: table detect io beyond device This patch fixes a panic on shrinking a DM device if there is outstanding I/O to the part of the device that is being removed. (Normally this doesn't happen - a filesystem would be resized first, for example.) The bug is that __clone_and_map() assumes dm_table_find_target() always returns a valid pointer. It may fail if a bio arrives from the block layer but its target sector is no longer included in the DM btree. This patch appends an empty entry to table->targets[] which will be returned by a lookup beyond the end of the device. After calling dm_table_find_target(), __clone_and_map() and target_message() check for this condition using dm_target_is_valid(). Sample test script to trigger oops:	2007-12-20 17:32:08 +00:00
Dan Williams	6c55be8b96	raid5: fix unending write sequence <debug output from Joel's system> handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0 check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000 check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000 check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000 check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000 locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0 for sector 7629696, rmw=0 rcw=0 </debug> These blocks were prepared to be written out, but were never handled in ops_run_biodrain(), so they remain locked forever. The operations flags are all clear which means handle_stripe() thinks nothing else needs to be done. This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it should not have been. This patch cleans up cases where the code looks at sh->ops.pending when it should be looking at the consistent stack-based snapshot of the operations flags. Report from Joel: Resync done. Patch fix this bug. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Cc: <stable@kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Alan D. Brunelle	2ad8b1ef11	Add UNPLUG traces to all appropriate places Added blk_unplug interface, allowing all invocations of unplugs to result in a generated blktrace UNPLUG. Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-09 13:41:32 +01:00
Neil Brown	def6ae26a9	md: fix misapplied patch in raid5.c commit `4ae3f847e4` ("md: raid5: fix clearing of biofill operations") did not get applied correctly, presumably due to substantial similarities between handle_stripe5 and handle_stripe6. This patch moves the chunk of new code from handle_stripe6 (where it isn't needed (yet)) to handle_stripe5. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Dan Williams" <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-05 15:12:32 -08:00
Vasily Averin	5ec140e600	dm: bounce_pfn limit added Device mapper uses its own bounce_pfn that may differ from one on underlying device. In that way dm can build incorrect requests that contain sg elements greater than underlying device is able to handle. This is the cause of slab corruption in i2o layer, occurred on i386 arch when very long direct IO requests are addressed to dm-over-i2o device. Signed-off-by: Vasily Averin <vvs@sw.ru> Cc: <stable@kernel.org> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-02 08:47:25 +01:00
Al Viro	ca5cd877ae	x86 merge fallout: uml Don't undef __i386__/__x86_64__ in uml anymore, make sure that (few) places that required adjusting the ifdefs got those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 07:41:32 -07:00
Herbert Xu	68e3f5dd4d	[CRYPTO] users: Fix up scatterlist conversion errors This patch fixes the errors made in the users of the crypto layer during the sg_init_table conversion. It also adds a few conversions that were missing altogether. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-27 00:52:07 -07:00
Jens Axboe	642f149031	SG: Change sg_set_page() to take length and offset argument Most drivers need to set length and offset as well, so may as well fold those three lines into one. Add sg_assign_page() for those two locations that only needed to set the page, where the offset/length is set outside of the function context. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-24 11:20:47 +02:00
Dan Williams	4ae3f847e4	md: raid5: fix clearing of biofill operations ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the 'pending' and 'ack' bits. Since the test_and_ack_op() macro only checks against 'complete' it can get an inconsistent snapshot of pending work. Move the clearing of these bits to handle_stripe5(), under the lock. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Stable <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-23 08:32:06 -07:00
NeilBrown	85bfb4da8c	md: fix an unsigned compare to allow creation of bitmaps with v1.0 metadata As page->index is unsigned, this all becomes an unsigned comparison, which almost always returns an error. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Stable <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-23 08:32:06 -07:00
Jens Axboe	45711f1af6	[SG] Update drivers to use sg helpers Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-22 21:19:53 +02:00
Linus Torvalds	c00046c279	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (74 commits) fix do_sys_open() prototype sysfs: trivial: fix sysfs_create_file kerneldoc spelling mistake Documentation: Fix typo in SubmitChecklist. Typo: depricated -> deprecated Add missing profile=kvm option to Documentation/kernel-parameters.txt fix typo about TBI in e1000 comment proc.txt: Add /proc/stat field small documentation fixes Fix compiler warning in smount example program from sharedsubtree.txt docs/sysfs: add missing word to sysfs attribute explanation documentation/ext3: grammar fixes Documentation/java.txt: typo and grammar fixes Documentation/filesystems/vfs.txt: typo fix include/asm-*/system.h: remove unused set_rmb(), set_wmb() macros trivial copy_data_pages() tidy up Fix typo in arch/x86/kernel/tsc_32.c file link fix for Pegasus USB net driver help remove unused return within void return function Typo fixes retrun -> return x86 hpet.h: remove broken links ...	2007-10-19 20:36:17 -07:00
Milan Broz	80fd662683	dm crypt: tidy pending Add crypt prefix to dec_pending to avoid confusing it in backtraces with the dm core function of the same name. No functional change here. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:28 +01:00
Mike Anderson	b15546f942	dm mpath: send uevents This patch adds calls to dm_path_event for a failed path and a reinstated path. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:27 +01:00
Mike Anderson	7a8c3d3b92	dm: uevent generate events This patch adds support for the dm_path_event dm_send_event functions which create and send udev events. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:26 +01:00
Mike Anderson	51e5b2bd34	dm: add uevent to core This patch adds a uevent skeleton to device-mapper. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:24 +01:00
Mike Anderson	96a1f7dba6	dm: export name and uuid This patch adds a function to obtain a copy of a mapped device's name and uuid. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:23 +01:00
Jonathan Brassow	aa5617c553	dm raid1: add mirror_set to struct mirror Store a pointer to the owning mirror_set structure within each mirror structure for a subsequent patch to use. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:22 +01:00
Jonathan Brassow	6b3df0d7a5	dm log: split suspend There are now two phases to a suspend in device-mapper - presuspend and postsuspend. This patch removes the single 'suspend' in the logging API and replaces it with 'presuspend' and 'postsuspend' functions to align it better with core device-mapper. A subsequent patch will make use of 'presuspend'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:21 +01:00
Dave Wysochanski	fe97e2aa05	dm mpath: hp retry if not ready This patch adds retries to the hp hardware handler, and utilizes the MP_RETRY flag of dm-multipath. For now in the hp handler, if we get a pg_init completed with a check condition we just assume we can retry the pg_init command. We make this assumption because of incomplete data on specific check condition code of the HP hardware, and because testing has shown the HP path initialization command to be idempotent. The number of times we retry is settable via the "pg_init_retries" multipath map feature. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:20 +01:00
Dave Wysochanski	16ebbf3584	dm mpath: add hp handler This patch adds the most basic dm-multipath hardware support for the HP active/passive arrays. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:19 +01:00
Dave Wysochanski	c9e45581ad	dm mpath: add retry pg init This patch allows a failed path group initialisation command to be retried. It adds a generic MP_RETRY flag and a "pg_init_retries" feature to device-mapper multipath which limits the number of retries. 1. A hw handler sends a path initialization command to the storage and the command completes with an error code indicating the command should be retried. 2. The hardware handler calls dm_pg_init_complete() with MP_RETRY set in err_flags to ask the dm multipath core to retry. 3. If the retry limit has not been exceeded, pg_init() is retried. Otherwise fail_path() is called. If you are using the userspace multipath-tools or device-mapper-multipath package, you can set pg_init_retries in the 'device' section of your /etc/multipath.conf file. For example: features "2 pg_init_retries 7" The number of PG retries attempted is reported in the 'dmsetup status' output. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:18 +01:00
Milan Broz	636d5786c4	dm crypt: tidy labels Replace numbers with names in labels in error paths, to avoid confusion when new one get added between existing ones. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:17 +01:00
Milan Broz	d469f84197	dm crypt: tidy whitespace Clean up, convert some spaces to tabs. No functional change here. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:15 +01:00
Milan Broz	cabf08e4d3	dm crypt: add post processing queue Add post-processing queue (per crypt device) for read operations. Current implementation uses only one queue for all operations and this can lead to starvation caused by many requests waiting for memory allocation. But the needed memory-releasing operation is queued after these requests (in the same queue). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:14 +01:00
Milan Broz	9934a8bea2	dm crypt: use per device singlethread workqueues Use a separate single-threaded workqueue for each crypt device instead of one global workqueue. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:13 +01:00
Alasdair G Kergon	d336416ff1	dm mpath: emc fix an error message Correct an error message, reported by Michael Wood <michael@frogfoot.com>. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:12 +01:00
Alasdair G Kergon	051814c69f	dm: bio_list macro renaming Remove BIO_LIST and DEFINE_BIO_LIST macros that gain us nothing since contents are initialised to NULL. Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:11 +01:00
Jesper Juhl	bb56acf840	dm io:ctl remove vmalloc void cast In drivers/md/dm-ioctl.c::copy_params() there's a call to vmalloc() where we currently cast the return value, but that's pretty pointless given that vmalloc() returns "void *". Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:10 +01:00
Milan Broz	9e4e5f87eb	dm: tidy bio_io_error usage Use bio_io_error() in only two places and tidy the code, preparing for later patches. There is no functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:09 +01:00
Matthias Kaehlcke	def5b5b26e	kcopyd use mutex instead of semaphore Kcopyd uses a semaphore as mutex. Use the mutex API instead of the (binary) semaphore, Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2007-10-20 02:01:08 +01:00
Dmitry Monakhov	094262db9e	dm: use kzalloc Convert kmalloc() + memset() to kzalloc(). Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:07 +01:00
vignesh babu	6f3c3f0afa	dm: use is_power_of_2 Replacing n & (n - 1) for power of 2 check by is_power_of_2(n) Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:06 +01:00
Jun'ichi Nomura	ae9da83f6d	dm: fix thaw_bdev This patch fixes a bd_mount_sem counter corruption bug in device-mapper. thaw_bdev() should be called only when freeze_bdev() was called for the device. Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter. struct block_device with the corrupted semaphore may remain in slab cache and be reused later. Attached patch will fix it by calling unlock_fs() instead. unlock_fs() will determine whether it should call thaw_bdev() by checking the device is frozen or not. Easy reproducer is: #!/bin/sh while [ 1 ]; do dmsetup --notable create a dmsetup --nolockfs suspend a dmsetup remove a done It's not easy to see the effect of corrupted semaphore. So I have tested with putting printk below in bdev_alloc_inode(): if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1) printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n", atomic_read(&ei->bdev.bd_mount_sem.count), &ei->bdev); Without the patch, I saw something like: Incorrect semaphore count = 17 (f2ab91c0) With the patch, the message didn't appear. The bug was introduced in 2.6.16 with this bug fix: commit `d9dde59ba0` Date: Fri Feb 24 13:04:24 2006 -0800 [PATCH] dm: missing bdput/thaw_bdev at removal Need to unfreeze and release bdev otherwise the bdev inode with inconsistent state is reused later and cause problem. and backported to 2.6.15.5. It occurs only in free_dev(), which is called only when the dm device is removed. The buggy code is executed only if md->suspended_bdev is non-NULL and that can happen only when the device was suspended without noflush. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2007-10-20 02:01:05 +01:00
Milan Broz	79662d1ea3	dm delay: fix status Fix missing space in dm-delay target status output if separate read and write delay are configured. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:04 +01:00
Dmitry Monakhov	2e64a0f928	dm delay: fix ctr error paths Add missing 'dm_put_device' to dm-delay target constructor. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:02 +01:00
Dmitry Monakhov	a72cf737e0	dm raid1: fix leakage Add missing 'dm_io_client_destroy' to alloc_context error path. Reorganize mirror constructor error path in order to prevent workqueue leakage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:01 +01:00
Dmitry Monakhov	815f9e3270	dm crypt: missing kfree in ctr error path Insert missing kfree() in crypt_iv_essiv_ctr() error path. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:01 +01:00
Dmitry Monakhov	55b42c5ae9	dm crypt: drop device ref in ctr error path Add a missing 'dm_put_device' in an error path in crypt target constructor. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:00:59 +01:00
Milan Broz	027d50f92e	dm io:ctl use constant struct size Make size of dm_ioctl struct always 312 bytes on all supported architectures. This change retains compatibility with already-compiled code because it uses an embedded offset to locate the payload that follows the structure. On 64-bit architectures there is no change at all; on 32-bit we are increasing the size of dm-ioctl from 308 to 312 bytes. Currently with 32-bit userspace / 64-bit kernel on x86_64 some ioctls (including rename, message) are incorrectly rejected by the comparison against 'param + 1'. This breaks userspace lvrename and multipath 'fail_if_no_path' changes, for example. (BTW Device-mapper uses its own versioning and ignores the ioctl size bits. Only the generic ioctl compat code on mixed arches checks them, and that will continue to accept both sizes for now, but we intend to list 308 as deprecated and eventually remove it.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Guido Guenther <agx@sigxcpu.org> Cc: Kevin Corry <kevcorry@us.ibm.com> Cc: stable@kernel.org	2007-10-20 02:00:58 +01:00
Bryn M. Reeves	c7ac86de6a	dm mpath: rdac fix init race Re-order the initialisation of dm-rdac to avoid registering the hw handler before the workqueue has been initialised. Closes a race that would potentially give an oops. Signed-off-by: Bryn M. Reeves <breeves@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:00:57 +01:00
Jan Engelhardt	96de0e252c	Convert files to UTF-8 and some cleanups * Convert files to UTF-8. * Also correct some people's names (one example is Eißfeldt, which was found in a source file. Given that the author used an ß at all in a source file indicates that the real name has in fact a 'ß' and not an 'ss', which is commonly used as a substitute for 'ß' when limited to 7bit.) * Correct town names (Goettingen -> Göttingen) * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313) Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:21:04 +02:00
Robert P. J. Day	3a4fa0a25d	Fix misspellings of "system", "controller", "interrupt" and "necessary". Fix the various misspellings of "system", controller", "interrupt" and "[un]necessary". Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:10:43 +02:00
Pavel Emelyanov	ba25f9dcc4	Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
NeilBrown	cf7a44168d	md: make sure read errors are auto-corrected during a 'check' resync in raid1 Whenever a read error is found, we should attempt to overwrite with correct data to 'fix' it. However when do a 'check' pass (which compares data blocks that are successfully read, but doesn't normally overwrite) we don't do that. We should. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Iustin Pop	d7f3d291a0	md: expose the degraded status of an assembled array through sysfs The 'degraded' attribute is useful to quickly determine if the array is degraded, instead of parsing 'mdadm -D' output or relying on the other techniques (number of working devices against number of defined devices, etc.). The md code already keeps track of this attribute, so it's useful to export it. Signed-off-by: Iustin Pop <iusty@k1024.org> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
NeilBrown	2b12ab6d33	md: 'sync_action' in sysfs returns wrong value for readonly arrays When an array is started read-only, MD_RECOVERY_NEEDED can be set but no recovery will be running. This causes 'sync_action' to report the wrong value. We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a small gap after requesting a sync action, where 'sync_action' would still report the old value. So make sure that for a read-only array, 'sync_action' always returns 'idle'. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
NeilBrown	8299d7f7c0	md: fix a bug in some never-used code. http://bugzilla.kernel.org/show_bug.cgi?id=3277 There is a seq_printf here that isn't being passed a 'seq'. Howeve as the code is inside #ifdef MD_DEBUG, nobody noticed. Also remove some extra spaces. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Michael J. Evans	4d936ec1fd	md: software Raid autodetect dev list not array In current release kernels the md module (Software RAID) uses a static array (dev_t[128]) to store partition/device info temporarily for autostart. I discovered this (and that the devices are added as disks/partitions are discovered at boot) while I was debugging why only one of my MD arrays would come up whole, while all the others were short a disk. I eventually discovered that it was enumerating through all of 9 of my 11 hds (2 had only 4 partitions apiece) while the other 9 have 15 partitions (I wanted 64 per drive...). The last partition of the 8th drive in my 9 drive raid 5 sets wasn't added, thus making the final md array short both a parity and data disk, and it was started later, elsewhere. This patch replaces that static array with a list. [akpm@linux-foundation.org: removed unused var] Signed-off-by: Michael J. Evans <mjevans1983@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Neil Brown	644bd2f048	Fix memory leak in dm-crypt dm-crypt used the ->bi_size member in the bio endio handling to free the appropriate pages, but it frees all of it from both call paths. With the ->bi_end_io() changes, ->bi_size was always 0 since we don't do partial completes. This caused dm-crypt to leak memory. Fix this by removing the size argument from crypt_free_buffer_pages(). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-16 13:48:46 +02:00
Jens Axboe	fd5d806266	block: convert blkdev_issue_flush() to use empty barriers Then we can get rid of ->issue_flush_fn() and all the driver private implementations of that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-16 11:05:02 +02:00
Geert Uytterhoeven	0dddfd46f0	dm: emc_endio returns void emc_endio returns void: linux/drivers/md/dm-emc.c: In function 'emc_endio': linux/drivers/md/dm-emc.c:58: warning: 'return' with a value, in function returning void Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-13 09:41:03 -07:00
Greg Kroah-Hartman	19c38de88a	kobjects: fix up improper use of the kobject name field A number of different drivers incorrect access the kobject name field directly. This is not correct as the name might not be in the array. Use the proper accessor function instead.	2007-10-12 14:51:02 -07:00
NeilBrown	6712ecf8f6	Drop 'size' argument from bio_endio and bi_end_io As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:57 +02:00
NeilBrown	66846572bf	Stop exporting blk_rq_bio_prep blk_rq_bio_prep is exported for use in exactly one place. That place can benefit from using the new blk_rq_append_bio instead. So - change dm-emc to call blk_rq_append_bio - stop exporting blk_rq_bio_prep, and - initialise rq_disk in blk_rq_bio_prep, as dm-emc needs it. Signed-off-by: Neil Brown <neilb@suse.de> diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:56 +02:00
Dan Williams	e4d84909dd	raid5: fix 2 bugs in ops_complete_biofill 1/ ops_complete_biofill tried to avoid calling handle_stripe since all the state necessary to return read completions is available. However the process of determining whether more read requests are pending requires locking the stripe (to block add_stripe_bio from updating dev->toead). ops_complete_biofill can run in tasklet context, so rather than upgrading all the stripe locks from spin_lock to spin_lock_bh this patch just unconditionally reschedules handle_stripe after completing the read request. 2/ ops_complete_biofill needlessly qualified processing R5_Wantfill with dev->toread. The result being that the 'biofill' pending bit is cleared before handling the pending read-completions on dev->read. R5_Wantfill can be unconditionally handled because the 'biofill' pending bit prevents new R5_Wantfill requests from being seen by ops_run_biofill and ops_complete_biofill. Found-by: Yuri Tikhonov <yur@emcraft.com> [neilb@suse.de: simpler fix for bug 1 than moving code] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-09-24 13:23:35 -07:00
aherrman@arcor.de	2123a09f3f	Fix kernel buuild with (CONFIG_COMPAT && ! CONFIG_BLOCK) Commit `02a5e0acb3` ("BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n") broke the kernel build for the CONFIG_COMPAT && !CONFIG_BLOCK case: CC fs/compat_ioctl.o In file included from include/linux/raid/md_k.h:19, from include/linux/raid/md.h:54, from fs/compat_ioctl.c:25: include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:40: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:48: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h:51: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:64: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_merge_: include/linux/raid/../../../drivers/md/dm-bio-list.h:78: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:90: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h:94: error: dereferencing pointer to incomplete type make[1]: * [fs/compat_ioctl.o] Error 1 make: * [fs] Error 2 Signed-off-by: Andreas Herrmann <aherrman@arcor.de> Acked-By: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-14 13:56:47 -07:00
NeilBrown	a2e0855182	md: fix some bugs with growing raid5/raid6 arrays. The recent changed to raid5 to allow offload of parity calculation etc introduced some bugs in the code for growing (i.e. adding a disk to) raid5 and raid6. This fixes them Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-11 17:21:19 -07:00
Andrew Vasquez	f99ba18a96	dm-mpath-rdac: don't stomp on a requests transfer bit Without this, we get qla2xxx complaining about "ISP System Error". What's happening here is the firmware is detecting a Xfer-ready from the storage when in fact the data-direction for a mode-select should be a write (DATA_OUT). The following patch fixes the problem (typo). Verified by Brian, as well. Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com> Verified-by: Brian De Wolf <bldewolf@csupomona.edu> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-27 16:15:44 -07:00
Randy Dunlap	6e106b0d97	DM_MULTIPATH_RDAC: "scsi_normalize_sense" undefined DM_MULTIPATH_RDAC uses SCSI API(s) and is for a SCSI device, so add SCSI to its depends on to prevent build errors. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> [ Tested and Verified by Chandra Seetharaman ] Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-24 16:10:39 -07:00
NeilBrown	a88aa7865b	md: correctly update sysfs when a raid1 is reshaped When a raid1 array is reshaped (number of drives changed), the list of devices is compacted, so that slots for missing devices are filled with working devices from later slots. This requires the "rd%d" symlinks in sysfs to be updated. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
NeilBrown	918f02383f	md: make sure a re-add after a restart honours bitmap when resyncing Commit `1757128438` was slightly bad. If an array has a write-intent bitmap, and you remove a drive, then readd it, only the changed parts should be resynced. However after the above commit, this only works if the array has not been shut down and restarted. This is because it sets 'fullsync' at little more often than it should. This patch is more careful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
Alan D. Brunelle	c7149d6bce	Fix remap handling by blktrace This patch provides more information concerning REMAP operations on block IOs. The additional information provides clearer details at the user level, and supports post-processing analysis in btt. o Adds in partition remaps on the same device. o Fixed up the remap information in DM to be in the right order o Sent up mapped-from and mapped-to device information Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-08-11 22:34:48 +02:00
Arne Redlich	f6f953aa99	md: handle writes to broken raid10 arrays gracefully When writing to a broken array, raid10 currently happily emits empty bio lists. IOW, the master bio will never be completed, sending writers to UNINTERRUPTIBLE_SLEEP forever. Signed-off-by: Arne Redlich <agr@powerkom-dd.de> Acked-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-31 15:39:38 -07:00
Maik Hampel	14e713446a	md: raid10: fix use-after-free of bio In case of read errors raid10d tries to print a nice error message, unfortunately using data from an already put bio. Signed-off-by: Maik Hampel <m.hampel@gmx.de> Acked-By: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-31 15:39:38 -07:00
Jens Axboe	165125e1e4	[BLOCK] Get rid of request_queue_t typedef Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweet of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-24 09:28:11 +02:00
Milan Broz	80b16c192e	dm io: fix panic on large request Flush workqueue before releasing bioset and mopools in dm-crypt. There can be finished but not yet released request. Call chain causing oops: run workqueue dec_pending bio_endio(...); <remove device request - remove mempool> mempool_free(io, cc->io_pool); This usually happens when cryptsetup create temporary luks mapping in the beggining of crypt device activation. When dm-core calls destructor crypt_dtr, no new request are possible. Signed-off-by: Milan Broz <mbroz@redhat.com> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Alasdair G Kergon <agk@redhat.com> Cc: Christophe Saout <christophe@saout.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-21 17:49:14 -07:00
Dan Williams	eb0645a8b1	async_tx: fix kmap_atomic usage in async_memcpy Andrew Morton: [async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap slot for both src add dest and we get either corrupted data or a BUG. Evgeniy Polyakov: Btw, shouldn't it always be kmap_atomic() even if flag is not set. That pages are usual one returned by alloc_page(). So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC flags. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-20 08:44:19 -07:00
Paul Mundt	20c2df83d2	mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's `c59def9f22` change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2007-07-20 10:11:58 +09:00
Yoann Padioleau	dd00cc486a	some kmalloc/memset ->kzalloc (tree wide) Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc). Here is a short excerpt of the semantic patch performing this transformation: @@ type T2; expression x; identifier f,fld; expression E; expression E1,E2; expression e1,e2,e3,y; statement S; @@ x = - kmalloc + kzalloc (E1,E2) ... when != $x->fld=E;\\|y=f(...,x,...);\\|f(...,x,...);\\|x=E;\\|while(...) S\\|for(e1;e2;e3) S$ - memset((T2)x,0,E1); @@ expression E1,E2,E3; @@ - kzalloc(E1 * E2,E3) + kcalloc(E1,E2,E3) [akpm@linux-foundation.org: get kcalloc args the right way around] Signed-off-by: Yoann Padioleau <padator@wanadoo.fr> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Acked-by: Russell King <rmk@arm.linux.org.uk> Cc: Bryan Wu <bryan.wu@analog.com> Acked-by: Jiri Slaby <jirislaby@gmail.com> Cc: Dave Airlie <airlied@linux.ie> Acked-by: Roland Dreier <rolandd@cisco.com> Cc: Jiri Kosina <jkosina@suse.cz> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Acked-by: Pierre Ossman <drzeus-list@drzeus.cx> Cc: Jeff Garzik <jeff@garzik.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Greg KH <greg@kroah.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:50 -07:00
Jesper Juhl	851a8a7fd4	dm: fix memory leak in dm_create_persistent() when starting metadata update thread fails If, in dm_create_persistent(), the call to create_singlethread_workqueue() fails then we'll return without freeing the memory allocated to 'ps', thus leaking sizeof(struct pstore) bytes. This patch fixes the leak. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-18 08:38:22 -07:00
NeilBrown	4ad1366376	md: change bitmap_unplug and others to void functions bitmap_unplug only ever returns 0, so it may as well be void. Two callers try to print a message if it returns non-zero, but that message is already printed by bitmap_file_kick. write_page returns an error which is not consistently checked. It always causes BITMAP_WRITE_ERROR to be set on an error, and that can more conveniently be checked. When the return of write_page is checked, an error causes bitmap_file_kick to be called - so move that call into write_page - and protect against recursive calls into bitmap_file_kick. bitmap_update_sb returns an error that is never checked. So make these 'void' and be consistent about checking the bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	f0d76d70bc	md: check that internal bitmap does not overlap other data We current completely trust user-space to set up metadata describing an consistant array. In particlar, that the metadata, data, and bitmap do not overlap. But userspace can be buggy, and it is better to report an error than corrupt data. So put in some appropriate checks. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	713f6ab18b	md: improve the is_mddev_idle test fix Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing we want to do with them is a signed comparison, and fix up the comment which had become quite wrong. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	df968c4e8d	md: improve message about invalid superblock during autodetect People try to use raid auto-detect with version-1 superblocks (which is not supported) and get confused when they are told they have an invalid superblock. So be more explicit, and say it it is not a valid v0.90 superblock. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
Jan Engelhardt	afd44034ac	Use menuconfig objects II - MD Change Kconfig objects from "menu, config" into "menuconfig" so that the user can disable the whole feature without having to enter the menu first. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
Akinobu Mita	00d59405cf	unregister_blkdev() delete redundant messages in callers No need to warn unregister_blkdev() failure by the callers. (The previous patch makes unregister_blkdev() print error message in error case) Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Rafael J. Wysocki	8314418629	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Linus Torvalds	e030dbf91a	Merge branch 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop * 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop: (28 commits) ioatdma: add the unisys "i/oat" pci vendor/device id ARM: Add drivers/dma to arch/arm/Kconfig iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver iop13xx: surface the iop13xx adma units to the iop-adma driver dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines md: remove raid5 compute_block and compute_parity5 md: handle_stripe5 - request io processing in raid5_run_ops md: handle_stripe5 - add request/completion logic for async expand ops md: handle_stripe5 - add request/completion logic for async read ops md: handle_stripe5 - add request/completion logic for async check ops md: handle_stripe5 - add request/completion logic for async compute ops md: handle_stripe5 - add request/completion logic for async write ops md: common infrastructure for running operations with raid5_run_ops md: raid5_run_ops - run stripe operations outside sh->lock raid5: replace custom debug PRINTKs with standard pr_debug raid5: refactor handle_stripe5 and handle_stripe6 (v3) async_tx: add the async_tx api xor: make 'xor_blocks' a library routine for use with async_tx dmaengine: make clients responsible for managing channels dmaengine: refactor dmaengine around dma_async_tx_descriptor ...	2007-07-13 10:52:27 -07:00
Dan Williams	f6dff381af	md: remove raid5 compute_block and compute_parity5 replaced by raid5_run_ops Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:18 -07:00
Dan Williams	830ea01673	md: handle_stripe5 - request io processing in raid5_run_ops I/O submission requests were already handled outside of the stripe lock in handle_stripe. Now that handle_stripe is only tasked with finding work, this logic belongs in raid5_run_ops. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	f0a50d3754	md: handle_stripe5 - add request/completion logic for async expand ops When a stripe is being expanded bulk copying takes place to move the data from the old stripe to the new. Since raid5_run_ops only operates on one stripe at a time these bulk copies are handled in-line under the stripe lock. In the dma offload case we poll for the completion of the operation. After the data has been copied into the new stripe the parity needs to be recalculated across the new disks. We reuse the existing postxor functionality to carry out this calculation. By setting STRIPE_OP_POSTXOR without setting STRIPE_OP_BIODRAIN the completion path in handle stripe can differentiate expand operations from normal write operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	b5e98d65d3	md: handle_stripe5 - add request/completion logic for async read ops When a read bio is attached to the stripe and the corresponding block is marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy the data from the stripe cache to the bio buffer. handle_stripe flags the blocks to be operated on with the R5_Wantfill flag. If new read requests arrive while raid5_run_ops is running they will not be handled until handle_stripe is scheduled to run again. Changelog: * cleanup to_read and to_fill accounting * do not fail reads that have reached the cache Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	e89f89629b	md: handle_stripe5 - add request/completion logic for async check ops Check operations are scheduled when the array is being resynced or an explicit 'check/repair' command was sent to the array. Previously check operations would destroy the parity block in the cache such that even if parity turned out to be correct the parity block would be marked !R5_UPTODATE at the completion of the check. When the operation can be carried out by a dma engine the assumption is that it can check parity as a read-only operation. If raid5_run_ops notices that the check was handled by hardware it will preserve the R5_UPTODATE status of the parity disk. When a check operation determines that the parity needs to be repaired we reuse the existing compute block infrastructure to carry out the operation. Repair operations imply an immediate write back of the data, so to differentiate a repair from a normal compute operation the STRIPE_OP_MOD_REPAIR_PD flag is added. Changelog: * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	f38e12199a	md: handle_stripe5 - add request/completion logic for async compute ops handle_stripe will compute a block when a backing disk has failed, or when it determines it can save a disk read by computing the block from all the other up-to-date blocks. Previously a block would be computed under the lock and subsequent logic in handle_stripe could use the newly up-to-date block. With the raid5_run_ops implementation the compute operation is carried out a later time outside the lock. To preserve the old functionality we take advantage of the dependency chain feature of async_tx to flag the block as R5_Wantcompute and then let other parts of handle_stripe operate on the block as if it were up-to-date. raid5_run_ops guarantees that the block will be ready before it is used in another operation. However, this only works in cases where the compute and the dependent operation are scheduled at the same time. If a previous call to handle_stripe sets the R5_Wantcompute flag there is no facility to pass the async_tx dependency chain across successive calls to raid5_run_ops. The req_compute variable protects against this case. Changelog: * remove the req_compute BUG_ON Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	e33129d841	md: handle_stripe5 - add request/completion logic for async write ops After handle_stripe5 decides whether it wants to perform a read-modify-write, or a reconstruct write it calls handle_write_operations5. A read-modify-write operation will perform an xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the new data into the stripe (biodrain) and perform a postxor operation across all up-to-date blocks to generate the new parity. A reconstruct write is run when all blocks are already up-to-date in the cache so all that is needed is a biodrain and postxor. On the completion path STRIPE_OP_PREXOR will be set if the operation was a read-modify-write. The STRIPE_OP_BIODRAIN flag is used in the completion path to differentiate write-initiated postxor operations versus expansion-initiated postxor operations. Completion of a write triggers i/o to the drives. Changelog: * make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:16 -07:00
Dan Williams	d84e0f10d3	md: common infrastructure for running operations with raid5_run_ops All the handle_stripe operations that are to be transitioned to use raid5_run_ops need a method to coherently gather work under the stripe-lock and hand that work off to raid5_run_ops. The 'get_stripe_work' routine runs under the lock to read all the bits in sh->ops.pending that do not have the corresponding bit set in sh->ops.ack. This modified 'pending' bitmap is then passed to raid5_run_ops for processing. The transition from 'ack' to 'completion' does not need similar protection as the existing release_stripe infrastructure will guarantee that handle_stripe will run again after a completion bit is set, and handle_stripe can tolerate a sh->ops.completed bit being set while the lock is held. A call to async_tx_issue_pending_all() is added to raid5d to kick the offload engines once all pending stripe operations work has been submitted. This enables batching of the submission and completion of operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:16 -07:00
Dan Williams	91c0092484	md: raid5_run_ops - run stripe operations outside sh->lock When the raid acceleration work was proposed, Neil laid out the following attack plan: 1/ move the xor and copy operations outside spin_lock(&sh->lock) 2/ find/implement an asynchronous offload api The raid5_run_ops routine uses the asynchronous offload api (async_tx) and the stripe_operations member of a stripe_head to carry out xor+copy operations asynchronously, outside the lock. To perform operations outside the lock a new set of state flags is needed to track new requests, in-flight requests, and completed requests. In this new model handle_stripe is tasked with scanning the stripe_head for work, updating the stripe_operations structure, and finally dropping the lock and calling raid5_run_ops for processing. The following flags outline the requests that handle_stripe can make of raid5_run_ops: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate a missing block in the cache from the other blocks STRIPE_OP_PREXOR - subtract existing data as part of the read-modify-write process STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct STRIPE_OP_IO - submit i/o to the member disks (note this was already performed outside the stripe lock, but it made sense to add it as an operation type The flow is: 1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending 2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the operation to the async_tx api 3/ async_tx triggers the completion callback routine to set sh->ops.complete and release the stripe 4/ handle_stripe runs again to finish the operation and optionally submit new operations that were previously blocked Note this patch just defines raid5_run_ops, subsequent commits (one per major operation type) modify handle_stripe to take advantage of this routine. Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the raid5_run_ops workqueue * call bi_end_io for reads in ops_complete_biofill, saves a call to handle_stripe * explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown * fix race between async engines and bi_end_io call for reads, Neil Brown * remove unnecessary spin_lock from ops_complete_biofill * remove test_and_set/test_and_clear BUG_ONs, Neil Brown * remove explicit interrupt handling for channel switching, this feature was absorbed (i.e. it is now implicit) by the async_tx api * use return_io in ops_complete_biofill Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	45b4233caa	raid5: replace custom debug PRINTKs with standard pr_debug Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in favor of the global DEBUG definition. To get local debug messages just add '#define DEBUG' to the top of the file. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	a445685647	raid5: refactor handle_stripe5 and handle_stripe6 (v3) handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes: * v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path * v3: removed the unused 'dirty' field from struct stripe_head_state * v3: coalesced open coded bi_end_io routines into return_io() Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	9bc89cd82d	async_tx: add the async_tx api The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources. I imagine that any piece of ADMA hardware would register with the 'async_' subsystem, and a call to async_X would be routed as appropriate, or be run in-line. - Neil Brown async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format: struct dma_async_tx_descriptor async_<operation>(..., struct dma_async_tx_descriptor depend_tx, dma_async_tx_callback cb_fn, void cb_param) { struct dma_chan chan = async_tx_find_channel(depend_tx, <operation>); struct dma_device device = chan ? chan->device : NULL; int int_en = cb_fn ? 1 : 0; struct dma_async_tx_descriptor tx = device ? device->device_prep_dma_<operation>(chan, len, int_en) : NULL; if (tx) { / run <operation> asynchronously / ... tx->tx_set_dest(addr, tx, index); ... tx->tx_set_src(addr, tx, index); ... async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param); } else { / run <operation> synchronously / ... <operation> ... async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param); } return tx; } async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. A xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and a xor resource. Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines. On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On a x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s. The tiobench command line used for testing was: tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5 iop342 had 1GB of memory available Details: * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL * when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. if the the channel does not support interrupts then a live polling wait will be performed * the api is written as a dmaengine client that requests all available channels * In support of dependencies the api implicitly schedules channel-switch interrupts. The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel * Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take a an explicit destination address to mirror the hardware. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST * drop dma mapping methods, suggested by Chris Leech * printk warning fixups from Andrew Morton * don't use inline in C files, Adrian Bunk * select the API when MD is enabled * BUG_ON xor source counts <= 1 * implicitly handle hardware concerns like channel switching and interrupts, Neil Brown * remove the per operation type list, and distribute operation capabilities evenly amongst the available channels * simplify async_tx_find_channel to optimize the fast path * introduce the channel_table_initialized flag to prevent early calls to the api * reorganize the code to mimic crypto * include mm.h as not all archs include it in dma-mapping.h * make the Kconfig options non-user visible, Adrian Bunk * move async_tx under crypto since it is meant as 'core' functionality, and the two may share algorithms in the future * move large inline functions into c files * checkpatch.pl fixes * gpl v2 only correction Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:14 -07:00
Dan Williams	685784aaf3	xor: make 'xor_blocks' a library routine for use with async_tx The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <bunk@stusta.de> Cc: NeilBrown <neilb@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-07-13 08:06:14 -07:00
Chandra Seetharaman	dd172d72ad	dm mpath: rdac This patch supports LSI/Engenio devices in RDAC mode. Like dm-emc it requires userspace support. In your multipath.conf file you must have: path_checker rdac hardware_handler "1 rdac" prio_callout "/sbin/mpath_prio_tpc /dev/%n" And you also then must have a updated multipath tools release which has rdac support. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:23 -07:00
Jonathan Brassow	fc1ff9588a	dm raid1: handle log failure When writing to a mirror, the log must be updated first. Failure to update the log could result in the log not properly reflecting the state of the mirror if the machine should crash. We change the return type of the rh_flush function to give us the ability to check if a log write was successful. If the log write was unsuccessful, we fail the writes to avoid the case where the log does not properly reflect the state of the mirror. A follow-up patch - which is dependent on the ability to requeue I/O's to core device-mapper - will requeue the I/O's for retry (allowing the mirror to be reconfigured.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	f44db678ed	dm raid1: handle resync failures Device-mapper mirroring currently takes a best effort approach to recovery - failures during mirror synchronization are completely ignored. This means that regions are marked 'in-sync' and 'clean' and removed from the hash list. Future reads and writes that query the region will incorrectly interpret the region as in-sync. This patch handles failures during the recovery process. If a failure occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added to a new list 'failed_recovered_regions'. Regions on the 'failed_recovered_regions' list are not marked as 'clean' upon removal from the list. Furthermore, if the DM_RAID1_HANDLE_ERRORS flag is set, the region is marked as 'not-in-sync'. This action prevents any future read-balancing from choosing an invalid device because of the 'not-in-sync' status. If "handle_errors" is not specified when creating a mirror (leaving the DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they would be without this patch. This is to preserve backwards compatibility with user-space tools, such as 'pvmove'. However, since future read-balancing policies will rely on the correct sync status of a region, a user must choose "handle_errors" when using read-balancing. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	d0d444c7d4	dm: add ratelimit logging macros Add ratelimit extension to dm logging macros. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Stefan Bader	07a83c47cf	dm: disable barriers This patch causes device-mapper to reject any barrier requests. This is done since most of the targets won't handle this correctly anyway. So until the situation improves it is better to reject these requests at the first place. Since barrier requests won't get to the targets, the checks there can be removed. Cc: stable@kernel.org Signed-off-by: Stefan Bader <shbader@de.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	943317efdb	dm raid1: clear region outside spinlock A clear_region function is permitted to block (in practice, rare) but gets called in rh_update_states() with a spinlock held. The bits being marked and cleared by the above functions are used to update the on-disk log, but are never read directly. We can perform these operations outside the spinlock since the bits are only changed within one thread viz. - mark_region in rh_inc() - clear_region in rh_update_states(). So, we grab the clean_regions list items via list_splice() within the spinlock and defer clear_region() until we iterate over the list for deletion - similar to how the recovered_regions list is already handled. We then move the flush() call down to ensure it encapsulates the changes which are done by the later calls to clear_region(). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	0764147b11	dm snapshot: permit invalid activation Allow invalid snapshots to be activated instead of failing. This allows userspace to reinstate any given snapshot state - for example after an unscheduled reboot - and clean up the invalid snapshot at its leisure. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	fcac03abd3	dm snapshot: fix invalidation deadlock Process persistent exception store metadata IOs in a separate thread. A snapshot may become invalid while inside generic_make_request(). A synchronous write is then needed to update the metadata while still inside that function. Since the introduction of md-dm-reduce-stack-usage-with-stacked-block-devices.patch this has to be performed by a separate thread to avoid deadlock. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jun'ichi Nomura	596f138eed	dm io: fix panic on large request bio_alloc_bioset() will return NULL if 'num_vecs' is too large. Use bio_get_nr_vecs() to get estimation of maximum number. Cc: stable@kernel.org Signed-off-by: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	c95bc206da	dm raid1: fix status Fix mirror status line broken in dm-log-report-fault-status.patch: - space missing between two words - placeholder ("0") required for compatibility with a subsequent patch - incorrect offset parameter Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	0cd3312434	dm: remove duplicate module name from error msgs Remove explicit module name from messages as the macro now includes it automatically. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	ac818646d4	dm delay: cleanup Use setup_timer(). Replace semaphore with mutex. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	028867ac28	dm: use kmem_cache macro Use new KMEM_CACHE() macro and make the newly-exposed structure names more meaningful. Also remove some superfluous casts and inlines (let a modern compiler be the judge). Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Alasdair G Kergon	79e15ae424	dm: bio_list prefetch removal Remove dubious prefetch from bio_list_for_each() macro. Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Mike Accetta	ed45666271	md: fix bug in error handling during raid1 repair If raid1/repair (which reads all block and fixes any differences it finds) hits a read error, it doesn't reset the bio for writing before writing correct data back, so the read error isn't fixed, and the device probably gets a zero-length write which it might complain about. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
NeilBrown	af03b8e4e8	md: fix two raid10 bugs 1/ When resyncing a degraded raid10 which has more than 2 copies of each block, garbage can get synced on top of good data. 2/ We round the wrong way in part of the device size calculation, which can cause confusion. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
NeilBrown	a778b73ff7	md: fix bug with linear hot-add and elsewhere Adding a drive to a linear array seems to have stopped working, due to changes elsewhere in md, and insufficient ongoing testing... So the patch to make linear hot-add work in the first place introduced a subtle bug elsewhere that interracts poorly with older version of mdadm. This fixes it all up. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	ab6085c795	md: don't write more than is required of the last page of a bitmap It is possible that real data or metadata follows the bitmap without full page alignment. So limit the last write to be only the required number of bytes, rounded up to the hard sector size of the device. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	787f17feb2	md: avoid overflow in raid0 calculation with large components If a raid0 has a component device larger than 4TB, and is accessed on a 32bit machines, then as 'chunk' is unsigned long, chunk << chunksize_bits can overflow (this can be as high as the size of the device in KB). chunk itself will not overflow (without triggering a BUG). So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes impossible to hit. Cc: "Jeff Zheng" <Jeff.Zheng@endace.com> Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:14 -07:00
NeilBrown	435b71be20	md: improve the is_mddev_idle test During a 'resync' or similar activity, md checks if the devices in the array are otherwise active and winds back resync activity when they are. This test in done in is_mddev_idle, and it is somewhat fragile - it sometimes thinks there is non-sync io when there isn't. The test compares the total sectors of io (disk_stat_read) with the sectors of resync io (disk->sync_io). This has problems because total sectors gets updated when a request completes, while resync io gets updated when the request is submitted. The time difference can cause large differenced between the two which do not actually imply non-resync activity. The test currently allows for some fuzz (+/- 4096) but there are some cases when it is not enough. The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive differences. If we do this then the amount of in-flight sync io will never cause the appearance of non-sync IO. Once enough non-sync IO to worry about starts happening, resync will be slowed down and the measurements will thus be more precise (as there is less in-flight) and control of resync will still be suitably responsive. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-11 08:29:37 -07:00
NeilBrown	dd00a99e7a	md: avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem. When a raid1 has only one working drive, we want read error to propagate up to the filesystem as there is no point failing the last drive in an array. Currently the code perform this check is racy. If a write and a read a both submitted to a device on a 2-drive raid1, and the write fails followed by the read failing, the read will see that there is only one working drive and will pass the failure up, even though the one working drive is actually the other one. So, tighten up the locking. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-10 09:26:53 -07:00
Linus Torvalds	44ce6294d0	Revert "md: improve partition detection in md array" This reverts commit `5b479c91da`. Quoth Neil Brown: "It causes an oops when auto-detecting raid arrays, and it doesn't seem easy to fix. The array may not be 'open' when do_md_run is called, so bdev->bd_disk might be NULL, so bd_set_size can oops. This whole approach of opening an md device before it has been assembled just seems to get more and more painful. I think I'm going to have to come up with something clever to provide both backward comparability with usage expectation, and sane integration into the rest of the kernel." Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 18:51:36 -07:00
NeilBrown	5b479c91da	md: improve partition detection in md array md currently uses ->media_changed to make sure rescan_partitions is call on md array after they are assembled. However that doesn't happen until the array is opened, which is later than some people would like. So use blkdev_ioctl to do the rescan immediately that the array has been assembled. This means we can remove all the ->change infrastructure as it was only used to trigger a partition rescan. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	08a02ecd28	md: allow reshape_position for md arrays to be set via sysfs "reshape_position" records how much progress has been made on a "reshape" (adding drives, changing layout or chunksize). When it is set, the number of drives, layout and chunksize can have two possible values, an old an a new. So allow these different values to be visible, and allow both old and new to be set: Set the old ones first, then the reshape_position, then the new values. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	42b9bebe3f	md: remove the slash from the name of a kmem_cache used by raid5 SLUB doesn't like slashes as it wants to use the cache name as the name of a directory (or symlink) in sysfs. Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	4d167f0937	md: stop using csum_partial for checksum calculation in md If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot use it. We shouldn't really be using csum_partial anyway as it is an internal-to-networking interface. So replace it with C code to do the same thing. Speed is not crucial here, so something simple and correct is best. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
NeilBrown	e11e93facc	md: move test for whether level supports bitmap to correct place We need to check for internal-consistency of superblock in load_super. validate_super is for inter-device consistency. With the test in the wrong place, a badly created array will confuse md rather an produce sensible errors. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Martin Peschke	c3f94b40e1	md: cleanup: use seq_release_private() where appropriate We can save some lines of code by using seq_release_private(). Signed-off-by: Martin Peschke <mp3@de.ibm.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Ahmed S. Darwish	50511da3da	drivers/md.c: Use ARRAY_SIZE macro when appropriate Use ARRAY_SIZE macro already defined in kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:57 -07:00
Jonathan Brassow	ba8b45cea5	dm log: fix resume failed log device This patch removes the possibility of having uninitialized log state if the log device has failed. When a mirror resumes operation, it calls 'resume' on the logging module. If disk based logging is being used, the log device is read to fill in the log state. If the log device has failed, we cannot simply return, because this would leave the in-memory log state uninitialized. Instead, we assume all regions are out-of-sync and reset the log state. Failure to do this could result in the logging code reporting a region as in-sync, even though it isn't; which could result in a corrupted mirror. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Jonathan Brassow	b997b82d26	dm raid1: switch rh_in_sync to blocking in do_reads The call to rh_in_sync() in do_reads() should be allowed to block. It is in the mirror worker thread which already permits blocking operations. This will be needed to support clustered mirroring which will perform network operations. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Jonathan Brassow	f5353cd7c9	dm raid1: fix to commit pending clear region requests With the code as it is, it is possible for oustanding clear region requests never to get flushed when a mirror is deactivated or suspended. This means there will always be some resync work required when a mirror is activated, even though it may very well be in-sync. Always requesting the flush doesn't hurt us. This is because the log tracks whether any changes occurred and, if not, no flush is performed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:48 -07:00
Heinz Mauelshagen	26b9f22870	dm: delay target New device-mapper target that can delay I/O (for testing). Reads can be separated from writes, redirected to different underlying devices and delayed by differing amounts of time. Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	0ba699347e	dm: bio list helpers More bio_list helper functions for new targets (including dm-delay and dm-loop) to manipulate lists of bios. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Bryn Reeves <breeves@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	bf17ce3a60	dm io: remove old interface Remove old dm-io interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	88be163abb	dm raid1: update dm io interface This patch ports dm-raid1.c to the new dm-io interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	5d234d1e03	dm log: update dm io interface This patch ports dm-log.c to the new dm-io interface in order to make it scalable to have a large number of persistent dirty logs active in parallel. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	6cca1e7af5	dm exception store: update dm io interface This patch ports dm-exception-store.c to the new, scalable dm_io() interface. It replaces dm_io_get()/dm_io_put() by dm_io_client_create()/dm_io_client_destroy() calls and dm_io_sync_vm() by dm_io() to achive this. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Milan Broz	373a392bd7	dm kcopyd: update dm io interface This patch ports kcopyd.c to the new, scalable dm_io() interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	c8b03afe3d	dm io: new interface Add a new API to dm-io.c that uses a private mempool and bio_set for each client. The new functions to use are dm_io_client_create(), dm_io_client_destroy(), dm_io_client_resize() and dm_io(). Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	891ce20701	dm io: prepare for new interface Introduce struct dm_io_client to prepare for per-client mempools and bio_sets. Temporary functions bios() and io_pool() choose between the per-client structures and the global ones so the old and new interfaces can co-exist. Make error_bits optional. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Heinz Mauelshagen	c897feb3dc	dm io: delay dec_count Delay decrementing the 'struct io' reference count until after the bio has been freed so that a bio destructor function may reference it. Required by a later patch. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	a8e6afa236	dm raid1: add handle_errors feature flag This patch adds the ability to specify desired features in the mirror constructor/mapping table. The first feature of interest is "handle_errors". Currently, mirroring will ignore any I/O errors from the devices. Subsequent patches will check for this flag and handle the errors. If flag/feature is not present, mirror will do nothing - maintaining backwards compatibility. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	315dcc226f	dm log: report fault status This patch reports the status of the log device so that userspace can detect the error and take appropriate action. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Jonathan E Brassow	01d03a660e	dm log: fault detection This patch gives the disk logging code the ability to store the fact that an error occured on the log device. In addition, an event is raised when an error is encountered during I/O to the log device. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Mike Anderson	2cd54d9bed	dm: allow offline devices Allow check_device_area to succeed if a device has an i_size of zero. This addresses an issue seen on DASD devices setting up a multipath table for paths in online and offline state. Signed-off-by: Mike Anderson <andmike@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:47 -07:00
Edward Goggin	79eb885c96	dm mpath: log device name Make the mapped device structure accessible to hardware handlers so error messages can include the device name. Signed-off-by: Edward Goggin <egoggin@emc.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Ludwig Nussel	46b477306a	dm crypt: add null iv Add a new IV generation method 'null' to read old filesystem images created with SuSE's loop_fish2 module. Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de> Acked-By: Christophe Saout <christophe@saout.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	f97380bcad	dm crypt: use smaller bvecs in clones Allocate smaller clones With the previous dm-crypt fixes, there is no need for the clone bios to have the same bvec size as the original - we just need to make them big enough for the remaining number of pages. The only requirement is that we clear the "out" index in convert_context, so that crypt_convert starts storing data at the right position within the clone bio. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	2f9941b6c5	dm crypt: fix remove first_clone Get rid of first_clone in dm-crypt This gets rid of first_clone, which is not really needed. Apparently, cloned bios used to share their bvec some time way in the past - this is no longer the case. Contrarily, this even hurts us if we try to create a clone off first_clone after it has completed, and crypt_endio has destroyed its bvec. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	98221eb757	dm crypt: fix avoid cloned bio ref after free Do not access the bio after generic_make_request We should never access a bio after generic_make_request - there's no guarantee it still exists. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Olaf Kirch	027581f351	dm crypt: fix call to clone_init Call clone_init early We need to call clone_init as early as possible - at least before call bio_put(clone) in any error path. Otherwise, the destructor will try to dereference bi_private, which may still be NULL. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Milan Broz	9c89f8be1a	dm crypt: disable barriers Disable barriers in dm-crypt because of current workqueue processing can reorder requests. This must be addresed later but for now disabling barriers is needed to prevent data corruption. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Holger Smolinski	6ad36fe2b4	dm raid1: one kmirrord per mirror This patch replaces the single instance of kmirrord by one instance per mirror set. This change is required to avoid a deadlock in kmirrord when the persistent dirty log of a mirror itself resides on a mirror. The single instance of kmirrord then issues a sync write to the dirty log in write_bits which gets deferred to kmirrord itself later in the call chain. But kmirrord never does the deferred work because it is still waiting for the sync write_bits. _mirror_sets is removed as it no longer needed, and we always flush the workqueue before destroying it to ensure all work is complete before destroying it. Signed-off-by: Holger Smolinski <smolinski@de.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:46 -07:00
Mark Fasheh	ef51c97623	Remove do_sync_file_range() Remove do_sync_file_range() and convert callers to just use do_sync_mapping_range(). Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:04 -07:00
Peter Zijlstra	f98393a64c	mm: remove destroy_dirty_buffers from invalidate_bdev() Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't been used in 6 years (so akpm says). find * -name \.[ch] \| xargs grep -l invalidate_bdev \| while read file; do quilt add $file; sed -ie 's/invalidate_bdev($[^,]$,[^)]*)/invalidate_bdev(\1)/g' $file; done Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Jens Axboe	5972511b77	[BLOCK] Don't pin lots of memory in mempools Currently we scale the mempool sizes depending on memory installed in the machine, except for the bio pool itself which sits at a fixed 256 entry pre-allocation. There's really no point in "optimizing" this OOM path, we just need enough preallocated to make progress. A single unit is enough, lets scale it down to 2 just to be on the safe side. This patch saves ~150kb of pinned kernel memory on a 32-bit box. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-04-30 09:08:17 +02:00
Neil Brown	505fa2c4a2	[PATCH] md: fix calculation for size of filemap_attr array in md/bitmap If 'num_pages' were ever 1 more than a multiple of 8 (32bit platforms) or of 16 (64 bit platforms). filemap_attr would be allocated one 'unsigned long' shorter than required. We need a round-up in there. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-04-12 15:31:42 -07:00
NeilBrown	5792a2856a	[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs A device can be removed from an md array via e.g. echo remove > /sys/block/md3/md/dev-sde/state This will try to remove the 'dev-sde' subtree which will deadlock since commit `e7b0d26a86` With this patch we run the kobject_del via schedule_work so as to avoid the deadlock. Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-04-04 21:12:47 -07:00
NeilBrown	5e55e2f5fc	[PATCH] md: convert compile time warnings into runtime warnings ... still not sure why we need this .... Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:15 -07:00
NeilBrown	041ae52e26	[PATCH] md: clear the congested_fn when stopping a raid5 If this mddev and queue got reused for another array that doesn't register a congested_fn, this function would get called incorretly. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:14 -07:00
NeilBrown	3d37890baa	[PATCH] md: allow raid4 arrays to be reshaped All that is missing the the function pointers in raid4_pers. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-27 09:05:14 -07:00
Andy Isaacson	bed31ed9e1	[PATCH] fix read past end of array in md/linear.c When iterating through an array, one must be careful to test one's index variable rather than another similarly-named variable. The loop will read off the end of conf->disks[] in the following (pathological) case: % dd bs=1 seek=840716287 if=/dev/zero of=d1 count=1 % for i in 2 3 4; do dd if=/dev/zero of=d$i bs=1k count=$(($i+150)); done % ./vmlinux ubd0=root ubd1=d1 ubd2=d2 ubd3=d3 ubd4=d4 # mdadm -C /dev/md0 --level=linear --raid-devices=4 /dev/ubd[1234] adding some printks, I saw this: [42949374.960000] hash_spacing = 821120 [42949374.960000] cnt = 4 [42949374.960000] min_spacing = 801 [42949374.960000] j=0 size=820928 sz=820928 [42949374.960000] i=0 sz=820928 hash_spacing=820928 [42949374.960000] j=1 size=64 sz=64 [42949374.960000] j=2 size=64 sz=128 [42949374.960000] j=3 size=64 sz=192 [42949374.960000] j=4 size=1515870810 sz=1515871002 Cc: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Neil Brown <neilb@cse.unsw.edu.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-16 19:25:03 -07:00
NeilBrown	6d3baf2eb8	[PATCH] md: fix for raid6 reshape Recent patch for raid6 reshape had a change missing that showed up in subsequent review. Many places in the raid5 code used "conf->raid_disks-1" to mean "number of data disks". With raid6 that had to be changed to "conf->raid_disk - conf->max_degraded" or similar. One place was missed. This bug means that if a raid6 reshape were aborted in the middle the recorded position would be wrong. On restart it would either fail (as the position wasn't on an appropriate boundary) or would leave a section of the array unreshaped, causing data corruption. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-05 07:57:53 -08:00
NeilBrown	f416885ef4	[PATCH] md: add support for reshape of a raid6 i.e. one or more drives can be added and the array will re-stripe while on-line. Most of the interesting work was already done for raid5. This just extends it to raid6. mdadm newer than 2.6 is needed for complete safety, however any version of mdadm which support raid5 reshape will do a good enough job in almost all cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a reshape that was aborted and had to be restarted with an such a version of mdadm). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	b4c4c7b809	[PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error An error always aborts any resync/recovery/reshape on the understanding that it will immediately be restarted if that still makes sense. However a reshape currently doesn't get restarted. With this patch it does. To avoid restarting when it is not possible to do work, we call into the personality to check that a reshape is ok, and strengthen raid5_check_reshape to fail if there are too many failed devices. We also break some code out into a separate function: remove_and_add_spares as the indent level for that code was getting crazy. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	d1b5380c7f	[PATCH] md: clean out unplug and other queue function on md shutdown The mddev and queue might be used for another array which does not set these, so they need to be cleared. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	7dd5e7c3db	[PATCH] md: move warning about creating a raid array on partitions of the one device md tries to warn the user if they e.g. create a raid1 using two partitions of the same device, as this does not provide true redundancy. However it also warns if a raid0 is created like this, and there is nothing wrong with that. At the place where the warning is currently printer, we don't necessarily know what level the array will be, so move the warning from the point where the device is added to the point where the array is started. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
H. Peter Anvin	a723406c4a	[PATCH] md: RAID6: clean up CPUID and FPU enter/exit code - Use kernel_fpu_begin() and kernel_fpu_end() - Use boot_cpu_has() for feature testing even in userspace Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
NeilBrown	64a742bc61	[PATCH] md: fix raid10 recovery problem. There are two errors that can lead to recovery problems with raid10 when used in 'far' more (not the default). Due to a '>' instead of '>=' the wrong block is located which would result in garbage being written to some random location, quite possible outside the range of the device, causing the newly reconstructed device to fail. The device size calculation had some rounding errors (it didn't round when it should) and so recovery would go a few blocks too far which would again cause a write to a random block address and probably a device error. The code for working with device sizes was fairly confused and spread out, so this has been tided up a bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-03-01 14:53:36 -08:00
Eric W. Biederman	0b4d414714	[PATCH] sysctl: remove insert_at_head from register_sysctl The semantic effect of insert_at_head is that it would allow new registered sysctl entries to override existing sysctl entries of the same name. Which is pain for caching and the proc interface never implemented. I have done an audit and discovered that none of the current users of register_sysctl care as (excpet for directories) they do not register duplicate sysctl entries. So this patch simply removes the support for overriding existing entries in the sys_sysctl interface since no one uses it or cares and it makes future enhancments harder. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Corey Minyard <minyard@acm.org> Cc: Neil Brown <neilb@suse.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jan Kara <jack@ucw.cz> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-14 08:09:59 -08:00
Eric W. Biederman	ff1d28efc5	[PATCH] sysctl: md: remove unnecessary insert_at_head flag The sysctls used by the md driver are have unique binary numbers so remove the insert_at_head flag as it serves no useful purpose. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-14 08:09:55 -08:00
Arjan van de Ven	fa027c2a0a	[PATCH] mark struct file_operations const 4 Many struct file_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. [akpm@sdl.org: dvb fix] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-12 09:48:45 -08:00
Andrew Morton	fc0ecff698	[PATCH] remove invalidate_inode_pages() Convert all calls to invalidate_inode_pages() into open-coded calls to invalidate_mapping_pages(). Leave the invalidate_inode_pages() wrapper in place for now, marked as deprecated. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-11 10:51:31 -08:00
Neil Brown	da6e1a32fb	[PATCH] md: avoid possible BUG_ON in md bitmap handling md/bitmap tracks how many active write requests are pending on blocks associated with each bit in the bitmap, so that it knows when it can clear the bit (when count hits zero). The counter has 14 bits of space, so if there are ever more than 16383, we cannot cope. Currently the code just calles BUG_ON as "all" drivers have request queue limits much smaller than this. However is seems that some don't. Apparently some multipath configurations can allow more than 16383 concurrent write requests. So, in this unlikely situation, instead of calling BUG_ON we now wait for the count to drop down a bit. This requires a new wait_queue_head, some waiting code, and a wakeup call. Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower in that case...). Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-09 09:25:47 -08:00
Neil Brown	387bb17374	[PATCH] md: fix various bugs with aligned reads in RAID5 It is possible for raid5 to be sent a bio that is too big for an underlying device. So if it is a READ that we pass stright down to a device, it will fail and confuse RAID5. So in 'chunk_aligned_read' we check that the bio fits within the parameters for the target device and if it doesn't fit, fall back on reading through the stripe cache and making lots of one-page requests. Note that this is the earliest time we can check against the device because earlier we don't have a lock on the device, so it could change underneath us. Also, the code for handling a retry through the cache when a read fails has not been tested and was badly broken. This patch fixes that code. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Kai" <epimetreus@fastmail.fm> Cc: <stable@suse.de> Cc: <org@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-09 09:25:46 -08:00
NeilBrown	c20086de93	[PATCH] md: remove unnecessary printk when raid5 gets an unaligned read. raid5_mergeable_bvec tries to ensure that raid5 never sees a read request that does not fit within just one chunk. However as we must always accept a single-page read, that is not always possible. So when "in_chunk_boundary" fails, it might be unusual, but it is not a problem and printing a message every time is a bad idea. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
NeilBrown	2a2275d630	[PATCH] md: fix potential memalloc deadlock in md If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held, it is possible for a deadlock to eventuate. This happens if the array was marked 'clean', and the memalloc triggers a write-out to the md device. For the writeout to succeed, the array must be marked 'dirty', and that requires getting the mddev_lock. So, before attempting a GFP_KERNEL allocation while holding the lock, make sure the array is marked 'dirty' (unless it is currently read-only). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
Jun'ichi Nomura	bfa152fa5e	[PATCH] dm-multipath: fix stall on noflush suspend/resume Allow noflush suspend/resume of device-mapper device only for the case where the device size is unchanged. Otherwise, dm-multipath devices can stall when resumed if noflush was used when suspending them, all paths have failed and queue_if_no_path is set. Explanation: 1. Something is doing fsync() on the block dev, holding inode->i_sem 2. The fsync write is blocked by all-paths-down and queue_if_no_path 3. Someone requests to suspend the dm device with noflush. Pending writes are left in queue. 4. In the middle of dm_resume(), __bind() tries to get inode->i_sem to do __set_size() and waits forever. 'noflush suspend' is a new device-mapper feature introduced in early 2.6.20. So I hope the fix being included before 2.6.20 is released. Example of reproducer: 1. Create a multipath device by dmsetup 2. Fail all paths during mkfs 3. Do dmsetup suspend --noflush and load new map with healthy paths 4. Do dmsetup resume Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:51:00 -08:00
NeilBrown	f49d5e62d9	[PATCH] md: avoid reading past the end of a bitmap file In most cases we check the size of the bitmap file before reading data from it. However when reading the superblock, we always read the first PAGE_SIZE bytes, which might not always be appropriate. So limit that read to the size of the file if appropriate. Also, we get the count of available bytes wrong in one place, so that too can read past the end of the file. Cc: "yang yin" <yinyang801120@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
NeilBrown	1031be7a5f	[PATCH] md: make sure the events count in an md array never returns to zero Now that we sometimes step the array events count backwards (when transitioning dirty->clean where nothing else interesting has happened - so that we don't need to write to spares all the time), it is possible for the event count to return to zero, which is potentially confusing and triggers and MD_BUG. We could possibly remove the MD_BUG, but is just as easy, and probably safer, to make sure we never return to zero. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
NeilBrown	3eda22d19b	[PATCH] md: make 'repair' actually work for raid1 When 'repair' finds a block that is different one the various parts of the mirror. it is meant to write a chosen good version to the others. However it currently writes out the original data to each. The memcpy to make all the data the same is missing. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-01-26 13:50:59 -08:00
Lars Ellenberg	e3881a6816	[PATCH] md: pass down BIO_RW_SYNC in raid{1,10} md raidX make_request functions strip off the BIO_RW_SYNC flag, thus introducing additional latency. Fixing this in raid1 and raid10 seems to be straightforward enough. For our particular usage case in DRBD, passing this flag improved some initialization time from ~5 minutes to ~5 seconds. Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Lars Ellenberg <lars@linbit.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2007-01-11 18:18:21 -08:00
NeilBrown	3f9d7b0d81	[PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md While developing more functionality in mdadm I found some bugs in md... - When we remove a device from an inactive array (write 'remove' to the 'state' sysfs file - see 'state_store') would should not update the superblock information - as we may not have read and processed it all properly yet. - initialise all raid_disk entries to '-1' else the 'slot sysfs file will claim '0' for all devices in an array before the array is started. - all '\n' not to be present at the end of words written to sysfs files - when we use SET_ARRAY_INFO to set the md metadata version, set the flag to say that there is persistant metadata. - allow GET_BITMAP_FILE to be called on an array that hasn't been started yet. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-22 08:55:51 -08:00
NeilBrown	802ba064c4	[PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly Thanks Jens for alerting me to this. Cc: Jens Axboe <jens.axboe@oracle.com> Cc: <raziebe@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-13 09:05:48 -08:00
Herbert Xu	3263263f70	[CRYPTO] dm-crypt: Select CRYPTO_CBC As CBC is the default chaining method for cryptoloop, we should select it from cryptoloop to ease the transition. Spotted by Rene Herman. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 10:18:57 -08:00
NeilBrown	1757128438	[PATCH] md: assorted md and raid1 one-liners Fix few bugs that meant that: - superblocks weren't alway written at exactly the right time (this could show up if the array was not written to - writting to the array causes lots of superblock updates and so hides these errors). - restarting device recovery after a clean shutdown (version-1 metadata only) didn't work as intended (or at all). 1/ Ensure superblock is updated when a new device is added. 2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync. The body of this if takes one of two branches depending on whether MD_RECOVERY_SYNC is set, so testing it in the clause of the if is wrong. 3/ Flag superblock for updating after a resync/recovery finishes. 4/ If we find the neeed to restart a recovery in the middle (version-1 metadata only) make sure a full recovery (not just as guided by bitmaps) does get done. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	c2b00852fb	[PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error to higher levels. While this should be sufficient, it is safer to explicitly set the error code as well - less room for confusion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	b8c6b64556	[PATCH] md: remove some old ifdefed-out code from raid5.c There are some vestiges of old code that was used for bypassing the stripe cache on reads in raid5.c. This was never updated after the change from buffer_heads to bios, but was left as a reminder. That functionality has nowe been implemented in a completely different way, so the old code can go. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
Jeff Garzik	fdee8ae449	[PATCH] MD: conditionalize some code The autorun code is only used if this module is built into the static kernel image. Adjust #ifdefs accordingly. Signed-off-by: Jeff Garzik <jeff@garzik.org> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
NeilBrown	b875e531fc	[PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx stripe_to_pdidx finds the index of the parity disk for a given stripe. It assumes raid5 in that it uses "disks-1" to determine the number of data disks. This is incorrect for raid6 but fortunately the two usages cancel each other out. The only way that 'data_disks' affects the calculation of pd_idx in raid5_compute_sector is when it is divided into the sector number. But as that sector number is calculated by multiplying in the wrong value of 'data_disks' the division produces the right value. So it is innocuous but needs to be fixed. Also change the calculation of raid_disks in compute_blocknr to make it more obviously correct (it seems at first to always use disks-1 too). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:21 -08:00
Raz Ben-Jehuda(caro)	5248861511	[PATCH] md: enable bypassing cache for reads Call the chunk_aligned_read where appropriate. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	46031f9a38	[PATCH] md: allow reads that have bypassed the cache to be retried on failure If a bypass-the-cache read fails, we simply try again through the cache. If it fails again it will trigger normal recovery precedures. update 1: From: NeilBrown <neilb@suse.de> 1/ chunk_aligned_read and retry_aligned_read assume that data_disks == raid_disks - 1 which is not true for raid6. So when an aligned read request bypasses the cache, we can get the wrong data. 2/ The cloned bio is being used-after-free in raid5_align_endio (to test BIO_UPTODATE). 3/ We forgot to add rdev->data_offset when submitting a bio for aligned-read 4/ clone_bio calls blk_recount_segments and then we change bi_bdev, so we need to invalidate the segment counts. 5/ We don't de-reference the rdev when the read completes. This means we need to record the rdev to so it is still available in the end_io routine. Fortunately bi_next in the original bio is unused at this point so we can stuff it in there. 6/ We leak a cloned bio if the target rdev is not usable. From: NeilBrown <neilb@suse.de> update 2: 1/ When aligned requests fail (read error) they need to be retried via the normal method (stripe cache). As we cannot be sure that we can process a single read in one go (we may not be able to allocate all the stripes needed) we store a bio-being-retried and a list of bioes-that-still-need-to-be-retried. When find a bio that needs to be retried, we should add it to the list, not to single-bio... 2/ We were never incrementing 'scnt' when resubmitting failed aligned requests. [akpm@osdl.org: build fix] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	f679623f50	[PATCH] md: handle bypassing the read cache (assuming nothing fails) Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)	23032a0eb9	[PATCH] md: define raid5_mergeable_bvec This will encourage read request to be on only one device, so we will often be able to bypass the cache for read requests. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
NeilBrown	0d4ca600fc	[PATCH] md: tidy up device-change notification when an md array is stopped An md array can be stopped leaving all the setting still in place, or it can torn down and destroyed. set_capacity and other change notifications only happen in the latter case, but should happen in both. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-10 09:57:20 -08:00
Adrian Bunk	c642f9e03b	[PATCH] make drivers/md/dm-snap.c:ksnapd static Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Jonathan E Brassow	88b20a1a71	[PATCH] dm: raid1: reset sync_search on resume Reset sync_search on resume. The effect is to retry syncing all out-of-sync regions when a mirror is resumed, including ones that previously failed. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Jonathan E Brassow	f3ee6b2f62	[PATCH] dm: log: rename complete_resync_work The complete_resync_work function only provides the ability to change an out-of-sync region to in-sync. This patch enhances the function to allow us to change the status from in-sync to out-of-sync as well, something that is needed when a mirror write to one of the devices or an initial resync on a given region fails. Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Milan Broz	31c93a0c29	[PATCH] dm: snapshot: abstract memory release Move the code that releases memory used by a snapshot into a separate function. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	45e157206c	[PATCH] dm: mpath: use noflush suspending Implement the pushback feature for the multipath target. The pushback request is used when: 1) there are no valid paths; 2) queue_if_no_path was set; 3) a suspend is being issued with the DMF_NOFLUSH_SUSPENDING flag. Otherwise bios are returned to applications with -EIO. To check whether queue_if_no_path is specified or not, you need to check both queue_if_no_path and saved_queue_if_no_path, because presuspend saves the original queue_if_no_path value to saved_queue_if_no_path. The check for 1 already exists in both map_io() and do_end_io(). So this patch adds __must_push_back() to check 2 and 3. Test results: See the test results in the preceding patch. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	2e93ccc193	[PATCH] dm: suspend: add noflush pushback In device-mapper I/O is sometimes queued within targets for later processing. For example the multipath target can be configured to store I/O when no paths are available instead of returning it -EIO. This patch allows the device-mapper core to instruct a target to transfer the contents of any such in-target queue back into the core. This frees up the resources used by the target so the core can replace that target with an alternative one and then resend the I/O to it. Without this patch the only way to change the target in such circumstances involves returning the I/O with an error back to the filesystem/application. In the multipath case, this patch will let us add new paths for existing I/O to try after all the existing paths have failed. DMF_NOFLUSH_SUSPENDING ---------------------- If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It is always cleared before dm_suspend() returns. The flag must be visible while the target is flushing pending I/Os so it is set before presuspend where the flush starts and unset after the wait for md->pending where the flush ends. Target drivers can check this flag by calling dm_noflush_suspending(). DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE ----------------------------------- A target's map() function can now return DM_MAPIO_REQUEUE to request the device mapper core queue the bio. Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request the same. This has been labelled 'pushback'. The __map_bio() and clone_endio() functions in the core treat these return values as errors and call dec_pending() to end the I/O. dec_pending ----------- dec_pending() saves the pushback request in struct dm_io->error. Once all the split clones have ended, dec_pending() will put the original bio on the md->pushback list. Note that this supercedes any I/O errors. It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while in progress (e.g. by user interrupt). dec_pending() checks for this and returns -EIO if it happened. pushdback list and pushback_lock -------------------------------- The bio is queued on md->pushback temporarily in dec_pending(), and after all pending I/Os return, md->pushback is merged into md->deferred in dm_suspend() for re-issuing at resume time. md->pushback_lock protects md->pushback. The lock should be held with irq disabled because dec_pending() can be called from interrupt context. Queueing bios to md->pushback in dec_pending() must be done atomically with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is held when checking the flag. Otherwise dec_pending() may queue a bio to md->pushback after the interrupted dm_suspend() flushes md->pushback. Then the bio would be left in md->pushback. Flag setting in dm_suspend() can be done without md->pushback_lock because the flag is checked only after presuspend and the set value is already made visible via the target's presuspend function. The flag can be checked without md->pushback_lock (e.g. the first part of the dec_pending() or target drivers), because the flag is checked again with md->pushback_lock held when the bio is really queued to md->pushback as described above. So even if the flag is cleared after the lockless checkings, the bio isn't left in md->pushback but returned to applications with -EIO. Other notes on the current patch -------------------------------- - md->pushback is added to the struct mapped_device instead of using md->deferred directly because md->io_lock which protects md->deferred is rw_semaphore and can't be used in interrupt context like dec_pending(), and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too. - Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG ioctl option is specified, because I/Os generated by lock_fs() would be pushed back and never return if there were no valid devices. - If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING flag is set, md->pushback must be flushed because I/Os may be queued to the list already. (flush_and_out label in dm_suspend()) Test results ------------ I have tested using multipath target with the next patch. The following tests are for regression/compatibility: - I/Os succeed when valid paths exist; - I/Os fail when there are no valid paths and queue_if_no_path is not set; - I/Os are queued in the multipath target when there are no valid paths and queue_if_no_path is set; - The queued I/Os above fail when suspend is issued without the DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also fail. The following tests are for the normal code path of new pushback feature: - Queued I/Os in the multipath target are flushed from the target but don't return when suspend is issued with the DM_NOFLUSH_FLAG ioctl option; - The I/Os above are queued in the multipath target again when resume is issued without path recovery; - The I/Os above succeed when resume is issued after path recovery or table load; - Queued I/Os in the multipath target succeed when resume is issued with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os spanning 2 multipath targets also succeed. The following tests are for the error paths of the new pushback feature: - When the bdget_disk() fails in dm_suspend(), the DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the pushback list are flushed properly. - When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted, o I/Os which had already been queued to the pushback list at the time don't return, and are re-issued at resume time; o I/Os which hadn't been returned at the time return with EIO. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	81fdb096db	[PATCH] dm: ioctl: add noflush suspend Provide a dm ioctl option to request noflush suspending. (See next patch for what this is for.) As the interface is extended, the version number is incremented. Other than accepting the new option through the interface, There is no change to existing behaviour. Test results: Confirmed the option is given from user-space correctly. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	d2a7ad29a8	[PATCH] dm: map and endio symbolic return codes Update existing targets to use the new symbols for return values from target map and end_io functions. There is no effect on behaviour. Test results: Done build test without errors. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	45cbcd7983	[PATCH] dm: map and endio return code clarification Tighten the use of return values from the target map and end_io functions. Values of 2 and above are now explictly reserved for future use. There are no existing targets using such values. The patch has no effect on existing behaviour. o Reserve return values of 2 and above from target map functions. Any positive value currently indicates "mapping complete", but all existing drivers use the value 1. We now make that a requirement so we can assign new meaning to higher values in future. The new definition of return values from target map functions is: < 0 : error = 0 : The target will handle the io (DM_MAPIO_SUBMITTED). = 1 : Mapping completed (DM_MAPIO_REMAPPED). > 1 : Reserved (undefined). Previously this was the same as '= 1'. o Reserve return values of 2 and above from target end_io functions for similar reasons. DM_ENDIO_INCOMPLETE is introduced for a return value of 1. Test results: I have tested by using the multipath target. I/Os succeed when valid paths exist. I/Os are queued in the multipath target when there are no valid paths and queue_if_no_path is set. I/Os fail when there are no valid paths and queue_if_no_path is not set. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	a3d77d35be	[PATCH] dm: suspend: parameter change Change the interface of dm_suspend() so that we can pass several options without increasing the number of parameters. The existing 'do_lockfs' integer parameter is replaced by a flag DM_SUSPEND_LOCKFS_FLAG. There is no functional change to the code. Test results: I have tested 'dmsetup suspend' command with/without the '--nolockfs' option and confirmed the do_lockfs value is correctly set. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:09 -08:00
Kiyoshi Ueda	7485936463	[PATCH] dm: tidy core formatting Remove unnecessary spaces in dm.c. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:08 -08:00
Heinz Mauelshagen	f00b16ad66	[PATCH] dm io: fix bi_max_vecs The existing code allocates an extra slot in bi_io_vec[] and uses it to store the region number. This patch hides the extra slot from bio_add_page() so the region number can't get overwritten. Also remove a hard-coded SECTOR_SHIFT and fix a typo in a comment. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:29:08 -08:00
David Howells	f0d1b0b30d	[PATCH] LOG2: Implement a general integer log2 facility in the kernel This facility provides three entry points: ilog2() Log base 2 of unsigned long ilog2_u32() Log base 2 of u32 ilog2_u64() Log base 2 of u64 These facilities can either be used inside functions on dynamic data: int do_something(long q) { ...; y = ilog2(x) ...; } Or can be used to statically initialise global variables with constant values: unsigned n = ilog2(27); When performing static initialisation, the compiler will report "error: initializer element is not constant" if asked to take a log of zero or of something not reducible to a constant. They treat negative numbers as unsigned. When not dealing with a constant, they fall back to using fls() which permits them to use arch-specific log calculation instructions - such as BSR on x86/x86_64 or SCAN on FRV - if available. [akpm@osdl.org: MMC fix] Signed-off-by: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David Howells <dhowells@redhat.com> Cc: Wojtek Kaniewski <wojtekka@toxygen.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:51 -08:00
Josef Sipek	c649bb9c55	[PATCH] struct path: convert md Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:47 -08:00
Josef "Jeff" Sipek	c922d5f7f5	[PATCH] struct path: rename DM's struct path Rename DM's struct path to struct dm_path to prevent name collision between it and struct path from fs/namei.c. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Acked-by: Alasdair G Kergon <agk@redhat.com> Cc: <reiserfs-dev@namesys.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:40 -08:00
NeilBrown	d63a5a74de	[PATCH] lockdep: avoid lockdep warning in md md_open takes ->reconfig_mutex which causes lockdep to complain. This (normally) doesn't have deadlock potential as the possible conflict is with a reconfig_mutex in a different device. I say "normally" because if a loop were created in the array->member hierarchy a deadlock could happen. However that causes bigger problems than a deadlock and should be fixed independently. So we flag the lock in md_open as a nested lock. This requires defining mutex_lock_interruptible_nested. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:39 -08:00
Peter Zijlstra	2e7b651df1	[PATCH] remove the old bd_mutex lockdep annotation Remove the old complex and crufty bd_mutex annotation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-08 08:28:38 -08:00
Linus Torvalds	2685b267bc	Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits) [NETFILTER]: Fix non-ANSI func. decl. [TG3]: Identify Serdes devices more clearly. [TG3]: Use msleep. [TG3]: Use netif_msg_*. [TG3]: Allow partial speed advertisement. [TG3]: Add TG3_FLG2_IS_NIC flag. [TG3]: Add 5787F device ID. [TG3]: Fix Phy loopback. [WANROUTER]: Kill kmalloc debugging code. [TCP] inet_twdr_hangman: Delete unnecessary memory barrier(). [NET]: Memory barrier cleanups [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries. audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n audit: Add auditing to ipsec [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n [IrDA]: Incorrect TTP header reservation [IrDA]: PXA FIR code device model conversion [GENETLINK]: Fix misplaced command flags. [NETLIK]: Add a pointer to the Generic Netlink wiki page. [IPV6] RAW: Don't release unlocked sock. ...	2006-12-07 09:05:15 -08:00
Nigel Cunningham	7dfb71030f	[PATCH] Add include/linux/freezer.h and move definitions from sched.h Move process freezing functions from include/linux/sched.h to freezer.h, so that modifications to the freezer or the kernel configuration don't require recompiling just about everything. [akpm@osdl.org: fix ueagle driver] Signed-off-by: Nigel Cunningham <nigel@suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-07 08:39:27 -08:00
Christoph Lameter	e18b890bb0	[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name ".c" -o -name ".h"\|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-07 08:39:25 -08:00
Herbert Xu	79066ad32b	[CRYPTO] dm-crypt: Make iv_gen_private a union Rather than stuffing integers into pointers with casts, let's use a union. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-06 18:39:01 -08:00
Herbert Xu	45789328e5	[BLOCK] dm-crypt: Align IV to u64 for essiv This patch makes the IV u64-aligned since essiv does a u64 store to it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2006-12-06 18:38:48 -08:00
Rik Snel	48527fa7cf	[BLOCK] dm-crypt: benbi IV, big endian narrow block count for LRW-32-AES LRW-32-AES needs a certain IV. This IV should be provided dm-crypt. The block cipher mode could, in principle generate the correct IV from the plain IV, but I think that it is cleaner to supply the right IV directly. The sector -> narrow block calculation uses a shift for performance reasons. This shift is computed in .ctr and stored in cc->iv_gen_private (as a void *). Signed-off-by: Rik Snel <rsnel@cube.dyndns.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2006-12-06 18:38:47 -08:00
David Howells	c4028958b6	WorkStruct: make allyesconfig Fix up for make allyesconfig. Signed-Off-By: David Howells <dhowells@redhat.com>	2006-11-22 14:57:56 +00:00
Rafael J. Wysocki	4b438a23fb	[PATCH] md: do not freeze md threads for suspend If there's a swap file on a software RAID, it should be possible to use this file for saving the swsusp's suspend image. Also, this file should be available to the memory management subsystem when memory is being freed before the suspend image is created. For the above reasons it seems that md_threads should not be frozen during the suspend and the appended patch makes this happen, but then there is the question if they don't cause any data to be written to disks after the suspend image has been created, provided that all filesystems are frozen at that time. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-11-08 18:29:24 -08:00
NeilBrown	0692c6b1cf	[PATCH] md: fix sizing problem with raid5-reshape and CONFIG_LBD=n I forgot to has the size-in-blocks to (loff_t) before shifting up to a size-in-bytes. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-11-08 18:29:24 -08:00
NeilBrown	2f47130361	[PATCH] md: change ONLINE/OFFLINE events to a single CHANGE event It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons (not least of which being that udev understands it already). So remove the recently added KOBJ_OFFLINE (no-one is likely to care anyway) and change the ONLINE to a CHANGE event Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-11-08 18:29:23 -08:00
Jonathan E Brassow	33184048dc	[PATCH] dm: raid1: fix waiting for io on suspend All device-mapper targets must complete outstanding I/O before suspending. The mirror target generates I/O in its recovery phase and fails to wait for it. It needs to be tracked so we can ensure that it has completed before we suspend. [akpm@osdl.org: cleanup] Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: <dm-devel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-11-08 18:29:23 -08:00

... 3 4 5 6 7 ...

929 Commits