Fix dm-raid flush support.
Both md and dm have support for flush, but the dm-raid target
forgot to set the flag to indicate that flushes should be
passed on. (Important for data integrity e.g. with writeback cache
enabled.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The 'rebuild' parameter is used to rebuild individual devices in an
array (e.g. resynchronize a RAID1 device or recalculate a parity device
in higher RAID). The MD_CHANGE_DEVS flag must be set when this
parameter is given in order to write out the superblocks and make the
change take immediate effect. The code that handles new devices in
super_load already sets MD_CHANGE_DEVS and 'FirstUse'. (The 'FirstUse'
flag was being set as a special case for rebuilds in
super_init_validation.)
Add a condition for rebuilds in super_load to take care of both flags
without the special case in 'super_init_validation'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Correct the number of mapped sectors shown on a thin device's
status line by decrementing td->mapped_blocks in __remove() each time
a block is removed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If dm_sm_disk_create() fails the superblock must be unlocked.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The __open_device() error paths in __create_thin() and __create_snap()
incorrectly call __close_device() even if td was not initialized by
__open_device(). Remove this.
Also document __open_device() return values, remove a redundant
td->changed = 1 in __create_thin(), and insert an additional
safeguard against creating an already-existing device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The following BUG is hit on the first read that is submitted to a dm
flakey test device while the device is "down" if the corrupt_bio_byte
feature wasn't requested when the device's table was loaded.
Example DM table that will hit this BUG:
0 2097152 flakey 8:0 2048 0 30
This bug was introduced by commit a3998799fb
(dm flakey: add corrupt_bio_byte feature) in v3.1-rc1.
BUG: unable to handle kernel paging request at ffff8801cfce3fff
IP: [<ffffffffa008c233>] corrupt_bio_data+0x6e/0xae [dm_flakey]
PGD 1606063 PUD 0
Oops: 0002 [#1] SMP
...
Call Trace:
<IRQ>
[<ffffffffa008c2b5>] flakey_end_io+0x42/0x48 [dm_flakey]
[<ffffffffa00dca98>] clone_endio+0x54/0xb6 [dm_mod]
[<ffffffff81130587>] bio_endio+0x2d/0x2f
[<ffffffff811c819a>] req_bio_endio+0x96/0x9f
[<ffffffff811c94b9>] blk_update_request+0x1dc/0x3a9
[<ffffffff812f5ee2>] ? rcu_read_unlock+0x21/0x23
[<ffffffff811c96a6>] blk_update_bidi_request+0x20/0x6e
[<ffffffff811c9713>] blk_end_bidi_request+0x1f/0x5d
[<ffffffff811c978d>] blk_end_request+0x10/0x12
[<ffffffff8128f450>] scsi_io_completion+0x1e5/0x4b1
[<ffffffff812882a9>] scsi_finish_command+0xec/0xf5
[<ffffffff8128f830>] scsi_softirq_done+0xff/0x108
[<ffffffff811ce284>] blk_done_softirq+0x84/0x98
[<ffffffff81048d19>] __do_softirq+0xe3/0x1d5
[<ffffffff8138f83f>] ? _raw_spin_lock+0x62/0x69
[<ffffffff810997cf>] ? handle_irq_event+0x4c/0x61
[<ffffffff8139833c>] call_softirq+0x1c/0x30
[<ffffffff81003b37>] do_softirq+0x4b/0xa3
[<ffffffff81048a39>] irq_exit+0x53/0xca
[<ffffffff81398acd>] do_IRQ+0x9d/0xb4
[<ffffffff81390333>] common_interrupt+0x73/0x73
...
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch fixes a crash by recognising discards in dm_io.
Currently dm_mirror can send REQ_DISCARD bios if running over a
discard-enabled device and without support in dm_io the system
crashes badly.
BUG: unable to handle kernel paging request at 00800000
IP: __bio_add_page.part.17+0xf5/0x1e0
...
bio_add_page+0x56/0x70
dispatch_io+0x1cf/0x240 [dm_mod]
? km_get_page+0x50/0x50 [dm_mod]
? vm_next_page+0x20/0x20 [dm_mod]
? mirror_flush+0x130/0x130 [dm_mirror]
dm_io+0xdc/0x2b0 [dm_mod]
...
Introduced in 2.6.38-rc1 by commit 5fc2ffeabb
(dm raid1: support discard).
Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@kernel.org
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If 'argc' is zero we jump to the 'out:' label, but this leaks the
(unused) memory that 'dm_split_args()' allocated for 'argv' if the
string being split consisted entirely of whitespace. Jump to the
'out_argv:' label instead to free up that memory.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2 relate to the recently added drive replacement.
One causes read error in RAID10 to sometimes be retried indefinitely.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAT1VI1znsnt1WYoG5AQK47Q//d51y5QCpABFNUcgIM626zJXlBWFUSmzU
wFOGXh5emN6/TWguzkiZwrvcspDmXMzz1zmJtGWixYb2jBpn2MHEN4uNz3Vq68w+
IYk/dJg/CG4+lzX+6IjiHOb3+TASRx94QZHJASx68vypqniAyikshqcbUeZBMTB0
Fu+sKqsOGYmwQfe6/vtRPVXY7DYK2dFDBRMFpmOl+o4Y2XxmmWzMw4Dg1RIEdtFS
Jo9GwLHTnlw2xoc0XooufeT0Q2KOpqi9T8L6Nj0ORwpgsFqgtZ/kIOoGU6qOpSri
ofLTrobVKMpjFtmiYVOp9TaBlPnd/TNX3E4WPLGNsAwYuRUFjq8evmJKjG+pOdeB
3ArxRKRJCaI2jnVhH+NpT7i/tpkEg/8a/BoOAihX+hM/8QkmsWluaRBOGMhpuuuc
1baPVTusi/zijO9cM8RGIXaQj5UG4s3LUpCIOIYdDyxsfmAH5KN1F2EPrU4NMME2
96THSshIZLkgAg5ICwtva0qoHlBlEclAlVAzEomT7R9KwHojEB1xUiyMmaIdMFoy
JjGFAMp2E5+KBKZ1eYEHjthPWCb+nZ3eYHUh0DOnEt4kASCXnn45GJREQkpkNIR/
HhDTS8vI743unKnbCtYFMxiw/9OXZbMkdoZhobg7lxcpoQlWJ+5ziOtACl0h0Kv8
+ET+Kp3W8K4=
=93ms
-----END PGP SIGNATURE-----
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Three fixes for md in 3.3-rc: Two relate to the recently added drive
replacement. One fixes the problem where a read error in RAID10 would
sometimes be retried indefinitely."
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md/raid10: fix assembling of arrays with replacement devices.
md/raid10: fix handling of error on last working device in array.
md/raid1: fix buglet in md_raid1_contested.
commit 56a2559bb6 (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status if the device, but it didn't remove the
old unconditional setting of 'rdev'. So it was largely ineffective.
So remove that now.
Signed-off-by: NeilBrown <neilb@suse.de>
If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.
This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
was the last usable device. This is efficient but a little racy
and will sometimes retry when it should not.
2/ in handle_read_error we are careful to exclude any device from
retry which we tried to mark as faulty (that might have failed if
it was the last device). This is race-free but less efficient.
Signed-off-by: NeilBrown <neilb@suse.de>
Since we added 'replacement' capability, RAID1 can have twice
as many devices as ->raid_disks indicates.
So md_raid1_congested needs to check that many possible devices,
not just ->raid_disks many.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATzMdiTnsnt1WYoG5AQKICw/9H3Xf/3crCCVRQ+yzSdZ1ZJH24Rps9O6W
8dLFN4/Ng/qxymWUMrgHAMq5MEEz2M3i7W+j23lFv6Oce06y8GJ4PpoYY5xlXCgO
SIU1BaO1JFHxQn89EQtP3iOn4AOiZvX0GUObR0P8KO1mMnLmN7cg8J1kBfmQiBKu
aXcUqqNvcywoix6ve4O/xgnZjd4IExxqG3W8U7CaIwExUDwaLY4NckxJcIJbIYy9
iapOGMUdcyr6xm819V/xE2DyAtfFCtvAk1hfW/dM4QQctran3MzQIRFn9RW+CwHU
ComEnv5ti/7g//JPXQArUPk4xgRHrMhqFcmmD8rozJ6FJDi8vw2e0BXaRLVqa0mK
1qSZkr0Ot3nwAdILzgSbNXQ0Y5OJgc9OLX5GGlVibTW2VTJYFgA7jAsnqq8PAJC5
sU5h2K3jrSy2unGy6BxleL5D/wvREE5OBnW35TEB5TYbxjp1FLgn+BWp8FfFUYWT
Eb2cIyAj6cBFJ3ma1K0RH0dmS9cbNjuG+CLiApJOnEEsXzrp/4KnqOwg4672ewW3
m1Ue2Qv+0avaK3sVyT+qzuemc6b0ps/dix0gMXw2pYqXQWHquW5NdUJcgD2DKFSn
BB734nUP6KlPg0IFh1eehRHyVRLIAot/uBlUJ3bMx9xeYCkKa+twX90u6EmjTopP
JjLxNsf6c2I=
=k0Xz
-----END PGP SIGNATURE-----
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Some simple md-related fixes.
1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md: two small fixes to handling interrupt resync.
Prevent DM RAID from loading bitmap twice.
1/ If a resync is aborted we should record how far we got
(recovery_cp) the last request that we know has completed
(->curr_resync_completed) rather than the last request that was
submitted (->curr_resync).
2/ When a resync aborts we still want to update the metadata with
any changes, so set MD_CHANGE_DEVS even if we 'skip'.
Signed-off-by: NeilBrown <neilb@suse.de>
The life cycle of a device-mapper target is:
1) create
2) resume
3) suspend
*) possibly repeat from 2
4) destroy
The dm-raid target is unconditionally calling MD's bitmap_load function upon
every resume. If steps 2 & 3 above are repeated, bitmap_load is called
multiple times. It is only written to be called once; otherwise, it allocates
new memory for the bitmap (without freeing the old) and incrementing the number
of pages it thinks it has without zeroing first. This ultimately leads to
access beyond allocated memory and lost memory.
Simply avoiding the bitmap_load call upon resume is not sufficient. If the
target was suspended while the initial recovery was only partially complete,
it needs to be restarted when the target is resumed. This is why
'md_wakeup_thread' is called before issuing the 'mddev_resume'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
A logical volume can map to just part of underlying physical volume.
In this case, it must be treated like a partition.
Based on a patch from Alasdair G Kergon.
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON. Has been tagged for -stable.
The other is minor missing functionality.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATwy6Sznsnt1WYoG5AQL+5A//TbTgElZaJ7IMY4q658afuRNtuWfevqTs
4EoSUvarwyZN20JxUd4dFTzLQ3nu3XVmwZsDBbpRs7+Dt2m7Efp4qytqrTxHb6SR
4gOr1KFXZi2rQFNpIg8T5+eyb+2VkbHGYffOtwS9TZnJqZZ4upffJi1EpJSfB1Bo
ilkO8wcaNKVWzTgnQo+JVOLQQyNENs12Xc0aLVA0dZC0a37qWJTbr75r7nrtLT7A
Gy783AG8JglRsr7AOVceqBVOpRonhFDz7G2hQqHg140m6i/GzDJrPtadovCtq7nt
U6/Po7qbOj5eOSGrVPwS1gJQOT7deAL7Eeu7dOpbzl1Cwysbhg63piMNyDs4P/gM
bFsR+LTbmZiaYs5G1oDwN/WTYLeq6cxY0IftShWdGoQwZRF/woJ7VAQSWNvHY8mg
Z+EbEL3sY40+8eBk7/umT0WxQ9wYjooS/9ZowQ2ktRmt82Dwv0LXzWNTSlwhWKKt
QBtv1er/psEKFqb2zDtlea8gDlKahaVNaiOK6RuY5CM5iBa4/zEmWVXS/i07LC7Z
cW9swD4J3AEKSolWHWYQJBmCsKy+rUp5t0mQ5e/O4+nhCDbfe+Da0OArg6b/ygMu
14RdyjOENxSqKi3IkCnToch+eNzCIm3ETaS2E0nSv996G+ShqsLtROOI9x9DXiu3
nyLxAnIVp8I=
=969y
-----END PGP SIGNATURE-----
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Two bugfixes for md.
One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON. Has been tagged for -stable.
The other is minor missing functionality.
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md/raid1: perform bad-block tests for WriteMostly devices too.
md: notify the 'degraded' sysfs attribute on failure.
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.
This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.
Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.
With the current code, best_good_sectors might not get set and that
causes zero-length read requests to be send down which is very
confusing.
This bug was introduced in commit d2eb35acfd and so the patch
is suitable for 3.1.x and 3.2.x
Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.
Notifying on failure is a little awkward as it happen in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.
Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATwUdyDnsnt1WYoG5AQJBExAAuFQrzt7CN32nmiqaLlJ1snBC/ixOSJf7
0X88YB+utO0qNIhiRTI1AulslRGms9pChyKmZZaxU2Hvzk1pTrFspsc8RlJTIv9S
Si43due99Np08/Kf5rjYulyH0fcug0qIqlrCiBvc36pOIHZzam6IzrEvKFf0dqbe
4vtLWuylEUiSEbN5gBordfTFZcxSPI/huMUgx6hEpdA4NtaPmN57/03d49k4RYgp
f5l9htuIqdMgHwNJRUDGMooifuvILC5HXINUbsCFT6KUF/bxA75nW7W3C2BUq0V/
CVb/ZoHFYtuaGBEcMUzN34k0DbHghPeSmlvXT4XWq7+gVNqGe33nbSJsx1oLXWr1
m/b3j4Ublv+VVd7L1Rr40vTRB6wN5/2uN7SiD6d83ppPD0TuAY8YvMHgoLZQmQvh
Ak4fEz07re+tueKhHwbi+1qMIw/ciQ9O/tI7r+AsVklkAxJNAfFKW6FOEFL2cjEW
h4rbg1z9sU+xD8G01LBnJ0to/ajrJ4ch4wV/raLgi4+dJ4Tt3+/tas6WJlxHbyFF
IiHCFM0+KcxLcwNkalZYg/zr5qu7EcwpPKLnc68m+LjXlYVWcnNHwv5WnOCHHQTw
6yurGGlKqBfvsb8zJJGSRWqEtRSYjDlZiMt10/u+H40HiCiLX//vSvVbZoFiwWu1
VzgEhzhvaYw=
=j00/
-----END PGP SIGNATURE-----
Merge tag 'md-3.3' of git://neil.brown.name/md
md update for 3.3
Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.
* tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
md/raid1: Mark device want_replacement when we see a write error.
md/raid1: If there is a spare and a want_replacement device, start replacement.
md/raid1: recognise replacements when assembling arrays.
md/raid1: handle activation of replacement device when recovery completes.
md/raid1: Allow a failed replacement device to be removed.
md/raid1: Allocate spare to store replacement devices and their bios.
md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.
md/raid10: If there is a spare and a want_replacement device, start replacement.
md/raid10: recognise replacements when assembling array.
md/raid10: Allow replacement device to be replace old drive.
md/raid10: handle recovery of replacement devices.
md/raid10: Handle replacement devices during resync.
md/raid10: writes should get directed to replacement as well as original.
md/raid10: allow removal of failed replacement devices.
md/raid10: preferentially read from replacement device if possible.
md/raid10: change read_balance to return an rdev
md/raid10: prepare data structures for handling replacement.
md/raid5: Mark device want_replacement when we see a write error.
md/raid5: If there is a spare and a want_replacement device, start replacement.
md/raid5: recognise replacements when assembling array.
...
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to add a spare to a RAID1 array, also consider
adding it as a replacement for a want_replacement device.
Signed-off-by: NeilBrown <neilb@suse.de>
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
Signed-off-by: NeilBrown <neilb@suse.de>
When recovery completes ->spare_active is called.
This checks if the replacement is ready and if so it fails
the original.
Signed-off-by: NeilBrown <neilb@suse.de>
In RAID1, a replacement is much like a normal device, so we just
double the size of the relevant arrays and look at all possible
devices for reads and writes.
This means that the array looks like it is now double the size in some
way - we need to be careful about that.
In particular, we checking if the array is still degraded while
creating a recovery request we need to only consider the first 'half'
- i.e. the real (non-replacement) devices.
Signed-off-by: NeilBrown <neilb@suse.de>
In general mddev->raid_disks can change unexpectedly while
conf->raid_disks will only change in a very controlled way. So change
some uses of one to the other.
The use of mddev->raid_disks will not cause actually problems but
this way is more consistent and safer in the long term.
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.
Signed-off-by: NeilBrown <neilb@suse.de>
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
Signed-off-by: NeilBrown <neilb@suse.de>
When recovery finish and spare_active is called, check for a
replace that might have just become fully synced and mark it
as such, marking the original as failed.
Then when the original is removed, move the replacement into
its position.
This means that 'replacement' and spontaneously become NULL in some
situations. Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.
Signed-off-by: NeilBrown <neilb@suse.de>
If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.
Signed-off-by: NeilBrown <neilb@suse.de>
If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.
If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.
Signed-off-by: NeilBrown <neilb@suse.de>
When writing, we need to submit two writes, one to the original,
and one to the replacements - if there is a replacement.
If the write to the replacement results in a write error we just
fail the device. We only try to record write errors to the
original.
This only handles writing new data. Writing for resync/recovery
will come later.
Signed-off-by: NeilBrown <neilb@suse.de>
When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.
Signed-off-by: NeilBrown <neilb@suse.de>
It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer make this more idiomatic.
This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.
Signed-off-by: NeilBrown <neilb@suse.de>
Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.
Also an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that WantReplacement drives are replaced cleanly, mark a drive
as WantReplacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.
This requires that common md code attempt hot_add even when the array
is not formally degraded.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When recovery completes - as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.
Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.
This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader might see one device twice, but never no devices.
Then the readers guard against using both devices, which could
only happen when writing.
Signed-off-by: NeilBrown <neilb@suse.de>
During recovery we want to write to the replacement but not
the original. So we have two new flags
- R5_NeedReplace if this stripe has a replacement that needs to
be written at some stage
- R5_WantReplace if NeedReplace, and the data is available, and
a 'sync' has been requested on this stripe.
We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.
Note that during resync we always write to any replacement device.
It might not need to be written to, but as we don't read to compare,
we have to write to be sure.
Signed-off-by: NeilBrown <neilb@suse.de>
When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.
If the write to the replacement results in a write error, we just fail
the device. We only try to record write errors to the original.
When writing for recovery, we shouldn't write to the original. This
will be addressed in a subsequent patch that generally addresses
recovery.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.
If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.
This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache. It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.
Signed-off-by: NeilBrown <neilb@suse.de>
We current initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.
Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>