IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Factor out code to reserve log space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This is the only user, and keeping all code initializing the io_unit
structure together improves readbility.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Set up bi_sector properly when we allocate an bio instead of updating it
at submission time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Split out a helper to allocate a bio for log writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Remove the only partially used local 'io' variable to simplify the code
flow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
For devices without a volatile write cache we don't need to send a FLUSH
command to ensure writes are stable on disk, and thus can avoid the whole
step of batching up bios for processing by the MD thread.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
After this series we won't nessecarily have flushed the cache for these
I/Os, so give the list a more neutral name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array. The only information
we need is the log sequence number, and the checkpoint offset of the
highest successfull writeback. Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.
Besides simplifying the code this also avoid having to keep the allocation
for the I/O unit around for a potentially long time as superblock updates
that checkpoint the log do not happen very often.
This also fixes the previously incorrect calculation of 'free' in
r5l_do_reclaim as a side effect: previous if took the last unit which
isn't checkpointed into account.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Move reclaim stop to quiesce handling, where is safer for this stuff.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not paticipate in resync.
In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
There is a case a stripe gets delayed forever.
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set
since construction can't run for new bio (the stripe is locked since
step 1)
Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked and the stripe will restart and run construction
for the new bio. With log, ops_run_io need to run two times. If the
DELAYED bit set, the stripe can't enter into the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave
stripe_end_ios list empty.
This applies on top of http://marc.info/?l=linux-raid&m=144122700510667
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Just keep __r5l_set_io_unit_state as a small set the state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption only reclaimer can free io_unit. We can add a reference count
based io_unit free, but since only reclaim can wait io_unit becoming to
STRIPE_END state, we use a simple global wait queue here.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in first post, sorry. This changes disk format, but we are still
ok in current stage.
V2: delete unnecessary type conversion as pointed out by Bart
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
The variable sctx->nr_regions has type unsigned long and the variable
nr_regions has type sector_t.
Thus the variables may be different when overflow happens.
Changed the conditional to "if (nr_regions >= ULONG_MAX)".
Also move the assignment of nr_regions after sector_div()
and the sanity check which looks more sane.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Only delay params are mentioned in delay.txt.
Mention offsets just like documents for linear and flakey do.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
All other error messages start capitalized.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
.map function of dm-delay returns return value of delay_bio(),
hence it's supposed to return using a defined DM_MAPIO macro.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Acked-By: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 72d94861 back in 2006 should have consistently removed
"dm-linear: " from all error messages.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_bm_unlock and dm_tm_unlock return an integer value but the returned
value is always 0. The calling code sometimes checks the return value
and sometimes doesn't.
Eliminate these unnecessary return values and also the checks for them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make
generic_make_request handle arbitrarily sized bios") makes it possible
for block devices to process large bios. In doing so that commit
allocates a new queue->bio_split bioset for each block device, this
bioset is used for allocating bios when the driver needs to split large
bios.
Each bioset allocates a workqueue process, thus the above commit
increases the number of processes allocated per block device.
DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
This reduces the number of allocated processes per bio-based DM device
from 3 to 2. Also remove the call to blk_queue_split(), it is not
needed because DM does its own splitting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
ffs counts bit starting with 1 (for the least significant bit), __ffs
counts bits starting with 0. This patch changes various occurrences of ffs
to __ffs and removes subtraction of 1 from the result.
Note that __ffs (unlike ffs) is not defined when called with zero
argument, but it is not called with zero argument in any of these cases.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove DM's unneeded NULL tests before calling these destroy functions,
now that they check for NULL, thanks to these v4.3 commits:
3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()")
4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()")
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This adds support to pass through persistent reservation requests
similar to the existing ioctl handling, and with the same limitations,
e.g. devices may only have a single target attached.
This is mostly intended for multipathing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This moves the call to blkdev_ioctl and the argument checking to DM core
code, and only leaves a callout to find the block device to operate on
in the targets. This simplifies the code and allows us to pass through
ioctl-like command using other methods in the next patch.
Also split out a helper around calling the prepare_ioctl method that
will be reused for persistent reservation handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit a1989b330093578ea5470bea0a00f940c444c466.
That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.
That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.
More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).
Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).
1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
$ cat sg_simple0.c
... see [3] ...
$ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
$ gcc sgio_inquiry.c -o sgio_inquiry
2) The ioctl() works fine with active paths present.
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 8:0:11:0 sdz 65:144 active undef running
| `- 9:0:9:0 sdbf 67:144 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 active undef running
`- 9:0:12:0 sdbo 68:32 active undef running
$ ./sgio_inquiry /dev/mapper/85ag56
Some of the INQUIRY command's response:
IBM 2145 0000
INQUIRY duration=0 millisecs, resid=0
3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
for unprivileged users (rather than blocking due to queue_if_no_path).
# for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
do multipathd -k"fail path $path"; done
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 8:0:11:0 sdz 65:144 failed undef running
| `- 9:0:9:0 sdbf 67:144 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 failed undef running
`- 9:0:12:0 sdbo 68:32 failed undef running
$ ./sgio_inquiry /dev/mapper/85ag56
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
$ dmesg
<...>
[] device-mapper: multipath: Failing path 65:144.
[] device-mapper: multipath: Failing path 67:144.
[] device-mapper: multipath: Failing path 65:224.
[] device-mapper: multipath: Failing path 68:32.
[] sgio_inquiry: sending ioctl 2285 to a partition!
5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
(then queueing happens -- in this example, queue_if_no_path is set);
this is due to a conditional check in scsi_verify_blk_ioctl().
# capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
# ./sgio_inquiry /dev/mapper/85ag56 &
[1] 72830
# cat /proc/72830/stack
[<c00000171c0df700>] 0xc00000171c0df700
[<c000000000015934>] __switch_to+0x204/0x350
[<c000000000152d4c>] msleep+0x5c/0x80
[<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
[<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
[<c0000000003128e4>] block_ioctl+0x64/0xd0
[<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
[<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
[<c000000000009358>] system_call+0x38/0xd0
6) This is the function call chain exercised in this analysis:
SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
-> do_vfs_ioctl()
-> vfs_ioctl()
...
error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
...
-> dm_blk_ioctl() @ drivers/md/dm.c
-> multipath_ioctl() @ drivers/md/dm-mpath.c
...
(bdev = NULL, due to no active paths)
...
if (!bdev || <...>) {
int err = scsi_verify_blk_ioctl(NULL, cmd);
if (err)
r = err;
}
...
-> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
...
if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
return 0;
...
if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
return 0;
...
printk_ratelimited(KERN_WARNING
"%s: sending ioctl %x to a partition!\n" <...>);
return -ENOIOCTLCMD;
<-
...
return r ? : <...>
<-
...
if (error == -ENOIOCTLCMD)
error = -ENOTTY;
out:
return error;
...
Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57.
This commit is poorly justified, I can find not discusison in email,
and it clearly causes a problem.
If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to. So the
recovery must start again from the beginning.
If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there. Before this reversion, we only
did the "full recovery from there" which is not corect. After this
reversion with will do a full recovery from the start, which is safer
but not ideal.
It will be left to a future patch to arrange the two different styles
of recovery.
Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
After commit 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()")
__find_stripe() is called under conf->hash_locks + hash.
But handle_stripe_clean_event() calls remove_hash() under
conf->device_lock.
Under some cirscumstances the hash chain can be circuited,
and we get an infinite loop with disabled interrupts and locked hash
lock in __find_stripe(). This leads to hard lockup on multiple CPUs
and following system crash.
I was able to reproduce this behavior on raid6 over 6 ssd disks.
The devices_handle_discard_safely option should be set to enable trim
support. The following script was used:
for i in `seq 1 32`; do
dd if=/dev/zero of=large$i bs=10M count=100 &
done
neilb: original was against a 3.x kernel. I forward-ported
to 4.3-rc. This verison is suitable for any kernel since
Commit: 59fc630b8b5f ("RAID5: batch adjacent full stripe write")
(v4.1+). I'll post a version for earlier kernels to stable.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Fixes: 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()")
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org> # 3.13 - 4.2
Commit bfebd1cdb497a57757c83f5fbf1a29931591e2a4 ("dm: add full blk-mq
support to request-based DM") moves the initialization of the fields
backing_dev_info.congested_fn, backing_dev_info.congested_data and
queuedata from the function dm_init_md_queue (that is called when the
device is created) to dm_init_old_md_queue (that is called after the
device type is determined).
There is no locking when accessing these variables, thus it is possible
for other parts of the kernel to briefly see this data in a transient
state (e.g. queue->backing_dev_info.congested_fn initialized and
md->queue->backing_dev_info.congested_data uninitialized, resulting in
passing an incorrect parameter to the function dm_any_congested).
This queue data is left initialized for blk-mq devices even though they
that don't use it.
Fixes: bfebd1cdb497 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v4.1+
Two fixes for bugs that are in both raid1 and raid10.
Both related to bad-block-lists and at least one needs
to be back ported to 3.1.
Also a revision for the "new" layout in raid10.
This "new" code (which aims to improve robustness) actually
reduces robustness in some cases.
It probably isn't in use at all as not public user-space code
makes use of these new layouts.
However just in case someone has their own code, it would be
good to get the WARNing out for them sooner.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWKxc0AAoJEDnsnt1WYoG5BkIP/0DmbcISl9eSMQt+k9E5B/IN
hk2Z5Q3BeMi7VhjKJDsGsg+oQM1p2ef/vvfhx1lgk34jiVrELRPdzvIlnv0XeQ7y
NwMGkV7KKTbrzvK42eR6XhpO1UdJ3FIiC2RCH/5fmRai5JqgQ4jRWzX4wGDL2p5d
ZKK63KUjnlrqrLtch/kAxeynQbAWhtefzRKfspiUVtnaLD9sIhUwMS+IZDPYHbhd
YMowQEquQW0uEmiyX0j/XNgw14yLau5zXjSZ0SvtDfa+IAiAlHQWpxhatA3Vj0NA
xxKrUjYD41Rkzrm9dLfRtgG1U8Wq51q12wg6McY+i1glrR8d6AISe8PczfQsqRWu
TjKPHqfhENemSMHOxW+8NB9K6BXV7W/rCH4t6iUMG00KTGhVJNPt0T94yYI1p2Yc
Fs+dR6rYILS5whXbRpeLAVZ4Np53eka9O/Wo2qoujPgIOfNrG/Ed3Lfqylb7jk7Z
B8jalgn+99Bok9DuCg/HFtGLrU3KN/BjWdet9YX/Z8zBQifrfroATLOq5PJtuSpI
0STtZ6cOZYwvb70XC1w6eNPgQxz6rzJbPHDjwZ0woKYe4Bh+ZtCBJq3ufwJ/rqRV
OXmCceFO8KtK3/zJqeZOd0eNEkxiXDaKfJ0Ut6t7/kumCflE/tS4lyOpu8ptdT7s
hnATXrkvrL+6vtT1owAJ
=96PZ
-----END PGP SIGNATURE-----
Merge tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Some raid1/raid10 fixes.
I meant to get this to you before -rc7, but what with all the travel
plans..
Two fixes for bugs that are in both raid1 and raid10. Both related to
bad-block-lists and at least one needs to be back ported to 3.1.
Also a revision for the "new" layout in raid10. This "new" code
(which aims to improve robustness) actually reduces robustness in some
cases. It probably isn't in use at all as not public user-space code
makes use of these new layouts. However just in case someone has
their own code, it would be good to get the WARNing out for them
sooner"
* tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md:
md/raid10: fix the 'new' raid10 layout to work correctly.
md/raid10: don't clear bitmap bit when bad-block-list write fails.
md/raid1: don't clear bitmap bit when bad-block-list write fails.
md/raid10: submit_bio_wait() returns 0 on success
md/raid1: submit_bio_wait() returns 0 on success
This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recoveried.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees stripe which starts to be flushed raid disks has
completed data/parity and has correct checksum. To recovery a stripe, we
just copy its data/parity to corresponding raid disks.
The trick thing is superblock update after recovery. we can't let
superblock point to last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock
points to meta 1, we write a new valid meta 2n. If crash happens again,
new recovery will start from meta 1. Since meta 2n is valid, recovery
will think meta 3 is valid, which is wrong. The solution is we create a
new meta in meta2 with its seq == meta 1's seq + 10 and let superblock
points to meta2. recovery will not think meta 3 is a valid meta,
because its seq is wrong
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Next several patches use some raid5 functions, rename them with raid5
prefix and export out.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Add the following two macros for special roles: spare and faulty
MD_DISK_ROLE_SPARE 0xffff
MD_DISK_ROLE_FAULTY 0xfffe
Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role,
and minimal value of special role.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.
This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.
In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.
mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>