4937 Commits

Author SHA1 Message Date
NeilBrown
c230e7e535 md/raid1: simplify the splitting of requests.
raid1 currently splits requests in two different ways for
two different reasons.

First, bio_split() is used to ensure the bio fits within a
resync accounting region.
Second, multiple r1bios are allocated for each bio to handle
the possiblity of known bad blocks on some devices.

This can be simplified to just use bio_split() once, and not
use multiple r1bios.
We delay the split until we know a maximum bio size that can
be handled with a single r1bio, and then split the bio and
queue the remainder for later handling.

This avoids all loops inside raid1.c request handling.  Just
a single read, or a single set of writes, is submitted to
lower-level devices for each bio that comes from
generic_make_request().

When the bio needs to be split, generic_make_request() will
do the necessary looping and call md_make_request() multiple
times.

raid1_make_request() no longer queues request for raid1 to handle,
so we can remove that branch from the 'if'.

This patch also creates a new private bio_set
(conf->bio_split) for splitting bios.  Using fs_bio_set
is wrong, as it is meant to be used by filesystems, not
block devices.  Using it inside md can lead to deadlocks
under high memory pressure.

Delete unused variable in raid1_write_request() (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:07:27 -07:00
Artur Paszkiewicz
ae1713e296 raid5-ppl: partial parity calculation optimization
In case of read-modify-write, partial partity is the same as the result
of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
sh->ppl_page instead of calculating it again.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 12:01:37 -07:00
Artur Paszkiewicz
845b9e229f raid5-ppl: use resize_stripes() when enabling or disabling ppl
Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
or free sh->ppl_page at runtime for all stripes in the stripe cache.
raid5_reset_stripe_cache() required suspending the mddev and could
deadlock because of GFP_KERNEL allocations.

Move the 'newsize' check to check_reshape() to allow reallocating the
stripes with the same number of disks. Allocate sh->ppl_page in
alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
a parameter to alloc_stripe() because it is needed to check whether to
allocate ppl_page. Add free_stripe() and use it to free stripes rather
than directly call kmem_cache_free(). Also free sh->ppl_page in
free_stripe().

Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
setting it in advance and add another parameter to log_init() to allow
calling ppl_init_log() without the bit set. Don't try to calculate
partial parity or add a stripe to log if it does not have ppl_page set.

Enabling ppl can now be performed without suspending the mddev, because
the log won't be used until new stripes are allocated with ppl_page.
Calling mddev_suspend/resume is still necessary when disabling ppl,
because we want all stripes to finish before stopping the log, but
resize_stripes() can be called after mddev_resume() when ppl is no
longer active.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 12:00:49 -07:00
Artur Paszkiewicz
94568f64af raid5-ppl: move no_mem_stripes to struct ppl_conf
Use a single no_mem_stripes list instead of per member device lists for
handling stripes that need retrying in case of failed io_unit
allocation. Because io_units are allocated from a memory pool shared
between all member disks, the no_mem_stripes list should be checked when
an io_unit for any member is freed. This fixes a deadlock that could
happen if there are stripes in more than one no_mem_stripes list.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 12:00:27 -07:00
NeilBrown
0c9d5b127f md/raid1: avoid reusing a resync bio after error handling.
fix_sync_read_error() modifies a bio on a newly faulty
device by setting bi_end_io to end_sync_write.
This ensure that put_buf() will still call rdev_dec_pending()
as required, but makes sure that subsequent code in
fix_sync_read_error() doesn't try to read from the device.

Unfortunately this interacts badly with sync_request_write()
which assumes that any bio with bi_end_io set to non-NULL
other than end_sync_read is safe to write to.

As the device is now faulty it doesn't make sense to write.
As the bio was recently used for a read, it is "dirty"
and not suitable for immediate submission.
In particular, ->bi_next might be non-NULL, which will cause
generic_make_request() to complain.

Break this interaction by refusing to write to devices
which are marked as Faulty.

Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com>
Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Cc: stable@vger.kernel.org (v4.10+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 11:05:26 -07:00
Zhilong Liu
b670883bb9 md.c:didn't unlock the mddev before return EINVAL in array_size_store
md.c: it needs to release the mddev lock before
the array_size_store() returns.

Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported")

Signed-off-by: Zhilong Liu <zlliu@suse.com>
Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 10:50:24 -07:00
NeilBrown
065e519e71 md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop
if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
needs to be cleared in md_ioctl after any call to md_set_readonly()
or do_md_stop().

Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
Cc: stable@vger.kernel.org (v4.9+)
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 10:47:50 -07:00
Guoqing Jiang
6f287ca604 md/raid10: reset the 'first' at the end of loop
We need to set "first = 0' at the end of rdev_for_each
loop, so we can get the array's min_offset_diff correctly
otherwise min_offset_diff just means the last rdev's
offset diff.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 10:41:50 -07:00
NeilBrown
7471fb77ce md/raid6: Fix anomily when recovering a single device in RAID6.
When recoverying a single missing/failed device in a RAID6,
those stripes where the Q block is on the missing device are
handled a bit differently.  In these cases it is easy to
check that the P block is correct, so we do.  This results
in the P block be destroy.  Consequently the P block needs
to be read a second time in order to compute Q.  This causes
lots of seeks and hurts performance.

It shouldn't be necessary to re-read P as it can be computed
from the DATA.  But we only compute blocks on missing
devices, since c337869d9501 ("md: do not compute parity
unless it is on a failed drive").

So relax the change made in that commit to allow computing
of the P block in a RAID6 which it is the only missing that
block.

This makes RAID6 recovery run much faster as the disk just
"before" the recovering device is no longer seeking
back-and-forth.

Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 10:35:27 -07:00
Dennis Yang
583da48e38 md: update slab_cache before releasing new stripes when stripes resizing
When growing raid5 device on machine with small memory, there is chance that
mdadm will be killed and the following bug report can be observed. The same
bug could also be reproduced in linux-4.10.6.

[57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
[57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
[57600.114678] Oops: 0002 [#1] SMP
[57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
[57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
[57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
[57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
[57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
[57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
[57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
[57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
[57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
[57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
[57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
[57600.274942] Stack:
[57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
[57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
[57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
[57600.299254] Call Trace:
[57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
[57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
[57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
[57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
[57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
[57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
[57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
[57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
[57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
[57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
[57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
[57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
[57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
[57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
[57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
[57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
[57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
[57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
[57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
[57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
[57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
[57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.441208]  RSP <ffff880043073810>
[57600.444690] CR2: 0000000000000000
[57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---

The problem is that resize_stripes() releases new stripe_heads before assigning new
slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
after resize_stripes() starting releasing new stripes but right before new slab cache
being assigned, it is possible that these new stripe_heads will be freed with the old
slab_cache which was already been destoryed and that triggers this bug.

Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1+)
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10 09:27:12 -07:00
Linus Torvalds
78d91a75b4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's a pull request for 4.11-rc, fixing a set of issues mostly
  centered around the new scheduling framework. These have been brewing
  for a while, but split up into what we absolutely need in 4.11, and
  what we can defer until 4.12. These are well tested, on both single
  queue and multiqueue setups, and with and without shared tags. They
  fix several hangs that have happened in testing.

  This is obviously larger than I would have preferred at this point in
  time, but I don't think we can shave much off this and still get the
  desired results.

  In detail, this pull request contains:

   - a set of five fixes for NVMe, mostly from Christoph and one from
     Roland.

   - a series from Bart, fixing issues with dm-mq and SCSI shared tags
     and scheduling. Note that one of those patches commit messages may
     read like an optimization, but it is in fact an important fix for
     queue restarts in particular.

   - a series from Omar, most importantly fixing a hang with multiple
     hardware queues when we fail to get a driver tag. Another important
     fix in there is for resizing hardware queues, which nbd does when
     handling multiple sockets for one connection.

   - fixing an imbalance in putting the ctx for hctx request allocations
     from Minchan"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: Restart a single queue if tag sets are shared
  dm rq: Avoid that request processing stalls sporadically
  scsi: Avoid that SCSI queues get stuck
  blk-mq: Introduce blk_mq_delay_run_hw_queue()
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: fix crash in switch error path
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: refactor scheduler initialization
  blk-mq: use the right hctx when getting a driver tag fails
  nvmet: fix byte swap in nvmet_parse_io_cmd
  nvmet: fix byte swap in nvmet_execute_write_zeroes
  nvmet: add missing byte swap in nvmet_get_smart_log
  nvme: add missing byte swap in nvme_setup_discard
  nvme: Correct NVMF enum values to match NVMe-oF rev 1.0
  block: do not put mq context in blk_mq_alloc_request_hctx
2017-04-08 11:56:58 -07:00
Christoph Hellwig
48920ff2a5 block: remove the discard_zeroes_data flag
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
615ec946ab dm kcopyd: switch to use REQ_OP_WRITE_ZEROES
It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
ac62d6208a dm: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
0f5d690f7b dm io: discards don't take a payload
Fix up do_region to not allocate a bio_vec for discards.  We've
got rid of the discard payload allocated by the caller years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
3deff1a70d md: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Jens Axboe
65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Bart Van Assche
6077c2d706 dm rq: Avoid that request processing stalls sporadically
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:27:10 -06:00
Linus Torvalds
81d4bab4ce - Two stable@ fixes for the verity target's FEC support
- A stable@ fix for raid target's raid1 support (when no bitmap is used)
 
 - A 4.11 cache metadata v2 format fix to properly test blocks are clean
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJY56smAAoJEMUj8QotnQNa3xYH/39l25eGzam0cnITa31cX9uu
 lb+oWnqbgvbd65HZr2QPu9RO8LQMK9wxw40wapyYTEnkDfgeW+hmwYo3BUZ0IpdT
 Ry39KGCGaxk3L3cATSgtZT18AsWRHmKqlHLf6y98RdeFLVb3lyUFllkLF9r3M2ep
 1Ga2MiMJYffaiTsSKxwZQG3XG7mq9MNfRnCehGAQwjGgWL3EsYHNsq+Hosn/tdtZ
 2D7BvAMr2X+3xEUVevqL2dFmJ1D2tbJjtedeAKVOccErV/BofwWPUvTOFX8202+Y
 CUC9pW+hDQqpCm15Pr4N6oU4TeC4mHMwGK0SLWmoXkl3VDPbUUO3qC5AwKxsepA=
 =cWkE
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - two stable fixes for the verity target's FEC support

 - a stable fix for raid target's raid1 support (when no bitmap is used)

 - a 4.11 cache metadata v2 format fix to properly test blocks are clean

* tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm verity fec: fix bufio leaks
  dm raid: fix NULL pointer dereference for raid1 without bitmap
  dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty
  dm verity fec: limit error correction recursion
2017-04-07 10:47:20 -07:00
NeilBrown
fbbaf700e7 block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 09:40:52 -06:00
Sami Tolvanen
86e3e83b44 dm verity fec: fix bufio leaks
Buffers read through dm_bufio_read() were not released in all code paths.

Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 15:44:25 -04:00
Joe Thornber
cc7e394024 dm cache policy smq: make the cleaner policy write-back more aggressively
By ignoring the sentinels the cleaner policy is able to write-back dirty
cache data much faster.  There is no reason to respect the sentinels,
which denote that a block was changed recently, when using the cleaner
policy given that the cleaner is tasked with writing back all dirty
data.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:41:05 -04:00
Joe Thornber
449b668ce0 dm cache: set/clear the cache core's dirty_bitset when loading mappings
When loading metadata make sure to set/clear the dirty bits in the cache
core's dirty_bitset as well as the policy.

Otherwise the cache core is unaware that any blocks were dirty when the
cache was last shutdown.  A very serious side-effect being that the
cleaner policy would therefore never be tasked with writing back dirty
data from a cache that was in writeback mode (e.g. when switching from
smq policy to cleaner policy when decommissioning a writeback cache).

This fixes a serious data corruption bug associated with writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:33:44 -04:00
Dmitry Bilunov
7a0c5c5b83 dm raid: fix NULL pointer dereference for raid1 without bitmap
Commit 4257e08 ("dm raid: support to change bitmap region size")
introduced a bitmap resize call during preresume phase. User can create
a DM device with "raid" target configured as raid1 with no metadata
devices to hold superblock/bitmap info. It can be achieved using the
following sequence:

  truncate -s 32M /dev/shm/raid-test
  LOOP=$(losetup --show -f /dev/shm/raid-test)
  dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
  dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
  dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"

This results in the following crash:

[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0
…
[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS:  00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291]  ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309]  ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329]  ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346]  raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361]  dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378]  dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391]  dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405]  ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419]  ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447]  do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459]  ? ____fput+0x9/0x10
[ 4029.115470]  ? task_work_run+0x79/0xa0
[ 4029.115481]  SyS_ioctl+0x3c/0x70
[ 4029.115493]  entry_SYSCALL_64_fastpath+0x13/0x94

The raid_preresume() function incorrectly assumes that the raid_set has
a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set.  But
RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap()
even if there is no bitmap present (and bitmap_load() happily returns 0
even if a bitmap isn't present).  So the only way forward in the
near-term is to check if the bitmap is present by seeing if
mddev->bitmap is not NULL after bitmap_load() has been called.

By doing so the above NULL pointer is avoided.

Fixes: 4257e08 ("dm raid: support to change bitmap region size")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:05:54 -04:00
Eric Biggers
f363b089be blk-mq: constify struct blk_mq_ops
Constify all instances of blk_mq_ops, as they are never modified.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31 08:28:58 -06:00
Mikulas Patocka
7b81ef8b14 dm raid: select the Kconfig option CONFIG_MD_RAID0
Since the commit 0cf4503174c1 ("dm raid: add support for the MD RAID0
personality"), the dm-raid subsystem can activate a RAID-0 array.
Therefore, add MD_RAID0 to the dependencies of DM_RAID, so that MD_RAID0
will be selected when DM_RAID is selected.

Fixes: 0cf4503174c1 ("dm raid: add support for the MD RAID0 personality")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-30 11:17:08 -04:00
Ming Lei
8fc04e6ea0 md: raid1: kill warning on powerpc_pseries
This patch kills the warning reported on powerpc_pseries,
and actually we don't need the initialization.

	After merging the md tree, today's linux-next build (powerpc
	pseries_le_defconfig) produced this warning:

	drivers/md/raid1.c: In function 'raid1d':
	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
	     if (memcmp(page_address(ppages[j]),
	         ^
	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
	   int page_len[RESYNC_PAGES];
       ^

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-28 08:49:52 -07:00
SeongJae Park
4f6cce3910 Fix dead URLs to ftp.kernel.org
URLs to ftp.kernel.org are still exist though the service is closed [0].
This commit fixes the URLs to use www.kernel.org instead.

[0] https://www.kernel.org/shutting-down-ftp-services.html

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-28 16:16:52 +02:00
Song Liu
0bb0c10500 md/raid5: use consistency_policy to remove journal feature
When journal device of an array fails, the array is forced into read-only
mode. To make the array normal without adding another journal device, we
need to remove journal _feature_ from the array.

This patch allows remove journal _feature_ from an array, For journal
existing journal should be either missing or faulty.

To remove journal feature, it is necessary to remove the journal device
first:

  mdadm --fail /dev/md0 /dev/sdb
  mdadm: set /dev/sdb faulty in /dev/md0
  mdadm --remove /dev/md0 /dev/sdb
  mdadm: hot removed /dev/sdb from /dev/md0

Then the journal feature can be removed by echoing into the sysfs file:

 cat /sys/block/md0/md/consistency_policy
 journal

 echo resync > /sys/block/md0/md/consistency_policy
 cat /sys/block/md0/md/consistency_policy
 resync

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-27 12:02:33 -07:00
Heinz Mauelshagen
6e53636fe8 dm raid: add raid4/5/6 journal write-back support via journal_mode option
Commit 63c32ed4afc ("dm raid: add raid4/5/6 journaling support") added
journal support to close the raid4/5/6 "write hole" -- in terms of
writethrough caching.

Introduce a "journal_mode" feature and use the new
r5c_journal_mode_set() API to add support for switching the journal
device's cache mode between write-through (the current default) and
write-back.

NOTE: If the journal device is not layered on resilent storage and it
fails, write-through mode will cause the "write hole" to reoccur.  But
if the journal fails while in write-back mode it will cause data loss
for any dirty cache entries unless resilent storage is used for the
journal.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 12:08:07 -04:00
Heinz Mauelshagen
4464e36e06 dm raid: fix table line argument order in status
Commit 3a1c1ef2f ("dm raid: enhance status interface and fixup
takeover/raid0") added new table line arguments and introduced an
ordering flaw.  The sequence of the raid10_copies and raid10_format
raid parameters got reversed which causes lvm2 userspace to fail by
falsely assuming a changed table line.

Sequence those 2 parameters as before so that old lvm2 can function
properly with new kernels by adjusting the table line output as
documented in Documentation/device-mapper/dm-raid.txt.

Also, add missing version 1.10.1 highlight to the documention.

Fixes: 3a1c1ef2f ("dm raid: enhance status interface and fixup takeover/raid0")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 11:45:26 -04:00
Heinz Mauelshagen
78e470c26f md: add raid4/5/6 journal mode switching API
Commit 2ded370373a4 ("md/r5cache: State machine for raid5-cache write
back mode") added support for "write-back" caching on the raid journal
device.

In order to allow the dm-raid target to switch between the available
"write-through" and "write-back" modes, provide a new
r5c_journal_mode_set() API.

Use the new API in existing r5c_journal_mode_store()

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Shaohua Li <shli@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 11:13:47 -04:00
Jason Yan
1ad45a9bc4 md/raid5-cache: fix payload endianness problem in raid5-cache
The payload->header.type and payload->size are little-endian, so just
convert them to the right byte order.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: <stable@vger.kernel.org> #v4.10+
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-25 09:38:22 -07:00
Shaohua Li
41743c1f04 md/raid1: skip data copy for behind io for discard request
discard request doesn't have data attached, so it's meaningless to
allocate memory and copy from original bio for behind IO. And the copy
is bogus because bio_copy_data_partial can't handle discard request.

We don't support writesame/writezeros request so far.

Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-25 09:38:06 -07:00
Mikulas Patocka
ff3af92b44 dm crypt: use shifts instead of sector_div
sector_div is very slow, so we introduce a variable sector_shift and
use shift instead of sector_div.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:24 -04:00
Mikulas Patocka
c2bcb2b702 dm integrity: add recovery mode
In recovery mode, we don't:
- replay the journal
- check checksums
- allow writes to the device

This mode can be used as a last resort for data recovery.  The
motivation for recovery mode is that when there is a single error in the
journal, the user should not lose access to the whole device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:23 -04:00
Mike Snitzer
1aa0efd421 dm integrity: factor out create_journal() from dm_integrity_ctr()
Preparation for next commit that makes call to create_journal()
optional.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:22 -04:00
Milan Broz
8f0009a225 dm crypt: optionally support larger encryption sector size
Add  optional "sector_size"  parameter that specifies encryption sector
size (atomic unit of block device encryption).

Parameter can be in range 512 - 4096 bytes and must be power of two.
For compatibility reasons, the maximal IO must fit into the page limit,
so the limit is set to the minimal page size possible (4096 bytes).

NOTE: this device cannot yet be handled by cryptsetup if this parameter
is set.

IV for the sector is calculated from the 512 bytes sector offset unless
the iv_large_sectors option is used.

Test script using dmsetup:

  DEV="/dev/sdb"
  DEV_SIZE=$(blockdev --getsz $DEV)
  KEY="9c1185a5c5e9fc54612808977ee8f548b2258d31ddadef707ba62c166051b9e3cd0294c27515f2bccee924e8823ca6e124b8fc3167ed478bca702babe4e130ac"
  BLOCK_SIZE=4096

  # dmsetup create test_crypt --table "0 $DEV_SIZE crypt aes-xts-plain64 $KEY 0 $DEV 0 1 sector_size:$BLOCK_SIZE"
  # dmsetup table --showkeys test_crypt

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:21 -04:00
Milan Broz
33d2f09fcb dm crypt: introduce new format of cipher with "capi:" prefix
For the new authenticated encryption we have to support generic composed
modes (combination of encryption algorithm and authenticator) because
this is how the kernel crypto API accesses such algorithms.

To simplify the interface, we accept an algorithm directly in crypto API
format.  The new format is recognised by the "capi:" prefix.  The
dmcrypt internal IV specification is the same as for the old format.

The crypto API cipher specifications format is:
     capi:cipher_api_spec-ivmode[:ivopts]
Examples:
     capi:cbc(aes)-essiv:sha256 (equivalent to old aes-cbc-essiv:sha256)
     capi:xts(aes)-plain64      (equivalent to old aes-xts-plain64)
Examples of authenticated modes:
     capi:gcm(aes)-random
     capi:authenc(hmac(sha256),xts(aes))-random
     capi:rfc7539(chacha20,poly1305)-random

Authenticated modes can only be configured using the new cipher format.
Note that this format allows user to specify arbitrary combinations that
can be insecure. (Policy decision is done in cryptsetup userspace.)

Authenticated encryption algorithms can be of two types, either native
modes (like GCM) that performs both encryption and authentication
internally, or composed modes where user can compose AEAD with separate
specification of encryption algorithm and authenticator.

For composed mode with HMAC (length-preserving encryption mode like an
XTS and HMAC as an authenticator) we have to calculate HMAC digest size
(the separate authentication key is the same size as the HMAC digest).
Introduce crypt_ctr_auth_cipher() to parse the crypto API string to get
HMAC algorithm and retrieve digest size from it.

Also, for HMAC composed mode we need to parse the crypto API string to
get the cipher mode nested in the specification.  For native AEAD mode
(like GCM), we can use crypto_tfm_alg_name() API to get the cipher
specification.

Because the HMAC composed mode is not processed the same as the native
AEAD mode, the CRYPT_MODE_INTEGRITY_HMAC flag is no longer needed and
"hmac" specification for the table integrity argument is removed.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:20 -04:00
Milan Broz
e889f97a3e dm crypt: factor IV constructor out to separate function
No functional change.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:19 -04:00
Milan Broz
ef43aa3806 dm crypt: add cryptographic data integrity protection (authenticated encryption)
Allow the use of per-sector metadata, provided by the dm-integrity
module, for integrity protection and persistently stored per-sector
Initialization Vector (IV).  The underlying device must support the
"DM-DIF-EXT-TAG" dm-integrity profile.

The per-bio integrity metadata is allocated by dm-crypt for every bio.

Example of low-level mapping table for various types of use:
 DEV=/dev/sdb
 SIZE=417792

 # Additional HMAC with CBC-ESSIV, key is concatenated encryption key + HMAC key
 SIZE_INT=389952
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 32 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-cbc-essiv:sha256 \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
 0 /dev/mapper/x 0 1 integrity:32:hmac(sha256)"

 # AEAD (Authenticated Encryption with Additional Data) - GCM with random IVs
 # GCM in kernel uses 96bits IV and we store 128bits auth tag (so 28 bytes metadata space)
 SIZE_INT=393024
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 28 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-gcm-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 0 /dev/mapper/x 0 1 integrity:28:aead"

 # Random IV only for XTS mode (no integrity protection but provides atomic random sector change)
 SIZE_INT=401272
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 16 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 0 /dev/mapper/x 0 1 integrity:16:none"

 # Random IV with XTS + HMAC integrity protection
 SIZE_INT=377656
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 48 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
 0 /dev/mapper/x 0 1 integrity:48:hmac(sha256)"

Both AEAD and HMAC protection authenticates not only data but also
sector metadata.

HMAC protection is implemented through autenc wrapper (so it is
processed the same way as an authenticated mode).

In HMAC mode there are two keys (concatenated in dm-crypt mapping
table).  First is the encryption key and the second is the key for
authentication (HMAC).  (It is userspace decision if these keys are
independent or somehow derived.)

The sector request for AEAD/HMAC authenticated encryption looks like this:
 |----- AAD -------|------ DATA -------|-- AUTH TAG --|
 | (authenticated) | (auth+encryption) |              |
 | sector_LE |  IV |  sector in/out    |  tag in/out  |

For writes, the integrity fields are calculated during AEAD encryption
of every sector and stored in bio integrity fields and sent to
underlying dm-integrity target for storage.

For reads, the integrity metadata is verified during AEAD decryption of
every sector (they are filled in by dm-integrity, but the integrity
fields are pre-allocated in dm-crypt).

There is also an experimental support in cryptsetup utility for more
friendly configuration (part of LUKS2 format).

Because the integrity fields are not valid on initial creation, the
device must be "formatted".  This can be done by direct-io writes to the
device (e.g. dd in direct-io mode).  For now, there is available trivial
tool to do this, see: https://github.com/mbroz/dm_int_tools

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Vashek Matyas <matyas@fi.muni.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:49:41 -04:00
Mikulas Patocka
7eada909bf dm: add integrity target
The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information.

A general problem with storing integrity tags with every sector is that
writing the sector and the integrity tag must be atomic - i.e. in case of
crash, either both sector and integrity tag or none of them is written.

To guarantee write atomicity the dm-integrity target uses a journal. It
writes sector data and integrity tags into a journal, commits the journal
and then copies the data and integrity tags to their respective location.

The dm-integrity target can be used with the dm-crypt target - in this
situation the dm-crypt target creates the integrity data and passes them
to the dm-integrity target via bio_integrity_payload attached to the bio.
In this mode, the dm-crypt and dm-integrity targets provide authenticated
disk encryption - if the attacker modifies the encrypted device, an I/O
error is returned instead of random data.

The dm-integrity target can also be used as a standalone target, in this
mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:49:07 -04:00
Ming Lei
2d06e3b714 md: raid10: avoid direct access to bvec table in handle_reshape_read_error
All reshape I/O share pages from 1st copy device, so just use that pages
for avoiding direct access to bvec table in handle_reshape_read_error.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
cdb76be315 md: raid10: retrieve page from preallocated resync page array
Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
f025061836 md: raid10: don't use bio's vec table to manage resync pages
Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * copies
bytes per r10_bio, and it is fine because the inflight r10_bio for
resync shouldn't be much, as pointed by Shaohua.

Also bio_reset() in raid10_sync_request() and reshape_request()
are removed because all bios are freshly new now in these functions
and not necessary to reset any more.

This patch can be thought as cleanup too.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
81fa152008 md: raid10: refactor code of read reshape's .bi_end_io
reshape read request is a bit special and requires one extra
bio which isn't allocated from r10buf_pool.

Refactor the .bi_end_io for read reshape, so that we can use
raid10's resync page mangement approach easily in the following
patches.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
841c1316c7 md: raid1: improve write behind
This patch improve handling of write behind in the following ways:

- introduce behind master bio to hold all write behind pages
- fast clone bios from behind master bio
- avoid to change bvec table directly
- use bio_copy_data() and make code more clean

Suggested-by: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
d8c84c4f8b md: raid1: move 'offset' out of loop
The 'offset' local variable can't be changed inside the loop, so
move it out.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
60928a91b0 md: raid1: use bio helper in process_checks()
Avoid to direct access to bvec table.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
44cf0f4dc7 md: raid1: retrieve page from pre-allocated resync page array
Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00