5329 Commits

Author SHA1 Message Date
Michael Lyle
b1092c9af9 bcache: allow quick writeback when backing idle
If the control system would wait for at least half a second, and there's
been no reqs hitting the backing disk for awhile: use an alternate mode
where we have at most one contiguous set of writebacks in flight at a
time. (But don't otherwise delay).  If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight.  But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Michael Lyle
6e6ccc67b9 bcache: writeback: properly order backing device IO
Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
539d39eb27 bcache: fix wrong return value in bch_debug_init()
in bch_debug_init(), ret is always 0, and the return value is useless,
change it to return 0 if be success after calling debugfs_create_dir(),
else return a non-zero value.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
4eca1cb28d bcache: segregate flash only volume write streams
In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Vasyl Gomonovych
9d13411784 bcache: Use PTR_ERR_OR_ZERO()
Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
8d29c4426b bcache: stop writeback thread after detaching
Currently, when a cached device detaching from cache, writeback thread is
not stopped, and writeback_rate_update work is not canceled. For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops writeback thread and cancels writeback_rate_update work
when cached device detaching from cache.

Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Rui Hua
b221fc130c bcache: ret IOERR when read meets metadata error
The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0.  In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:

1. Because the btree is not totally cached in memery, we maybe get error
   when read btree node from cache device (see bch_btree_node_get()), the
   likely errno is -EIO, -ENOMEM

2. When read miss happens, bch_btree_insert_check_key() will be called to
   insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
   doing preparatory work before insert the missed data to cache device),
   a failure can also happen in this situation, the likely errno is
   -ENOMEM

bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean.  And
after that happened, s->iop.status is still its initially value(0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Mike Snitzer
0001ec565d dm mpath: factor out SCSI vs NVMe path selection
Trying to do both SCSI and NVMe bio-based handling with branching in the
same common code has proven too tedious on a code maintenance level.  In
addition it slightly hurts IO performance.

Fix this by factoring out __map_bio() and __map_bio_nvme().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 11:23:28 -05:00
Mike Snitzer
848b8aefd4 dm mpath: optimize NVMe bio-based support
All code that deals with pg_init is not used with bio-based NVMe mode.
This includes skipping initialization of pg_init related variables.

Also, pg_init related members on 'struct multipath' have been grouped
together.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 11:23:24 -05:00
Ming Lei
92681eca61 dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()
The bio is always freed after running crypt_free_buffer_pages(), so it
isn't necessary to clear bv->bv_page.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc:dm-devel@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
25d8be77e1 block: move bio_alloc_pages() to bcache
bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.

Also rename it to bch_bio_allo_pages() since it is bcache only.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
c2421edf5f bcache: comment on direct access to bvec table
All direct access to bvec table are safe even after multipage bvec is
supported.

Cc: linux-bcache@vger.kernel.org
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
8f50e35815 dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE
For BIO based DM, some targets aren't ready for dealing with bigger
incoming bio than 1Mbyte, such as crypt target.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc:dm-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Mike Snitzer
cd02538445 dm mpath: implement NVMe bio-based support
This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH
to not be set.  In the future hopefully NVMe multipath and DM multipath
can co-exist more seemlessly.  But as is, if CONFIG_NVME_MULTIPATH=Y
then all the individal NVMe paths will remain hidden to upper layers and
as such DM multipath will not be able to manage them.

Though NVMe's native multipathing doesn't multipath namespaces across
subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y
and also use DM multipath to multipath across subsystems.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-04 19:00:19 -05:00
Mike Snitzer
1836df0891 dm mpath: move dm_bio_restore out of endio method
Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-03 09:27:40 -05:00
Song Liu
92e6245dea md/r5cache: print more info of log recovery
Log recovery is critical for raid5 journal/cache. Printing information
about each recovery by default will help the system admin monitor the
status of the array.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-20 08:39:26 -08:00
Mike Snitzer
d07a241d4f dm mpath: optimize retrieval of bio_details from per-bio-data
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:13 -05:00
Mike Snitzer
d0442f8039 dm mpath: remove unnecessary memset() calls for per-io-data
All underlying members are initialized directly so the memset() calls
are not needed.  Also, initialize mpio->nr_bytes from the start since it
never changes.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:12 -05:00
Mike Snitzer
63f6e6fd05 dm mpath: remove unused param from multipath_init_per_bio_data()
'struct dm_bio_details *' isn't ever needed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:12 -05:00
Mike Snitzer
978e51ba38 dm: optimize bio-based NVMe IO submission
Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request().  In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time.  There is no splitting needed.  Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:11 -05:00
Mike Snitzer
22c11858e8 dm: introduce DM_TYPE_NVME_BIO_BASED
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions.  Also,
the table has a single immutable target that doesn't require DM core to
split bios.

This will enable adding NVMe optimizations to bio-based DM.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:10 -05:00
Mike Snitzer
f3986374f9 dm: simplify start of block stats accounting for bio-based
No apparent need to generic_start_io_acct() until before the IO is ready
for submission.  start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.

Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-17 12:05:32 -05:00
Mike Snitzer
bc02cdbe53 dm: remove redundant mapped_device member from clone_info structure
'struct dm_io' already has the same pointer.  So update all accesses
from ci->md to ci->io->md.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:15 -05:00
Mike Snitzer
dde1e1ec4c dm: remove now unused bio-based io_pool and _io_cache
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:14 -05:00
Mike Snitzer
64f52b0e31 dm: improve performance by moving dm_io structure to per-bio-data
Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from.  As such, it saves an extra mempool allocation for each
original bio that DM core is issued.

This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data.  But
in the end this provides a decent performance improvement for all
bio-based DM devices.

On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:13 -05:00
Mike Snitzer
745dc570b2 dm: rename 'bio' member of dm_io structure to 'orig_bio'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:12 -05:00
Mike Snitzer
2abf1fc91d dm: remove stale comment blocks
These CRUD comments have worn out their welcome.  The code is what it
is, over time it'll hopefully get better.  But these comments serve no
purpose whatsoever.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:11 -05:00
Linus Torvalds
ee1b43ece1 - Fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- Fix various targets to dm_register_target after module __init
   resources created; otherwise racing lvm2 commands could result in a
   NULL pointer during initialization of associated DM kernel module.
 
 - Fix regression in bio-based DM multipath queue_if_no_path handling.
 
 - Fix DM bufio's shrinker to reclaim more than one buffer per scan.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaNBomAAoJEMUj8QotnQNao5IH/0X0Auycfx2O8dkVoRhW1Q3x
 NNt7m6aKhmdUsBYOug9/na5kNKqsRzKyPYSV9bM0Cy5mJzgxQYMeL5Tmu2qwGDOL
 1C/HUhffmZJQ+lK5dS2wQ41Ep+lppm8KYofJ70Ueb+JQ9Uxmkp9GXGud0LrJ0QzR
 9D5i/3jAlZuOnGLQ0+Q0E9wXa8sQdfrAbcPzz+4nG9aqGcz2T5lfbwg1K+Ym0U3r
 0jBAHZWhamJQP1gW1+i0EWWtR68TgaWbHeTjrdvm2pUueAaywJzP9oeK++p3Op+9
 A2JRE3I4ClAkUBjj480UAJW8Egg6zZ1mfOKta/CpqChbVqqjANi9lGSyRlkYMFg=
 =oNN+
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.

 - fix various targets to dm_register_target after module __init
   resources created; otherwise racing lvm2 commands could result in a
   NULL pointer during initialization of associated DM kernel module.

 - fix regression in bio-based DM multipath queue_if_no_path handling.

 - fix DM bufio's shrinker to reclaim more than one buffer per scan.

* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
  dm mpath: fix bio-based multipath queue_if_no_path handling
  dm: fix various targets to dm_register_target after module __init resources created
  dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
2017-12-15 12:53:37 -08:00
Mike Snitzer
ad3793fc39 dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions()
Rather than having DAX support be unique by setting it based on table
type in dm_setup_md_queue().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:33:32 -05:00
Mike Snitzer
3d7f45625a dm: fix __send_changing_extent_only() to send first bio and chain remainder
__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk".  That is: submit first bio up to split boundary and then
split the remainder to further submissions.

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:16:01 -05:00
Mike Snitzer
0776aa0e30 dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs
alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:16:00 -05:00
Mike Snitzer
4a3f54d94d dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure
Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:59 -05:00
Mike Snitzer
318716ddea dm: safely allocate multiple bioset bios
DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.

Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().

Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:58 -05:00
NeilBrown
f31c21e436 dm: remove unused 'num_write_bios' target interface
No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.

Having the possibility of num_write_bios > 1 complicates bio
allocation.  So remove the interface and assume there is only one bio
needed.

If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:58 -05:00
NeilBrown
18a25da843 dm: ensure bio submission follows a depth-first tree walk
A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload_{start,end}.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.  Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset.  This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.

Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

To provide the clone bio when splitting, we use q->bio_split.  This
was previously being freed by bio-based dm to avoid having excess
rescuer threads.  As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:57 -05:00
NeilBrown
c110a4b6e6 dm io: remove BIOSET_NEED_RESCUER flag from bios bioset
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm_io() is called from make_request_fn() context.  The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent.  But
there the code uses a separate thread to avoid problems.

So BIOSET_NEED_RESCUER is not needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:56 -05:00
NeilBrown
80cd175783 dm crypt: remove BIOSET_NEED_RESCUER flag
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.

So BIOSET_NEED_RESCUER is not needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:55 -05:00
NeilBrown
c06b3e5837 dm: fix comment above dm_accept_partial_bio
Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:54 -05:00
Heinz Mauelshagen
552aa679f2 dm raid: use rs_is_raid*()
Cleanup, no functional change.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:47 -05:00
Heinz Mauelshagen
7c29744ecc dm raid: simplify rs_get_progress()
No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 11:59:21 -05:00
Heinz Mauelshagen
dc15b943d4 dm raid: ensure 'a' chars during reshape
During reshape, 'A' chars were reported in status rather than 'a'.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 11:57:36 -05:00
Heinz Mauelshagen
11e4723206 dm raid: stop keeping raid set frozen altogether
In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared.  The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether.  This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 11:52:02 -05:00
Heinz Mauelshagen
53bf5384f9 dm raid: validate current raid sets redundancy
Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).

Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.

Also check for bogus data copies.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 11:50:52 -05:00
NeilBrown
474beb575c md/raid1,raid10: silence warning about wait-within-wait
If you prepare_to_wait() after a previous prepare_to_wait(),
but before calling schedule(), you get warning:

  do not call blocking ops when !TASK_RUNNING; state=2

This is appropriate as it is often a bug.  The event that the
first prepare_to_wait() expects might wake up the schedule following
the second prepare_to_wait(), which could be confusing.

However if both prepare_to_wait()s are part of simple wait_event()
loops, and if the inner one is rarely called, then there is
no problem.  The inner loop is too simple to get confused by
a stray wakeup, and the outer loop won't spin unduly because the
inner doesnt affect it often.

This pattern occurs in both raid1.c and raid10.c in the use of
flush_pending_writes().

The warning can be silenced by setting current->state to TASK_RUNNING.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-11 08:52:34 -08:00
Song Liu
d5d885fd51 md: introduce new personality funciton start()
In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while task that require
the md threads go into start().

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-11 08:52:34 -08:00
Mike Snitzer
b84cf26924 dm raid: bump target version to reflect numerous fixes
Also update Documentation accordingly.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08 10:59:58 -05:00
Heinz Mauelshagen
78a75d10ef dm raid: small cleanup and remove unsed "struct raid_set" member
Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().

Also, remove unused 'bitmap_loaded' member from "struct raid_set".

No functional changes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08 10:59:58 -05:00
Heinz Mauelshagen
4102d9de6d dm raid: fix rs_get_progress() synchronization state/ratio
Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.

Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
  "0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -"
instead of:
  "0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"

Sync action could be non-idle, when the MD thread was done with io.

Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08 10:59:58 -05:00
Heinz Mauelshagen
242ea5ad11 dm raid: avoid passing array_in_sync variable to raid_status() callees
The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array.  Replace it with a
runtime flag.  This will avoid a pattern of having to pass discrete
variables to various functions.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08 10:59:58 -05:00