5394 Commits

Author SHA1 Message Date
Ondrej Kozina
dc94902bde dm crypt: wipe kernel key copy after IV initialization
Loading key via kernel keyring service erases the internal
key copy immediately after we pass it in crypto layer. This is
wrong because IV is initialized later and we use wrong key
for the initialization (instead of real key there's just zeroed
block).

The bug may cause data corruption if key is loaded via kernel keyring
service first and later same crypt device is reactivated using exactly
same key in hexbyte representation, or vice versa. The bug (and fix)
affects only ciphers using following IVs: essiv, lmk and tcw.

Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service")
Cc: stable@vger.kernel.org # 4.10+
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:10:48 -05:00
Mikulas Patocka
717f4b1c52 dm integrity: don't store cipher request on the stack
Some asynchronous cipher implementations may use DMA.  The stack may
be mapped in the vmalloc area that doesn't support DMA.  Therefore,
the cipher request and initialization vector shouldn't be on the
stack.

Fix this by allocating the request and iv with kmalloc.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:08:57 -05:00
Milan Broz
27c7003697 dm crypt: fix crash by adding missing check for auth key size
If dm-crypt uses authenticated mode with separate MAC, there are two
concatenated part of the key structure - key(s) for encryption and
authentication key.

Add a missing check for authenticated key length.  If this key length is
smaller than actually provided key, dm-crypt now properly fails instead
of crashing.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Reported-by: Salah Coronya <salahx@yahoo.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:08:41 -05:00
Joe Thornber
bc68d0a435 dm btree: fix serious bug in btree_split_beneath()
When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistibutes the root nodes entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period; until another low valued key is inserted that will uncover
it.

This is a serious bug, and quite hard to make trigger in normal use.  A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provision-tools project:
  https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
neccessary adjustments.

Cc: stable@vger.kernel.org
Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:07:55 -05:00
Dennis Yang
490ae017f5 dm thin metadata: THIN_MAX_CONCURRENT_LOCKS should be 6
For btree removal, there is a corner case that a single thread
could takes 6 locks which is more than THIN_MAX_CONCURRENT_LOCKS(5)
and leads to deadlock.

A btree removal might eventually call
rebalance_children()->rebalance3() to rebalance entries of three
neighbor child nodes when shadow_spine has already acquired two
write locks. In rebalance3(), it tries to shadow and acquire the
write locks of all three child nodes. However, shadowing a child
node requires acquiring a read lock of the original child node and
a write lock of the new block. Although the read lock will be
released after block shadowing, shadowing the third child node
in rebalance3() could still take the sixth lock.
(2 write locks for shadow_spine +
 2 write locks for the first two child nodes's shadow +
 1 write lock for the last child node's shadow +
 1 read lock for the last child node)

Cc: stable@vger.kernel.org
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:07:54 -05:00
Tomasz Majchrzak
1532d9e87e raid5-ppl: PPL support for disks with write-back cache enabled
In order to provide data consistency with PPL for disks with write-back
cache enabled all data has to be flushed to disks before next PPL
entry. The disks to be flushed are marked in the bitmap. It's modified
under a mutex and it's only read after PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.

With write-back cache disabled next PPL entry is submitted when data write
for current one completes. Data flush defers next log submission so trigger
it when there are no stripes for handling found.

As PPL assures all data is flushed to disk at request completion, just
acknowledge flush request when PPL is enabled.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-01-15 14:29:42 -08:00
Mike Snitzer
c100ec49fd dm: fix incomplete request_queue initialization
DM is no longer prone to having its request_queue be improperly
initialized.

Summary of changes:

- defer DM's blk_register_queue() from add_disk()-time until
  dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().

- dm_setup_md_queue() is updated to fully initialize DM's request_queue
  (_after_ all table loads have occurred and the request_queue's type,
  features and limits are known).

A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in syfs.

In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().

These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/

In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15 08:54:32 -07:00
Keith Busch
a1275677f8 dm mpath: Use blk_path_error
Uses common code for determining if an error should be retried on
alternate path.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 10:52:20 -07:00
Michael Lyle
3609c471a1 bcache: closures: move control bits one bit right
Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 12:18:51 -07:00
Michael Lyle
616486ab52 bcache: fix writeback target calc on large devices
Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.

The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size.  However, there still needs to be change
to make this "fair" over many volumes where some are idle.

Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Coly Li
5138ac6748 bcache: fix misleading error message in bch_count_io_errors()
Bcache only does recoverable I/O for read operations by calling
cached_dev_read_error(). For write opertions there is no I/O recovery for
failed requests.

But in bch_count_io_errors() no matter read or write I/Os, before errors
counter reaches io error limit, pr_err() always prints "IO error on %,
recoverying". For write requests this information is misleading, because
there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" by pr_err() when the bio direction is READ.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Coly Li
2831231d4c bcache: reduce cache_set devices iteration by devices_max_used
Member devices of struct cache_set is used to reference all attached
bcache devices to this cache set. If it is treated as array of pointers,
size of devices[] is indicated by member nr_uuids of struct cache_set.

nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(),
	bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by user space tool "make-bcache", by default it
is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
nr_uuids value is 4096 from the above calculation.

Every time when bcache code iterates bcache devices of a cache set, all
the 4096 pointers are checked even only 1 bcache device is attached to the
cache set, that's a wast of time and unncessary.

This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
all valid bcache devices of a cache set, use c->devices_max_used in
for-loop may reduce a lot of useless checking.

Personally, my motivation of this patch is not for performance, I use it
in bcache debugging, which helps me to narrow down the scape to check
valid bcached devices of a cache set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Zhai Zhaoxuan
b40503ea4f bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()
The function cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start a
closure. Then the function bio_complete() calls generic_end_io_acct() with
(struct search)->orig_bio->bi_disk when the closure has done.
Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is
called with a wrong device queue.

It causes the "inflight" (in struct hd_struct) counter keep increasing
without decreasing.

This patch fix the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.

Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Kent Overstreet
ce439bf78b bcache: mark closure_sync() __sched
[edit by mlyle: include sched/debug.h to get __sched]

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Kent Overstreet
e4bf791937 bcache: Fix, improve efficiency of closure_sync()
Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state.  Changed __set_current_state to
set_current_state per Jens review comment]

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Michael Lyle
b1092c9af9 bcache: allow quick writeback when backing idle
If the control system would wait for at least half a second, and there's
been no reqs hitting the backing disk for awhile: use an alternate mode
where we have at most one contiguous set of writebacks in flight at a
time. (But don't otherwise delay).  If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight.  But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Michael Lyle
6e6ccc67b9 bcache: writeback: properly order backing device IO
Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
539d39eb27 bcache: fix wrong return value in bch_debug_init()
in bch_debug_init(), ret is always 0, and the return value is useless,
change it to return 0 if be success after calling debugfs_create_dir(),
else return a non-zero value.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
4eca1cb28d bcache: segregate flash only volume write streams
In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Vasyl Gomonovych
9d13411784 bcache: Use PTR_ERR_OR_ZERO()
Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Tang Junhui
8d29c4426b bcache: stop writeback thread after detaching
Currently, when a cached device detaching from cache, writeback thread is
not stopped, and writeback_rate_update work is not canceled. For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops writeback thread and cancels writeback_rate_update work
when cached device detaching from cache.

Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Rui Hua
b221fc130c bcache: ret IOERR when read meets metadata error
The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0.  In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:

1. Because the btree is not totally cached in memery, we maybe get error
   when read btree node from cache device (see bch_btree_node_get()), the
   likely errno is -EIO, -ENOMEM

2. When read miss happens, bch_btree_insert_check_key() will be called to
   insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
   doing preparatory work before insert the missed data to cache device),
   a failure can also happen in this situation, the likely errno is
   -ENOMEM

bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean.  And
after that happened, s->iop.status is still its initially value(0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 13:29:00 -07:00
Mike Snitzer
0001ec565d dm mpath: factor out SCSI vs NVMe path selection
Trying to do both SCSI and NVMe bio-based handling with branching in the
same common code has proven too tedious on a code maintenance level.  In
addition it slightly hurts IO performance.

Fix this by factoring out __map_bio() and __map_bio_nvme().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 11:23:28 -05:00
Mike Snitzer
848b8aefd4 dm mpath: optimize NVMe bio-based support
All code that deals with pg_init is not used with bio-based NVMe mode.
This includes skipping initialization of pg_init related variables.

Also, pg_init related members on 'struct multipath' have been grouped
together.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 11:23:24 -05:00
Ming Lei
92681eca61 dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()
The bio is always freed after running crypt_free_buffer_pages(), so it
isn't necessary to clear bv->bv_page.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc:dm-devel@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
25d8be77e1 block: move bio_alloc_pages() to bcache
bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.

Also rename it to bch_bio_allo_pages() since it is bcache only.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
c2421edf5f bcache: comment on direct access to bvec table
All direct access to bvec table are safe even after multipage bvec is
supported.

Cc: linux-bcache@vger.kernel.org
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
8f50e35815 dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE
For BIO based DM, some targets aren't ready for dealing with bigger
incoming bio than 1Mbyte, such as crypt target.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc:dm-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei
263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Mike Snitzer
cd02538445 dm mpath: implement NVMe bio-based support
This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH
to not be set.  In the future hopefully NVMe multipath and DM multipath
can co-exist more seemlessly.  But as is, if CONFIG_NVME_MULTIPATH=Y
then all the individal NVMe paths will remain hidden to upper layers and
as such DM multipath will not be able to manage them.

Though NVMe's native multipathing doesn't multipath namespaces across
subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y
and also use DM multipath to multipath across subsystems.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-04 19:00:19 -05:00
Mike Snitzer
1836df0891 dm mpath: move dm_bio_restore out of endio method
Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-03 09:27:40 -05:00
Song Liu
92e6245dea md/r5cache: print more info of log recovery
Log recovery is critical for raid5 journal/cache. Printing information
about each recovery by default will help the system admin monitor the
status of the array.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-20 08:39:26 -08:00
Mike Snitzer
d07a241d4f dm mpath: optimize retrieval of bio_details from per-bio-data
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:13 -05:00
Mike Snitzer
d0442f8039 dm mpath: remove unnecessary memset() calls for per-io-data
All underlying members are initialized directly so the memset() calls
are not needed.  Also, initialize mpio->nr_bytes from the start since it
never changes.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:12 -05:00
Mike Snitzer
63f6e6fd05 dm mpath: remove unused param from multipath_init_per_bio_data()
'struct dm_bio_details *' isn't ever needed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:12 -05:00
Mike Snitzer
978e51ba38 dm: optimize bio-based NVMe IO submission
Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request().  In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time.  There is no splitting needed.  Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:11 -05:00
Mike Snitzer
22c11858e8 dm: introduce DM_TYPE_NVME_BIO_BASED
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions.  Also,
the table has a single immutable target that doesn't require DM core to
split bios.

This will enable adding NVMe optimizations to bio-based DM.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 10:51:10 -05:00
Mike Snitzer
f3986374f9 dm: simplify start of block stats accounting for bio-based
No apparent need to generic_start_io_acct() until before the IO is ready
for submission.  start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.

Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-17 12:05:32 -05:00
Mike Snitzer
bc02cdbe53 dm: remove redundant mapped_device member from clone_info structure
'struct dm_io' already has the same pointer.  So update all accesses
from ci->md to ci->io->md.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:15 -05:00
Mike Snitzer
dde1e1ec4c dm: remove now unused bio-based io_pool and _io_cache
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:14 -05:00
Mike Snitzer
64f52b0e31 dm: improve performance by moving dm_io structure to per-bio-data
Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from.  As such, it saves an extra mempool allocation for each
original bio that DM core is issued.

This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data.  But
in the end this provides a decent performance improvement for all
bio-based DM devices.

On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:13 -05:00
Mike Snitzer
745dc570b2 dm: rename 'bio' member of dm_io structure to 'orig_bio'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:12 -05:00
Mike Snitzer
2abf1fc91d dm: remove stale comment blocks
These CRUD comments have worn out their welcome.  The code is what it
is, over time it'll hopefully get better.  But these comments serve no
purpose whatsoever.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 20:43:11 -05:00
Linus Torvalds
ee1b43ece1 - Fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- Fix various targets to dm_register_target after module __init
   resources created; otherwise racing lvm2 commands could result in a
   NULL pointer during initialization of associated DM kernel module.
 
 - Fix regression in bio-based DM multipath queue_if_no_path handling.
 
 - Fix DM bufio's shrinker to reclaim more than one buffer per scan.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaNBomAAoJEMUj8QotnQNao5IH/0X0Auycfx2O8dkVoRhW1Q3x
 NNt7m6aKhmdUsBYOug9/na5kNKqsRzKyPYSV9bM0Cy5mJzgxQYMeL5Tmu2qwGDOL
 1C/HUhffmZJQ+lK5dS2wQ41Ep+lppm8KYofJ70Ueb+JQ9Uxmkp9GXGud0LrJ0QzR
 9D5i/3jAlZuOnGLQ0+Q0E9wXa8sQdfrAbcPzz+4nG9aqGcz2T5lfbwg1K+Ym0U3r
 0jBAHZWhamJQP1gW1+i0EWWtR68TgaWbHeTjrdvm2pUueAaywJzP9oeK++p3Op+9
 A2JRE3I4ClAkUBjj480UAJW8Egg6zZ1mfOKta/CpqChbVqqjANi9lGSyRlkYMFg=
 =oNN+
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.

 - fix various targets to dm_register_target after module __init
   resources created; otherwise racing lvm2 commands could result in a
   NULL pointer during initialization of associated DM kernel module.

 - fix regression in bio-based DM multipath queue_if_no_path handling.

 - fix DM bufio's shrinker to reclaim more than one buffer per scan.

* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
  dm mpath: fix bio-based multipath queue_if_no_path handling
  dm: fix various targets to dm_register_target after module __init resources created
  dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
2017-12-15 12:53:37 -08:00
Mike Snitzer
ad3793fc39 dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions()
Rather than having DAX support be unique by setting it based on table
type in dm_setup_md_queue().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:33:32 -05:00
Mike Snitzer
3d7f45625a dm: fix __send_changing_extent_only() to send first bio and chain remainder
__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk".  That is: submit first bio up to split boundary and then
split the remainder to further submissions.

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:16:01 -05:00
Mike Snitzer
0776aa0e30 dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs
alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:16:00 -05:00
Mike Snitzer
4a3f54d94d dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure
Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:59 -05:00
Mike Snitzer
318716ddea dm: safely allocate multiple bioset bios
DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.

Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().

Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:58 -05:00
NeilBrown
f31c21e436 dm: remove unused 'num_write_bios' target interface
No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.

Having the possibility of num_write_bios > 1 complicates bio
allocation.  So remove the interface and assume there is only one bio
needed.

If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 12:15:58 -05:00