linux/block
Mike Snitzer 6acfe68bac dm: fix excessive dm-mq context switching
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-22 11:04:40 -05:00
..
partitions mac: validate mac_partition is within sector 2015-11-20 08:49:28 -07:00
badblocks.c block, badblocks: introduce devm_init_badblocks 2016-01-09 08:39:04 -08:00
bio-integrity.c blk-integrity: checking for NULL instead of IS_ERR 2015-12-09 08:04:29 -07:00
bio.c bio: return EINTR if copying to user space got interrupted 2016-02-12 08:17:41 -07:00
blk-cgroup.c block: fix module reference leak on put_disk() call for cgroups throttle 2016-02-09 12:33:35 -07:00
blk-core.c dm: fix excessive dm-mq context switching 2016-02-22 11:04:40 -05:00
blk-exec.c block: move PM request support to IDE 2015-05-05 13:40:42 -06:00
blk-flush.c Revert "blk-flush: Queue through IO scheduler when flush not required" 2015-11-25 11:02:02 -07:00
blk-integrity.c block, libnvdimm, nvme: provide a built-in blk_integrity nop profile 2015-10-21 14:43:45 -06:00
blk-ioc.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-lib.c block: re-add discard_granularity and alignment checks 2015-10-28 09:12:58 +09:00
blk-map.c block: Copy a user iovec if it includes gaps 2015-09-11 09:03:50 -06:00
blk-merge.c block: fix bio splitting on max sectors 2016-01-22 20:42:30 -07:00
blk-mq-cpu.c
blk-mq-cpumap.c blk-mq: Avoid memoryless numa node encoded in hctx numa_node 2015-12-03 09:56:27 -07:00
blk-mq-sysfs.c block: add block polling support 2015-11-07 10:40:47 -07:00
blk-mq-tag.c blk-mq: add a flags parameter to blk_mq_alloc_request 2015-12-01 10:53:59 -07:00
blk-mq-tag.h blk-mq: factor out a helper to iterate all tags for a request_queue 2015-10-01 10:10:57 +02:00
blk-mq.c blk-mq: End unstarted requests on dying queue 2016-02-11 13:14:00 -07:00
blk-mq.h blk-mq: add a flags parameter to blk_mq_alloc_request 2015-12-01 10:53:59 -07:00
blk-settings.c block: Initialize max_dev_sectors to 0 2016-02-11 13:10:39 -07:00
blk-softirq.c
blk-sysfs.c blk: fix overflow in queue_discard_max_hw_show 2016-02-17 10:20:42 -07:00
blk-tag.c block: support different tag allocation policy 2015-01-23 14:15:46 -07:00
blk-throttle.c cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() 2015-09-18 11:56:28 -04:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk.h block: defer timeouts to a workqueue 2015-12-22 09:38:16 -07:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c
bsg.c block: Simplify bsg complete all 2015-02-04 09:57:52 -07:00
cfq-iosched.c cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() 2015-09-18 11:56:28 -04:00
cmdline-parser.c
compat_ioctl.c block, bdi: an active gendisk always has a request_queue associated with it 2014-09-08 10:00:35 -06:00
deadline-iosched.c deadline: remove unused struct member 2016-02-01 09:09:55 -07:00
elevator.c block: check bio_mergeable() early before merging 2015-10-21 15:00:54 -06:00
genhd.c Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block 2016-01-19 15:03:34 -08:00
ioctl.c block: revert runtime dax control of the raw block device 2016-01-30 13:35:31 -08:00
ioprio.c pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode 2015-11-06 17:50:42 -08:00
Kconfig block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00
Kconfig.iosched
Makefile Initial roundup of 4.5 merge window patches 2016-01-23 18:45:06 -08:00
noop-iosched.c elevator: use list_{first,prev,next}_entry 2015-11-16 15:21:48 -07:00
partition-generic.c block: use DAX for partition table reads 2016-01-30 13:35:32 -08:00
scsi_ioctl.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
t10-pi.c block: Consolidate static integrity profile properties 2015-10-21 14:42:38 -06:00