linux/drivers/md
Shaohua Li 851c30c9ba raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.

raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.

To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.

My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.

Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.

In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.

The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.

This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.

Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:46:38 +10:00
..
bcache Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block 2013-07-22 19:02:52 -07:00
persistent-data dm persistent metadata: add space map threshold callback 2013-05-10 14:37:20 +01:00
bitmap.c md: replace strict_strto*() with kstrto*() 2013-06-14 08:10:26 +10:00
bitmap.h md/bitmap: record the space available for the bitmap in the superblock. 2012-05-22 13:55:34 +10:00
dm-bio-prison.c dm: add cache target 2013-03-01 22:45:51 +00:00
dm-bio-prison.h dm: add cache target 2013-03-01 22:45:51 +00:00
dm-bio-record.h
dm-bufio.c dm bufio: submit writes outside lock 2013-07-10 23:41:18 +01:00
dm-bufio.h dm bufio: prefetch 2012-03-28 18:41:29 +01:00
dm-cache-block-types.h dm: add cache target 2013-03-01 22:45:51 +00:00
dm-cache-metadata.c dm cache: replace memcpy with struct assignment 2013-05-10 14:37:18 +01:00
dm-cache-metadata.h dm cache: policy ignore hints if generated by different version 2013-03-20 17:21:28 +00:00
dm-cache-policy-cleaner.c dm cache: policy change version from string to integer set 2013-03-20 17:21:27 +00:00
dm-cache-policy-internal.h dm cache: policy change version from string to integer set 2013-03-20 17:21:27 +00:00
dm-cache-policy-mq.c dm cache: avoid conflicting remove_mapping() in mq policy 2013-08-16 15:56:51 -04:00
dm-cache-policy.c dm cache: policy change version from string to integer set 2013-03-20 17:21:27 +00:00
dm-cache-policy.h dm cache policy: fix description of lookup fn 2013-05-10 14:37:17 +01:00
dm-cache-target.c dm cache: fix arm link errors with inline 2013-07-10 23:41:17 +01:00
dm-crypt.c block: Convert some code to bio_for_each_segment_all() 2013-03-23 14:26:30 -07:00
dm-delay.c dm: rename request variables to bios 2013-03-01 22:45:47 +00:00
dm-exception-store.c dm: replace simple_strtoul 2012-07-27 15:07:59 +01:00
dm-exception-store.h
dm-flakey.c dm flakey: correct ctr alloc failure mesg 2013-07-10 23:41:17 +01:00
dm-io.c dm kcopyd: add WRITE SAME support to dm_kcopyd_zero 2012-12-21 20:23:37 +00:00
dm-ioctl.c dm: optimize use SRCU and RCU 2013-07-10 23:41:18 +01:00
dm-kcopyd.c dm kcopyd: introduce configurable throttling 2013-03-01 22:45:49 +00:00
dm-linear.c dm: rename request variables to bios 2013-03-01 22:45:47 +00:00
dm-log-userspace-base.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
dm-log-userspace-transfer.c connector/userns: replace netlink uses of cap_raised() with capable() 2012-05-10 23:21:39 -04:00
dm-log-userspace-transfer.h
dm-log.c dm: use memweight() 2012-07-30 17:25:16 -07:00
dm-mpath.c dm mpath: fix ioctl deadlock when no paths 2013-07-10 23:41:15 +01:00
dm-mpath.h
dm-path-selector.c md: Add module.h to all files using it implicitly 2011-10-31 19:31:18 -04:00
dm-path-selector.h
dm-queue-length.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-raid1.c block: Use bio_sectors() more consistently 2013-03-23 14:15:30 -07:00
dm-raid.c MD: Remember the last sync operation that was performed 2013-06-26 12:38:24 +10:00
dm-region-hash.c dm raid1: fix crash with mirror recovery and discard 2012-07-20 14:25:03 +01:00
dm-round-robin.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-service-time.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-snap-persistent.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-snap-transient.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-snap.c dm snapshot: fix error return code in snapshot_ctr 2013-05-10 14:37:15 +01:00
dm-stripe.c dm stripe: fix regression in stripe_width calculation 2013-05-10 14:37:14 +01:00
dm-switch.c dm: add switch target 2013-07-10 23:41:19 +01:00
dm-sysfs.c
dm-table.c dm: optimize use SRCU and RCU 2013-07-10 23:41:18 +01:00
dm-target.c dm: rename request variables to bios 2013-03-01 22:45:47 +00:00
dm-thin-metadata.c dm thin: generate event when metadata threshold passed 2013-05-10 14:37:21 +01:00
dm-thin-metadata.h dm thin: generate event when metadata threshold passed 2013-05-10 14:37:21 +01:00
dm-thin.c dm thin: fix metadata dev resize detection 2013-05-19 18:57:50 +01:00
dm-uevent.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-uevent.h
dm-verity.c dm verity: use __ffs and __fls 2013-07-10 23:41:17 +01:00
dm-zero.c dm: rename request variables to bios 2013-03-01 22:45:47 +00:00
dm.c dm: optimize reorder structure 2013-07-10 23:41:18 +01:00
dm.h dm: introduce per_bio_data 2012-12-21 20:23:38 +00:00
faulty.c block: Add bio_end_sector() 2013-03-23 14:15:29 -07:00
Kconfig dm: add switch target 2013-07-10 23:41:19 +01:00
linear.c block: Add bio_end_sector() 2013-03-23 14:15:29 -07:00
linear.h md/linear: typedef removal: linear_conf_t -> struct linear_conf 2011-10-11 16:48:54 +11:00
Makefile dm: add switch target 2013-07-10 23:41:19 +01:00
md.c md: avoid deadlock when dirty buffers during md_stop. 2013-08-27 16:45:00 +10:00
md.h md: avoid deadlock when dirty buffers during md_stop. 2013-08-27 16:45:00 +10:00
multipath.c MD: change the parameter of md thread 2012-10-11 13:34:00 +11:00
multipath.h md/multipath: typedef removal: multipath_conf_t -> struct mpconf 2011-10-11 16:48:57 +11:00
raid0.c md: fix buglet in RAID5 -> RAID0 conversion. 2013-06-26 12:38:19 +10:00
raid0.h md: add proper merge_bvec handling to RAID0 and Linear. 2012-03-19 12:46:39 +11:00
raid1.c md/raid1: fix bio handling problems in process_checks() 2013-07-18 14:18:04 +10:00
raid1.h md/raid1: prevent merging too large request 2012-07-31 10:03:53 +10:00
raid5.c raid5: offload stripe handle to workqueue 2013-08-28 16:46:38 +10:00
raid5.h raid5: offload stripe handle to workqueue 2013-08-28 16:46:38 +10:00
raid10.c md/raid10: remove use-after-free bug. 2013-07-25 16:46:53 +10:00
raid10.h MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) 2013-02-26 11:55:30 +11:00