2005-04-16 15:20:36 -07:00
/*
 * Copyright (C) 2001, 2002 Sistina Software (UK) Limited.
2009-01-06 03:05:12 +00:00
 * Copyright (C) 2004 - 2008 Red Hat, Inc. All rights reserved.
2005-04-16 15:20:36 -07:00
 *
 * This file is released under the GPL.
 */
2016-05-12 16:28:10 -04:00
# include "dm-core.h"
# include "dm-rq.h"
2007-10-19 22:48:00 +01:00
# include "dm-uevent.h"
2005-04-16 15:20:36 -07:00
# include <linux/init.h>
# include <linux/module.h>
2006-03-27 01:18:20 -08:00
# include <linux/mutex.h>
2005-04-16 15:20:36 -07:00
# include <linux/blkpg.h>
# include <linux/bio.h>
# include <linux/mempool.h>
# include <linux/slab.h>
# include <linux/idr.h>
2006-03-27 01:17:54 -08:00
# include <linux/hdreg.h>
dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed there is a case to crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more background and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
CPU0                                          CPU1
=================================================================
<<INTERRUPT>>
blk_end_request_all(clone_rq)
  blk_update_request(clone_rq)
    bio_endio(clone_bio) == end_clone_bio
      blk_update_request(orig_rq)
        bio_endio(orig_bio)
                                              <<I/O completed>>
                                              dm_blk_close()
                                              dev_remove()
                                                dm_put(md)
                                                  <<Free md>>
  blk_finish_request(clone_rq)
    ....
      dm_end_request(clone_rq)
        free_rq_clone(clone_rq)
        blk_end_request_all(orig_rq)
        rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to the predetermined places that actually delete
the mapped_device in ioctl (process) context, and changes dm_put()
to only decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't a performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.
For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
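The symmetric create/destroy/get/put model described in the commit message above can be illustrated with a small userspace sketch. It is only an analogy under stated assumptions: struct demo_dev and the demo_*() helpers are hypothetical names, C11 atomics stand in for the kernel's atomic_t, nanosleep() stands in for msleep(1), and the rule that the creator's own reference is dropped inside demo_destroy() is a simplification chosen for this sketch, not taken from the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L	/* for nanosleep() */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for struct mapped_device; not the kernel type. */
struct demo_dev {
	atomic_int holders;	/* reference count */
	atomic_bool freeing;	/* set once destruction has started */
};

/* Analogue of dm_create(): in this sketch the creator holds one reference. */
static struct demo_dev *demo_create(void)
{
	struct demo_dev *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	atomic_init(&d->holders, 1);
	atomic_init(&d->freeing, false);
	return d;
}

/* Analogue of dm_get(): no new reference once destruction has started. */
static void demo_get(struct demo_dev *d)
{
	assert(!atomic_load(&d->freeing));
	atomic_fetch_add(&d->holders, 1);
}

/* Analogue of dm_put(): a bare decrement that never frees anything. */
static void demo_put(struct demo_dev *d)
{
	atomic_fetch_sub(&d->holders, 1);
}

/* Analogue of dm_destroy(): poll until every reference is gone, then free. */
static void demo_destroy(struct demo_dev *d)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000L };

	atomic_store(&d->freeing, true);
	demo_put(d);			/* drop the creator's reference */
	while (atomic_load(&d->holders) > 0)
		nanosleep(&one_ms, NULL);	/* userspace analogue of msleep(1) */
	free(d);
}
```

Because demo_put() only decrements, it never sleeps or frees memory, which mirrors the reason the patch gives for dm_put() being callable from any context; all teardown is concentrated in demo_destroy().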
2010-08-12 04:13:56 +01:00
# include <linux/delay.h>
2014-10-28 18:34:52 -04:00
# include <linux/wait.h>
2015-10-15 14:10:51 +02:00
# include <linux/pr.h>
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print,
whereas blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      dd                   dd + ioctl blktrace   dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s      7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 13:43:05 +08:00
2006-06-26 00:27:35 -07:00
# define DM_MSG_PREFIX "core"
2011-10-31 20:18:54 +00:00
# ifdef CONFIG_PRINTK
/*
 * ratelimit state to be used in DMXXX_LIMIT().
 */
DEFINE_RATELIMIT_STATE ( dm_ratelimit_state ,
DEFAULT_RATELIMIT_INTERVAL ,
DEFAULT_RATELIMIT_BURST ) ;
EXPORT_SYMBOL ( dm_ratelimit_state ) ;
# endif
2009-06-22 10:12:30 +01:00
/*
 * Cookies are numeric values sent with CHANGE and REMOVE
 * uevents while resuming, removing or renaming the device.
 */
# define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
# define DM_COOKIE_LENGTH 24
2005-04-16 15:20:36 -07:00
static const char * _name = DM_NAME ;
static unsigned int major = 0 ;
static unsigned int _major = 0 ;
2011-08-02 12:32:01 +01:00
static DEFINE_IDR ( _minor_idr ) ;
2006-06-26 00:27:22 -07:00
static DEFINE_SPINLOCK ( _minor_lock ) ;
2013-11-01 18:27:41 -04:00
static void do_deferred_remove ( struct work_struct * w ) ;
static DECLARE_WORK ( deferred_remove_work , do_deferred_remove ) ;
2014-06-14 13:44:31 -04:00
static struct workqueue_struct * deferred_remove_workqueue ;
2005-04-16 15:20:36 -07:00
/*
 * One of these is allocated per bio.
 */
struct dm_io {
struct mapped_device * md ;
int error ;
atomic_t io_count ;
2008-07-21 12:00:28 +01:00
struct bio * bio ;
2006-02-01 03:04:53 -08:00
unsigned long start_time ;
2009-10-16 23:18:15 +01:00
spinlock_t endio_lock ;
2013-08-16 10:54:23 -04:00
struct dm_stats_aux stats_aux ;
2005-04-16 15:20:36 -07:00
} ;
2006-06-26 00:27:21 -07:00
# define MINOR_ALLOCED ((void *)-1)
2005-04-16 15:20:36 -07:00
/*
 * Bits for the md->flags field.
 */
2009-04-09 00:27:14 +01:00
# define DMF_BLOCK_IO_FOR_SUSPEND 0
2005-04-16 15:20:36 -07:00
# define DMF_SUSPENDED 1
2006-01-06 00:20:06 -08:00
# define DMF_FROZEN 2
2006-06-26 00:27:23 -07:00
# define DMF_FREEING 3
2006-06-26 00:27:34 -07:00
# define DMF_DELETING 4
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but returned to applications
with -EIO.
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
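The locking rule described in the commit message above (the flag check and the queueing onto md->pushback must happen atomically under md->pushback_lock) can be modelled in userspace. This is a sketch under stated assumptions, not the kernel code: struct demo_md and the demo_*() helpers are hypothetical, a C11 atomic_flag spinlock stands in for the irq-disabling spinlock, and a counter stands in for the bio list. Unlike dm_suspend(), the sketch also takes the lock when changing the flag, purely to keep the plain bool free of data races.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the fields involved; not struct mapped_device. */
struct demo_md {
	atomic_flag pushback_lock;	/* stands in for md->pushback_lock */
	bool noflush_suspending;	/* stands in for DMF_NOFLUSH_SUSPENDING */
	unsigned int pushback_len;	/* stands in for the md->pushback list */
};

static void demo_md_init(struct demo_md *md)
{
	atomic_flag_clear(&md->pushback_lock);
	md->noflush_suspending = false;
	md->pushback_len = 0;
}

static void demo_lock(struct demo_md *md)
{
	while (atomic_flag_test_and_set_explicit(&md->pushback_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void demo_unlock(struct demo_md *md)
{
	atomic_flag_clear_explicit(&md->pushback_lock, memory_order_release);
}

/* Suspend side: set or clear the flag (locked here only for simplicity). */
static void demo_set_noflush(struct demo_md *md, bool on)
{
	demo_lock(md);
	md->noflush_suspending = on;
	demo_unlock(md);
}

/*
 * Completion side: queue the I/O only if a noflush suspend is still in
 * progress. The check and the enqueue share one critical section, so an
 * I/O can never be queued after the suspend path has flushed the list.
 * Returns 0 if queued, -EIO if the I/O must be failed instead.
 */
static int demo_push_back(struct demo_md *md)
{
	int r = -EIO;

	assert(md != NULL);
	demo_lock(md);
	if (md->noflush_suspending) {
		md->pushback_len++;
		r = 0;
	}
	demo_unlock(md);
	return r;
}
```

An earlier lockless check of the flag, as the commit message permits, would only be an optimisation in this model: the decision that counts is the one made inside the critical section.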
2006-12-08 02:41:09 -08:00
# define DMF_NOFLUSH_SUSPENDING 5
2015-04-27 23:48:34 -07:00
# define DMF_DEFERRED_REMOVE 6
# define DMF_SUSPENDED_INTERNALLY 7
2005-04-16 15:20:36 -07:00
2016-02-22 12:16:21 -05:00
# define DM_NUMA_NODE NUMA_NO_NODE
static int dm_numa_node = DM_NUMA_NODE ;
2016-01-28 16:52:56 -05:00
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.
                   type of underlying device
                 bio-based      request-based
----------------------------------------------
bio-based           OK                OK
request-based       --                OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, the type can't be changed.
o Mempool allocations are deferred until table loading time, since
mempools for request-based dm are different from those for bio-based
dm and the needed mempool type is fixed by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
requires that all target drivers support the merge() function.
Both will take time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
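The stacking table in the commit message above reduces to a single rule, sketched below. The enum and the function name are hypothetical and exist only for illustration; the commit message says the kernel recognizes the device type through a queue flag, which this sketch does not model.

```c
#include <stdbool.h>

/* Hypothetical device types for illustration only. */
enum demo_dev_type {
	DEMO_BIO_BASED,
	DEMO_REQUEST_BASED,
};

/*
 * May a dm device of type 'upper' be stacked on an underlying device of
 * type 'lower'? Per the table, the only forbidden combination is a
 * request-based device on top of a bio-based one, because a bio-based
 * device cannot receive struct request.
 */
static bool demo_can_stack(enum demo_dev_type upper, enum demo_dev_type lower)
{
	return !(upper == DEMO_REQUEST_BASED && lower == DEMO_BIO_BASED);
}
```

For example, demo_can_stack(DEMO_BIO_BASED, DEMO_REQUEST_BASED) is true, while demo_can_stack(DEMO_REQUEST_BASED, DEMO_BIO_BASED) is false.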
2009-06-22 10:12:36 +01:00
/*
 * For mempools pre-allocation at the table loading time.
 */
struct dm_md_mempools {
mempool_t * io_pool ;
2014-12-05 17:11:05 -05:00
mempool_t * rq_pool ;
2009-06-22 10:12:36 +01:00
struct bio_set * bs ;
} ;
2014-08-13 13:53:43 -05:00
struct table_device {
struct list_head list ;
atomic_t count ;
struct dm_dev dm_dev ;
} ;
2006-12-06 20:33:20 -08:00
static struct kmem_cache * _io_cache ;
2009-01-06 03:05:06 +00:00
static struct kmem_cache * _rq_tio_cache ;
2014-12-05 17:11:05 -05:00
static struct kmem_cache * _rq_cache ;
2012-09-07 13:44:01 -07:00
2013-09-12 18:06:12 -04:00
/*
 * Bio-based DM's mempools' reserved IOs set by the user.
 */
2016-05-12 16:28:10 -04:00
# define RESERVED_BIO_BASED_IOS 16
2013-09-12 18:06:12 -04:00
static unsigned reserved_bio_based_ios = RESERVED_BIO_BASED_IOS ;
2016-02-22 12:16:21 -05:00
static int __dm_get_module_param_int ( int * module_param , int min , int max )
{
int param = ACCESS_ONCE ( * module_param ) ;
int modified_param = 0 ;
bool modified = true ;
if ( param < min )
modified_param = min ;
else if ( param > max )
modified_param = max ;
else
modified = false ;
if ( modified ) {
( void ) cmpxchg ( module_param , param , modified_param ) ;
param = modified_param ;
}
return param ;
}
2016-05-12 16:28:10 -04:00
unsigned __dm_get_module_param ( unsigned * module_param ,
unsigned def , unsigned max )
2013-09-12 18:06:12 -04:00
{
2015-02-27 22:25:26 -05:00
unsigned param = ACCESS_ONCE ( * module_param ) ;
unsigned modified_param = 0 ;
2013-09-12 18:06:12 -04:00
2015-02-27 22:25:26 -05:00
if ( ! param )
modified_param = def ;
else if ( param > max )
modified_param = max ;
2013-09-12 18:06:12 -04:00
2015-02-27 22:25:26 -05:00
if ( modified_param ) {
( void ) cmpxchg ( module_param , param , modified_param ) ;
param = modified_param ;
2013-09-12 18:06:12 -04:00
}
2015-02-27 22:25:26 -05:00
return param ;
2013-09-12 18:06:12 -04:00
}
2013-09-12 18:06:12 -04:00
unsigned dm_get_reserved_bio_based_ios ( void )
{
2015-02-27 22:25:26 -05:00
return __dm_get_module_param ( & reserved_bio_based_ios ,
2016-05-12 16:28:10 -04:00
RESERVED_BIO_BASED_IOS , DM_RESERVED_MAX_IOS ) ;
2013-09-12 18:06:12 -04:00
}
EXPORT_SYMBOL_GPL ( dm_get_reserved_bio_based_ios ) ;
2016-02-22 12:16:21 -05:00
static unsigned dm_get_numa_node ( void )
{
return __dm_get_module_param_int ( & dm_numa_node ,
DM_NUMA_NODE , num_online_nodes ( ) - 1 ) ;
}
2005-04-16 15:20:36 -07:00
static int __init local_init ( void )
{
2008-10-21 17:45:08 +01:00
int r = - ENOMEM ;
2005-04-16 15:20:36 -07:00
/* allocate a slab for the dm_ios */
2007-07-12 17:26:32 +01:00
_io_cache = KMEM_CACHE ( dm_io , 0 ) ;
2005-04-16 15:20:36 -07:00
if ( ! _io_cache )
2008-10-21 17:45:08 +01:00
return r ;
2005-04-16 15:20:36 -07:00
2009-01-06 03:05:06 +00:00
_rq_tio_cache = KMEM_CACHE ( dm_rq_target_io , 0 ) ;
if ( ! _rq_tio_cache )
2012-10-12 21:02:15 +01:00
goto out_free_io_cache ;
2009-01-06 03:05:06 +00:00
2016-02-20 13:45:38 -05:00
_rq_cache = kmem_cache_create ( "dm_old_clone_request" , sizeof ( struct request ) ,
2014-12-05 17:11:05 -05:00
__alignof__ ( struct request ) , 0 , NULL ) ;
if ( ! _rq_cache )
goto out_free_rq_tio_cache ;
2007-10-19 22:48:00 +01:00
r = dm_uevent_init ( ) ;
2008-10-21 17:45:08 +01:00
if ( r )
2014-12-05 17:11:05 -05:00
goto out_free_rq_cache ;
2007-10-19 22:48:00 +01:00
2014-06-14 13:44:31 -04:00
deferred_remove_workqueue = alloc_workqueue ( "kdmremove" , WQ_UNBOUND , 1 ) ;
if ( ! deferred_remove_workqueue ) {
r = - ENOMEM ;
goto out_uevent_exit ;
}
2005-04-16 15:20:36 -07:00
_major = major ;
r = register_blkdev ( _major , _name ) ;
2008-10-21 17:45:08 +01:00
if ( r < 0 )
2014-06-14 13:44:31 -04:00
goto out_free_workqueue ;
2005-04-16 15:20:36 -07:00
if ( ! _major )
_major = r ;
return 0 ;
2008-10-21 17:45:08 +01:00
2014-06-14 13:44:31 -04:00
out_free_workqueue :
destroy_workqueue ( deferred_remove_workqueue ) ;
2008-10-21 17:45:08 +01:00
out_uevent_exit :
dm_uevent_exit ( ) ;
2014-12-05 17:11:05 -05:00
out_free_rq_cache :
kmem_cache_destroy ( _rq_cache ) ;
2009-01-06 03:05:06 +00:00
out_free_rq_tio_cache :
kmem_cache_destroy ( _rq_tio_cache ) ;
2008-10-21 17:45:08 +01:00
out_free_io_cache :
kmem_cache_destroy ( _io_cache ) ;
return r ;
2005-04-16 15:20:36 -07:00
}
static void local_exit ( void )
{
2013-11-01 18:27:41 -04:00
flush_scheduled_work ( ) ;
2014-06-14 13:44:31 -04:00
destroy_workqueue ( deferred_remove_workqueue ) ;
2013-11-01 18:27:41 -04:00
2014-12-05 17:11:05 -05:00
kmem_cache_destroy ( _rq_cache ) ;
2009-01-06 03:05:06 +00:00
kmem_cache_destroy ( _rq_tio_cache ) ;
2005-04-16 15:20:36 -07:00
kmem_cache_destroy ( _io_cache ) ;
2007-07-17 04:03:46 -07:00
unregister_blkdev ( _major , _name ) ;
2007-10-19 22:48:00 +01:00
dm_uevent_exit ( ) ;
2005-04-16 15:20:36 -07:00
_major = 0 ;
DMINFO ( "cleaned up" ) ;
}
2008-02-08 02:09:51 +00:00
static int ( * _inits [ ] ) ( void ) __initdata = {
2005-04-16 15:20:36 -07:00
local_init ,
dm_target_init ,
dm_linear_init ,
dm_stripe_init ,
2009-12-10 23:51:57 +00:00
dm_io_init ,
2008-04-24 21:43:49 +01:00
dm_kcopyd_init ,
2005-04-16 15:20:36 -07:00
dm_interface_init ,
2013-08-16 10:54:23 -04:00
dm_statistics_init ,
2005-04-16 15:20:36 -07:00
} ;
2008-02-08 02:09:51 +00:00
static void ( * _exits [ ] ) ( void ) = {
2005-04-16 15:20:36 -07:00
local_exit ,
dm_target_exit ,
dm_linear_exit ,
dm_stripe_exit ,
2009-12-10 23:51:57 +00:00
dm_io_exit ,
2008-04-24 21:43:49 +01:00
dm_kcopyd_exit ,
2005-04-16 15:20:36 -07:00
dm_interface_exit ,
2013-08-16 10:54:23 -04:00
dm_statistics_exit ,
2005-04-16 15:20:36 -07:00
} ;
static int __init dm_init(void)
{
	const int count = ARRAY_SIZE(_inits);
	int r, i;
	for (i = 0; i < count; i++) {
		r = _inits[i]();
		if (r)
			goto bad;
	}
	return 0;
bad:
	while (i--)
		_exits[i]();
	return r;
}
static void __exit dm_exit(void)
{
	int i = ARRAY_SIZE(_exits);
	while (i--)
		_exits[i]();
2011-08-02 12:32:01 +01:00
	/*
	 * Should be empty by this point.
	 */
	idr_destroy(&_minor_idr);
2005-04-16 15:20:36 -07:00
}
/*
* Block device functions
*/
2009-12-10 23:52:20 +00:00
int dm_deleting_md(struct mapped_device *md)
{
	return test_bit(DMF_DELETING, &md->flags);
}
2008-03-02 10:29:31 -05:00
static int dm_blk_open(struct block_device *bdev, fmode_t mode)
2005-04-16 15:20:36 -07:00
{
	struct mapped_device *md;
2006-06-26 00:27:23 -07:00
	spin_lock(&_minor_lock);
2008-03-02 10:29:31 -05:00
	md = bdev->bd_disk->private_data;
2006-06-26 00:27:23 -07:00
	if (!md)
		goto out;
2006-06-26 00:27:34 -07:00
	if (test_bit(DMF_FREEING, &md->flags) ||
2009-12-10 23:52:20 +00:00
	    dm_deleting_md(md)) {
2006-06-26 00:27:23 -07:00
		md = NULL;
		goto out;
	}
2005-04-16 15:20:36 -07:00
	dm_get(md);
2006-06-26 00:27:34 -07:00
	atomic_inc(&md->open_count);
2006-06-26 00:27:23 -07:00
out:
	spin_unlock(&_minor_lock);
	return md ? 0 : -ENXIO;
2005-04-16 15:20:36 -07:00
}
2013-05-05 21:52:57 -04:00
static void dm_blk_close(struct gendisk *disk, fmode_t mode)
2005-04-16 15:20:36 -07:00
{
2015-03-23 17:01:43 -04:00
	struct mapped_device *md;
2010-08-07 18:25:34 +02:00
2011-01-13 19:59:48 +00:00
	spin_lock(&_minor_lock);
2015-03-23 17:01:43 -04:00
	md = disk->private_data;
	if (WARN_ON(!md))
		goto out;
2013-11-01 18:27:41 -04:00
	if (atomic_dec_and_test(&md->open_count) &&
	    (test_bit(DMF_DEFERRED_REMOVE, &md->flags)))
2014-06-14 13:44:31 -04:00
		queue_work(deferred_remove_workqueue, &deferred_remove_work);
2013-11-01 18:27:41 -04:00
2005-04-16 15:20:36 -07:00
	dm_put(md);
2015-03-23 17:01:43 -04:00
out:
2011-01-13 19:59:48 +00:00
	spin_unlock(&_minor_lock);
2005-04-16 15:20:36 -07:00
}
2006-06-26 00:27:34 -07:00
int dm_open_count(struct mapped_device *md)
{
	return atomic_read(&md->open_count);
}
/*
 * Guarantees nothing is using the device before it's deleted.
 */
2013-11-01 18:27:41 -04:00
int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only_deferred)
2006-06-26 00:27:34 -07:00
{
	int r = 0;
	spin_lock(&_minor_lock);
2013-11-01 18:27:41 -04:00
	if (dm_open_count(md)) {
2006-06-26 00:27:34 -07:00
		r = -EBUSY;
2013-11-01 18:27:41 -04:00
		if (mark_deferred)
			set_bit(DMF_DEFERRED_REMOVE, &md->flags);
	} else if (only_deferred && !test_bit(DMF_DEFERRED_REMOVE, &md->flags))
		r = -EEXIST;
2006-06-26 00:27:34 -07:00
	else
		set_bit(DMF_DELETING, &md->flags);
	spin_unlock(&_minor_lock);
	return r;
}
2013-11-01 18:27:41 -04:00
int dm_cancel_deferred_remove(struct mapped_device *md)
{
	int r = 0;
	spin_lock(&_minor_lock);
	if (test_bit(DMF_DELETING, &md->flags))
		r = -EBUSY;
	else
		clear_bit(DMF_DEFERRED_REMOVE, &md->flags);
	spin_unlock(&_minor_lock);
	return r;
}
static void do_deferred_remove ( struct work_struct * w )
{
dm_deferred_remove ( ) ;
}
2013-08-16 10:54:23 -04:00
sector_t dm_get_size(struct mapped_device *md)
{
	return get_capacity(md->disk);
}
2014-02-28 15:33:43 +01:00
struct request_queue *dm_get_md_queue(struct mapped_device *md)
{
	return md->queue;
}
2013-08-16 10:54:23 -04:00
struct dm_stats *dm_get_stats(struct mapped_device *md)
{
	return &md->stats;
}
2006-03-27 01:17:54 -08:00
static int dm_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
{
	struct mapped_device *md = bdev->bd_disk->private_data;
	return dm_get_geometry(md, geo);
}
2016-02-18 16:13:51 -05:00
static int dm_grab_bdev_for_ioctl(struct mapped_device *md,
				  struct block_device **bdev,
				  fmode_t *mode)
2006-10-03 01:15:15 -07:00
{
2016-02-18 15:44:39 -05:00
	struct dm_target *tgt;
2013-07-10 23:41:15 +01:00
	struct dm_table *map;
2016-02-18 16:13:51 -05:00
	int srcu_idx, r;
2006-10-03 01:15:15 -07:00
2013-07-10 23:41:15 +01:00
retry:
2015-10-15 14:10:50 +02:00
	r = -ENOTTY;
2016-02-18 16:13:51 -05:00
	map = dm_get_live_table(md, &srcu_idx);
2006-10-03 01:15:15 -07:00
	if (!map || !dm_table_get_size(map))
		goto out;
	/* We only support devices that have a single target */
	if (dm_table_get_num_targets(map) != 1)
		goto out;
2016-02-18 15:44:39 -05:00
	tgt = dm_table_get_target(map, 0);
	if (!tgt->type->prepare_ioctl)
2014-11-16 14:21:47 -05:00
		goto out;
2006-10-03 01:15:15 -07:00
2009-12-10 23:52:26 +00:00
	if (dm_suspended_md(md)) {
2006-10-03 01:15:15 -07:00
		r = -EAGAIN;
		goto out;
	}
2016-02-18 15:44:39 -05:00
	r = tgt->type->prepare_ioctl(tgt, bdev, mode);
2015-10-15 14:10:50 +02:00
	if (r < 0)
		goto out;
2006-10-03 01:15:15 -07:00
2016-02-18 16:13:51 -05:00
	bdgrab(*bdev);
	dm_put_live_table(md, srcu_idx);
2015-10-15 14:10:50 +02:00
	return r;
2006-10-03 01:15:15 -07:00
out:
2016-02-18 16:13:51 -05:00
	dm_put_live_table(md, srcu_idx);
2015-11-17 09:39:26 +00:00
	if (r == -ENOTCONN && !fatal_signal_pending(current)) {
2013-07-10 23:41:15 +01:00
		msleep(10);
		goto retry;
	}
2015-10-15 14:10:50 +02:00
	return r;
}
static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
			unsigned int cmd, unsigned long arg)
{
	struct mapped_device *md = bdev->bd_disk->private_data;
2016-02-18 16:13:51 -05:00
	int r;
2015-10-15 14:10:50 +02:00
2016-02-18 16:13:51 -05:00
	r = dm_grab_bdev_for_ioctl(md, &bdev, &mode);
2015-10-15 14:10:50 +02:00
	if (r < 0)
		return r;
2013-07-10 23:41:15 +01:00
2015-10-15 14:10:50 +02:00
	if (r > 0) {
		/*
		 * Target determined this ioctl is being issued against
		 * a logical partition of the parent bdev; so extra
		 * validation is needed.
		 */
		r = scsi_verify_blk_ioctl(NULL, cmd);
		if (r)
			goto out;
	}
2013-07-10 23:41:15 +01:00
2016-02-18 15:44:39 -05:00
	r = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
2015-10-15 14:10:50 +02:00
out:
2016-02-18 16:13:51 -05:00
	bdput(bdev);
2006-10-03 01:15:15 -07:00
	return r;
}
2007-07-12 17:26:32 +01:00
static struct dm_io *alloc_io(struct mapped_device *md)
2005-04-16 15:20:36 -07:00
{
	return mempool_alloc(md->io_pool, GFP_NOIO);
}
2007-07-12 17:26:32 +01:00
static void free_io(struct mapped_device *md, struct dm_io *io)
2005-04-16 15:20:36 -07:00
{
	mempool_free(io, md->io_pool);
}
2016-04-11 12:05:38 -04:00
static void free_tio(struct dm_target_io *tio)
2005-04-16 15:20:36 -07:00
{
2012-10-12 21:02:15 +01:00
	bio_put(&tio->clone);
2005-04-16 15:20:36 -07:00
}
2016-05-12 16:28:10 -04:00
int md_in_flight(struct mapped_device *md)
2009-12-10 23:52:13 +00:00
{
	return atomic_read(&md->pending[READ]) +
	       atomic_read(&md->pending[WRITE]);
}
2006-02-01 03:04:53 -08:00
static void start_io_acct(struct dm_io *io)
{
	struct mapped_device *md = io->md;
2013-08-16 10:54:23 -04:00
	struct bio *bio = io->bio;
2008-08-25 19:47:21 +09:00
	int cpu;
2013-08-16 10:54:23 -04:00
	int rw = bio_data_dir(bio);
2006-02-01 03:04:53 -08:00
	io->start_time = jiffies;
2008-08-25 19:56:14 +09:00
	cpu = part_stat_lock();
	part_round_stats(cpu, &dm_disk(md)->part0);
	part_stat_unlock();
2011-03-22 08:35:35 +01:00
	atomic_set(&dm_disk(md)->part0.in_flight[rw],
		   atomic_inc_return(&md->pending[rw]));
2013-08-16 10:54:23 -04:00
	if (unlikely(dm_stats_used(&md->stats)))
2016-06-05 14:32:03 -05:00
		dm_stats_account_io(&md->stats, bio_data_dir(bio),
				    bio->bi_iter.bi_sector, bio_sectors(bio),
				    false, 0, &io->stats_aux);
2006-02-01 03:04:53 -08:00
}
2008-11-13 23:39:10 +00:00
static void end_io_acct ( struct dm_io * io )
2006-02-01 03:04:53 -08:00
{
struct mapped_device * md = io - > md ;
struct bio * bio = io - > bio ;
unsigned long duration = jiffies - io - > start_time ;
2014-11-24 11:05:26 +08:00
int pending ;
2006-02-01 03:04:53 -08:00
int rw = bio_data_dir ( bio ) ;
2014-11-24 11:05:26 +08:00
generic_end_io_acct ( rw , & dm_disk ( md ) - > part0 , io - > start_time ) ;
2006-02-01 03:04:53 -08:00
2013-08-16 10:54:23 -04:00
if ( unlikely ( dm_stats_used ( & md - > stats ) ) )
2016-06-05 14:32:03 -05:00
dm_stats_account_io ( & md - > stats , bio_data_dir ( bio ) ,
bio - > bi_iter . bi_sector , bio_sectors ( bio ) ,
true , duration , & io - > stats_aux ) ;
2013-08-16 10:54:23 -04:00
2009-04-09 00:27:16 +01:00
/*
* After this is decremented the bio must not be touched if it is
2010-09-03 11:56:19 +02:00
* a flush .
2009-04-09 00:27:16 +01:00
*/
2011-03-22 08:35:35 +01:00
pending = atomic_dec_return ( & md - > pending [ rw ] ) ;
atomic_set ( & dm_disk ( md ) - > part0 . in_flight [ rw ] , pending ) ;
block: Seperate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests, and exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b.
The problem was in part_round_stats_single(), where I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 20:16:55 +02:00
pending + = atomic_read ( & md - > pending [ rw ^ 0x1 ] ) ;
2006-02-01 03:04:53 -08:00
2008-11-13 23:39:10 +00:00
/* nudge anyone waiting on suspend queue */
if ( ! pending )
wake_up ( & md - > wait ) ;
2006-02-01 03:04:53 -08:00
}
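The accounting pattern in start_io_acct() and end_io_acct() can be illustrated with a small userspace sketch. It is a single-threaded model, not the kernel code: plain ints stand in for the atomic counters, and the names sketch_md, sketch_start_io and sketch_end_io are hypothetical. It only shows how the per-direction counters are combined (using rw ^ 0x1 to reach the other direction) to decide whether a waiter should be woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for READ and WRITE as array indices. */
enum { SKETCH_READ = 0, SKETCH_WRITE = 1 };

/* Models md->pending[] with plain ints; the kernel uses atomic_t. */
struct sketch_md {
	int pending[2];
};

/* Count one more in-flight I/O in the given direction. */
static void sketch_start_io(struct sketch_md *md, int rw)
{
	md->pending[rw]++;
}

/*
 * Drop one in-flight I/O and report whether nothing is left in
 * either direction, i.e. whether a suspend waiter should be woken.
 */
static bool sketch_end_io(struct sketch_md *md, int rw)
{
	int pending;

	assert(md->pending[rw] > 0);
	pending = --md->pending[rw];
	/* rw ^ 0x1 flips 0 <-> 1, selecting the opposite direction. */
	pending += md->pending[rw ^ 0x1];
	return pending == 0;
}
```

In this model the wake-up condition is reported only by the call that completes the last outstanding I/O, whichever direction it belongs to.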
2005-04-16 15:20:36 -07:00
/*
* Add the bio to the list of deferred io .
*/
2009-04-09 00:27:15 +01:00
static void queue_io ( struct mapped_device * md , struct bio * bio )
2005-04-16 15:20:36 -07:00
{
2010-09-08 18:07:01 +02:00
unsigned long flags ;
2005-04-16 15:20:36 -07:00
2010-09-08 18:07:01 +02:00
spin_lock_irqsave ( & md - > deferred_lock , flags ) ;
2005-04-16 15:20:36 -07:00
bio_list_add ( & md - > deferred , bio ) ;
2010-09-08 18:07:01 +02:00
spin_unlock_irqrestore ( & md - > deferred_lock , flags ) ;
2010-09-08 18:07:00 +02:00
queue_work ( md - > wq , & md - > work ) ;
2005-04-16 15:20:36 -07:00
}
/*
* Everyone ( including functions in this file ) , should use this
* function to access the md - > map field , and make sure they call
2013-07-10 23:41:18 +01:00
* dm_put_live_table ( ) when finished .
2005-04-16 15:20:36 -07:00
*/
2013-07-10 23:41:18 +01:00
struct dm_table * dm_get_live_table ( struct mapped_device * md , int * srcu_idx ) __acquires ( md - > io_barrier )
2005-04-16 15:20:36 -07:00
{
2013-07-10 23:41:18 +01:00
* srcu_idx = srcu_read_lock ( & md - > io_barrier ) ;
return srcu_dereference ( md - > map , & md - > io_barrier ) ;
}
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
void dm_put_live_table ( struct mapped_device * md , int srcu_idx ) __releases ( md - > io_barrier )
{
srcu_read_unlock ( & md - > io_barrier , srcu_idx ) ;
}
void dm_sync_table ( struct mapped_device * md )
{
synchronize_srcu ( & md - > io_barrier ) ;
synchronize_rcu_expedited ( ) ;
}
/*
* A fast alternative to dm_get_live_table / dm_put_live_table .
* The caller must not block between these two functions .
*/
static struct dm_table * dm_get_live_table_fast ( struct mapped_device * md ) __acquires ( RCU )
{
rcu_read_lock ( ) ;
return rcu_dereference ( md - > map ) ;
}
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
static void dm_put_live_table_fast ( struct mapped_device * md ) __releases ( RCU )
{
rcu_read_unlock ( ) ;
2005-04-16 15:20:36 -07:00
}
2014-08-13 13:53:43 -05:00
/*
* Open a table device so we can use it as a map destination .
*/
static int open_table_device ( struct table_device * td , dev_t dev ,
struct mapped_device * md )
{
static char * _claim_ptr = " I belong to device-mapper " ;
struct block_device * bdev ;
int r ;
BUG_ON ( td - > dm_dev . bdev ) ;
bdev = blkdev_get_by_dev ( dev , td - > dm_dev . mode | FMODE_EXCL , _claim_ptr ) ;
if ( IS_ERR ( bdev ) )
return PTR_ERR ( bdev ) ;
r = bd_link_disk_holder ( bdev , dm_disk ( md ) ) ;
if ( r ) {
blkdev_put ( bdev , td - > dm_dev . mode | FMODE_EXCL ) ;
return r ;
}
td - > dm_dev . bdev = bdev ;
return 0 ;
}
/*
* Close a table device that we ' ve been using .
*/
static void close_table_device ( struct table_device * td , struct mapped_device * md )
{
if ( ! td - > dm_dev . bdev )
return ;
bd_unlink_disk_holder ( td - > dm_dev . bdev , dm_disk ( md ) ) ;
blkdev_put ( td - > dm_dev . bdev , td - > dm_dev . mode | FMODE_EXCL ) ;
td - > dm_dev . bdev = NULL ;
}
static struct table_device * find_table_device ( struct list_head * l , dev_t dev ,
fmode_t mode ) {
struct table_device * td ;
list_for_each_entry ( td , l , list )
if ( td - > dm_dev . bdev - > bd_dev = = dev & & td - > dm_dev . mode = = mode )
return td ;
return NULL ;
}
int dm_get_table_device ( struct mapped_device * md , dev_t dev , fmode_t mode ,
struct dm_dev * * result ) {
int r ;
struct table_device * td ;
mutex_lock ( & md - > table_devices_lock ) ;
td = find_table_device ( & md - > table_devices , dev , mode ) ;
if ( ! td ) {
2016-02-22 12:16:21 -05:00
td = kmalloc_node ( sizeof ( * td ) , GFP_KERNEL , md - > numa_node_id ) ;
2014-08-13 13:53:43 -05:00
if ( ! td ) {
mutex_unlock ( & md - > table_devices_lock ) ;
return - ENOMEM ;
}
td - > dm_dev . mode = mode ;
td - > dm_dev . bdev = NULL ;
if ( ( r = open_table_device ( td , dev , md ) ) ) {
mutex_unlock ( & md - > table_devices_lock ) ;
kfree ( td ) ;
return r ;
}
format_dev_t ( td - > dm_dev . name , dev ) ;
atomic_set ( & td - > count , 0 ) ;
list_add ( & td - > list , & md - > table_devices ) ;
}
atomic_inc ( & td - > count ) ;
mutex_unlock ( & md - > table_devices_lock ) ;
* result = & td - > dm_dev ;
return 0 ;
}
EXPORT_SYMBOL_GPL ( dm_get_table_device ) ;
void dm_put_table_device ( struct mapped_device * md , struct dm_dev * d )
{
struct table_device * td = container_of ( d , struct table_device , dm_dev ) ;
mutex_lock ( & md - > table_devices_lock ) ;
if ( atomic_dec_and_test ( & td - > count ) ) {
close_table_device ( td , md ) ;
list_del ( & td - > list ) ;
kfree ( td ) ;
}
mutex_unlock ( & md - > table_devices_lock ) ;
}
EXPORT_SYMBOL ( dm_put_table_device ) ;
static void free_table_devices ( struct list_head * devices )
{
struct list_head * tmp , * next ;
list_for_each_safe ( tmp , next , devices ) {
struct table_device * td = list_entry ( tmp , struct table_device , list ) ;
DMWARN ( " dm_destroy: %s still exists with %d references " ,
td - > dm_dev . name , atomic_read ( & td - > count ) ) ;
kfree ( td ) ;
}
}
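The lookup-or-create and reference-counting pattern of dm_get_table_device() and dm_put_table_device() can be sketched in userspace as follows. This is a simplified model under stated assumptions: a singly linked list replaces the kernel list, a plain int replaces the atomic count, no mutex is taken, and no block device is opened. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One entry per (device number, mode) pair, shared by all users. */
struct sketch_dev {
	unsigned int devno;
	int mode;
	int count;                 /* number of outstanding references */
	struct sketch_dev *next;
};

/* Return the existing entry for (devno, mode) or create one; take a reference. */
static struct sketch_dev *sketch_get(struct sketch_dev **head,
				     unsigned int devno, int mode)
{
	struct sketch_dev *d;

	for (d = *head; d; d = d->next)
		if (d->devno == devno && d->mode == mode)
			break;

	if (!d) {
		d = malloc(sizeof(*d));
		if (!d)
			return NULL;
		d->devno = devno;
		d->mode = mode;
		d->count = 0;
		d->next = *head;
		*head = d;
	}

	d->count++;
	return d;
}

/* Drop a reference; unlink and free the entry when the last one goes away. */
static void sketch_put(struct sketch_dev **head, struct sketch_dev *d)
{
	struct sketch_dev **pp;

	assert(d->count > 0);
	if (--d->count > 0)
		return;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == d) {
			*pp = d->next;
			break;
		}
	}
	free(d);
}
```

Two callers asking for the same device and mode share one entry, while a different mode yields a separate entry, which matches the comparison on both fields in find_table_device().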
2006-03-27 01:17:54 -08:00
/*
* Get the geometry associated with a dm device
*/
int dm_get_geometry ( struct mapped_device * md , struct hd_geometry * geo )
{
* geo = md - > geometry ;
return 0 ;
}
/*
* Set the geometry of a device .
*/
int dm_set_geometry ( struct mapped_device * md , struct hd_geometry * geo )
{
sector_t sz = ( sector_t ) geo - > cylinders * geo - > heads * geo - > sectors ;
if ( geo - > start > sz ) {
DMWARN ( " Start sector is beyond the geometry limits. " ) ;
return - EINVAL ;
}
md - > geometry = * geo ;
return 0 ;
}
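The bounds check in dm_set_geometry() can be tried out in userspace with the sketch below. It assumes a structure with the same field names as the one used above; the field widths, the uint64_t stand-in for sector_t, and the sketch_* names are assumptions made for illustration. The cast before the multiplication matters because it widens the product before any narrower intermediate result could be produced.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the geometry fields referenced in the source. */
struct sketch_geometry {
	unsigned char heads;
	unsigned char sectors;
	unsigned short cylinders;
	unsigned long start;
};

/* Reject a start sector that lies beyond cylinders * heads * sectors. */
static int sketch_check_geometry(const struct sketch_geometry *geo)
{
	/* Widen first so the whole product is computed in 64 bits. */
	uint64_t sz = (uint64_t)geo->cylinders * geo->heads * geo->sectors;

	if (geo->start > sz)
		return -EINVAL;
	return 0;
}
```

Note that a start sector exactly equal to the computed size is accepted, because the comparison is strictly greater-than.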
2005-04-16 15:20:36 -07:00
/*-----------------------------------------------------------------
* CRUD START :
* A more elegant soln is in the works that uses the queue
* merge fn , unfortunately there are a couple of changes to
* the block layer that I want to make for this . So in the
* interests of getting something for people to use I give
* you this clearly demarcated crap .
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request that
the device-mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but is returned to applications
with -EIO.
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock, which protects md->deferred, is
an rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
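The precedence rule described in the commit message above (a push-back request supersedes I/O errors, and an aborted noflush suspend turns the push-back into -EIO) can be sketched as a single-threaded userspace model. This is not the kernel function: locking is omitted, a plain int replaces the atomic clone count, and SKETCH_ENDIO_REQUEUE and the other sketch_* names are hypothetical. The only property assumed of the requeue code is that it is positive while real errors are negative.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the requeue code; assumed to be positive. */
#define SKETCH_ENDIO_REQUEUE 2

struct sketch_io {
	int error;     /* 0, a negative errno, or SKETCH_ENDIO_REQUEUE */
	int io_count;  /* clones still outstanding */
};

enum sketch_outcome {
	SKETCH_PENDING,    /* other clones are still in flight */
	SKETCH_REQUEUED,   /* original bio would be put back for resume */
	SKETCH_COMPLETED   /* original bio would be ended with io->error */
};

static enum sketch_outcome sketch_dec_pending(struct sketch_io *io, int error,
					      bool noflush_suspending)
{
	assert(io->io_count > 0);

	/* Keep an already recorded push-back while the noflush suspend runs. */
	if (error && !(io->error > 0 && noflush_suspending))
		io->error = error;

	if (--io->io_count > 0)
		return SKETCH_PENDING;

	if (io->error == SKETCH_ENDIO_REQUEUE) {
		if (noflush_suspending)
			return SKETCH_REQUEUED;
		/* The noflush suspend was aborted: fail the I/O instead. */
		io->error = -EIO;
	}
	return SKETCH_COMPLETED;
}
```

In this model, an I/O split into two clones where one clone asks for a requeue and the other fails with -EIO is still requeued as long as the noflush suspend is in progress.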
2006-12-08 02:41:09 -08:00
static int __noflush_suspending ( struct mapped_device * md )
{
return test_bit ( DMF_NOFLUSH_SUSPENDING , & md - > flags ) ;
}
2005-04-16 15:20:36 -07:00
/*
* Decrements the number of outstanding ios that a bio has been
* cloned into , completing the original io if necessary .
*/
2006-01-14 13:20:43 -08:00
static void dec_pending ( struct dm_io * io , int error )
2005-04-16 15:20:36 -07:00
{
2006-12-08 02:41:09 -08:00
unsigned long flags ;
2009-03-16 17:44:36 +00:00
int io_error ;
struct bio * bio ;
struct mapped_device * md = io - > md ;
2006-12-08 02:41:09 -08:00
/* Push-back supersedes any I/O errors */
2009-10-16 23:18:15 +01:00
if ( unlikely ( error ) ) {
spin_lock_irqsave ( & io - > endio_lock , flags ) ;
if ( ! ( io - > error > 0 & & __noflush_suspending ( md ) ) )
io - > error = error ;
spin_unlock_irqrestore ( & io - > endio_lock , flags ) ;
}
2005-04-16 15:20:36 -07:00
if ( atomic_dec_and_test ( & io - > io_count ) ) {
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supercedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushdback list and pushback_lock
--------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checkings, the bio isn't left in md->pushback but returned to applications
with -EIO.
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of the new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 02:41:09 -08:00
		if (io->error == DM_ENDIO_REQUEUE) {
			/*
			 * Target requested pushing back the I/O.
			 */
2009-04-02 19:55:39 +01:00
			spin_lock_irqsave(&md->deferred_lock, flags);
2010-09-08 18:07:00 +02:00
			if (__noflush_suspending(md))
				bio_list_add_head(&md->deferred, io->bio);
			else
2006-12-08 02:41:09 -08:00
				/* noflush suspend was interrupted. */
				io->error = -EIO;
2009-04-02 19:55:39 +01:00
			spin_unlock_irqrestore(&md->deferred_lock, flags);
2006-12-08 02:41:09 -08:00
		}
2009-03-16 17:44:36 +00:00
		io_error = io->error;
		bio = io->bio;
2010-09-08 18:07:00 +02:00
		end_io_acct(io);
		free_io(md, io);
		if (io_error == DM_ENDIO_REQUEUE)
			return;
2006-12-08 02:41:09 -08:00
2016-06-05 14:32:25 -05:00
		if ((bio->bi_rw & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
2009-04-09 00:27:16 +01:00
			/*
2010-09-08 18:07:00 +02:00
			 * Preflush done for flush with data, reissue
2016-06-05 14:32:25 -05:00
			 * without REQ_PREFLUSH.
2009-04-09 00:27:16 +01:00
			 */
2016-06-05 14:32:25 -05:00
			bio->bi_rw &= ~REQ_PREFLUSH;
2010-09-08 18:07:00 +02:00
			queue_io(md, bio);
2009-04-09 00:27:16 +01:00
		} else {
2010-09-08 18:07:01 +02:00
			/* done with normal IO or empty flush */
2013-04-18 09:00:26 -07:00
			trace_block_bio_complete(md->queue, bio, io_error);
2015-07-20 15:29:37 +02:00
			bio->bi_error = io_error;
			bio_endio(bio);
2009-03-16 17:44:36 +00:00
		}
2005-04-16 15:20:36 -07:00
	}
}
2016-05-12 16:28:10 -04:00
void disable_write_same(struct mapped_device *md)
2014-06-02 15:50:06 -04:00
{
	struct queue_limits *limits = dm_get_queue_limits(md);
	/* device doesn't really support WRITE SAME, disable it */
	limits->max_write_same_sectors = 0;
}
2015-07-20 15:29:37 +02:00
static void clone_endio(struct bio *bio)
2005-04-16 15:20:36 -07:00
{
2015-07-20 15:29:37 +02:00
	int error = bio->bi_error;
2014-12-17 14:37:04 +08:00
	int r = error;
2014-03-04 18:24:49 -05:00
	struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
2009-03-16 17:44:36 +00:00
	struct dm_io *io = tio->io;
2006-10-03 01:15:41 -07:00
	struct mapped_device *md = tio->io->md;
2005-04-16 15:20:36 -07:00
	dm_endio_fn endio = tio->ti->type->end_io;
	if (endio) {
2012-12-21 20:23:41 +00:00
		r = endio(tio->ti, bio, error);
2006-12-08 02:41:09 -08:00
		if (r < 0 || r == DM_ENDIO_REQUEUE)
			/*
			 * error and requeue request are handled
			 * in dec_pending().
			 */
2005-04-16 15:20:36 -07:00
			error = r;
2006-12-08 02:41:05 -08:00
		else if (r == DM_ENDIO_INCOMPLETE)
			/* The target will handle the io */
2007-09-27 12:47:43 +02:00
			return;
2006-12-08 02:41:05 -08:00
		else if (r) {
			DMWARN("unimplemented target endio return value: %d", r);
			BUG();
		}
2005-04-16 15:20:36 -07:00
	}
2016-06-05 14:32:04 -05:00
	if (unlikely(r == -EREMOTEIO && (bio_op(bio) == REQ_OP_WRITE_SAME) &&
2014-06-02 15:50:06 -04:00
		     !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors))
		disable_write_same(md);
2016-04-11 12:05:38 -04:00
	free_tio(tio);
2009-03-16 17:44:36 +00:00
	dec_pending(io, error);
2005-04-16 15:20:36 -07:00
}
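The way clone_endio() interprets a target's end_io() return value can be summarised with a small user-space sketch. The `SKETCH_*` constants and the `sketch_classify_endio()` helper are hypothetical names, and the numeric values are chosen for illustration rather than copied from the kernel headers. Only the branching order follows the function above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for DM_ENDIO_INCOMPLETE and DM_ENDIO_REQUEUE. */
#define SKETCH_ENDIO_INCOMPLETE 1
#define SKETCH_ENDIO_REQUEUE    2

enum sketch_action {
	SKETCH_COMPLETE,     /* free the clone and call dec_pending() */
	SKETCH_TARGET_OWNS,  /* the target finishes the I/O; the core returns */
	SKETCH_UNSUPPORTED   /* unknown positive value; the kernel calls BUG() */
};

/*
 * Classify the end_io() return value r. *error carries the clone's
 * completion status on entry and is overwritten only when the target
 * reports an error or asks for a requeue.
 */
static enum sketch_action sketch_classify_endio(int r, int *error)
{
	assert(error != NULL);
	if (r < 0 || r == SKETCH_ENDIO_REQUEUE) {
		*error = r;
		return SKETCH_COMPLETE;
	}
	if (r == SKETCH_ENDIO_INCOMPLETE)
		return SKETCH_TARGET_OWNS;
	if (r)
		return SKETCH_UNSUPPORTED;
	return SKETCH_COMPLETE; /* r == 0: keep the incoming error */
}
```

A requeue request therefore travels to dec_pending() through the same `error` argument as a real failure, which is why dec_pending() has to tell the two apart.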
2010-08-12 04:14:10 +01:00
/*
 * Return maximum size of I/O possible at the supplied sector up to the current
 * target boundary.
 */
static sector_t max_io_len_target_boundary(sector_t sector, struct dm_target *ti)
{
	sector_t target_offset = dm_target_offset(ti, sector);
	return ti->len - target_offset;
}
static sector_t max_io_len(sector_t sector, struct dm_target *ti)
2005-04-16 15:20:36 -07:00
{
2010-08-12 04:14:10 +01:00
	sector_t len = max_io_len_target_boundary(sector, ti);
2012-07-27 15:08:00 +01:00
	sector_t offset, max_len;
2005-04-16 15:20:36 -07:00
	/*
2012-07-27 15:08:00 +01:00
	 * Does the target need to split even further?
2005-04-16 15:20:36 -07:00
	 */
2012-07-27 15:08:00 +01:00
	if (ti->max_io_len) {
		offset = dm_target_offset(ti, sector);
		if (unlikely(ti->max_io_len & (ti->max_io_len - 1)))
			max_len = sector_div(offset, ti->max_io_len);
		else
			max_len = offset & (ti->max_io_len - 1);
		max_len = ti->max_io_len - max_len;
		if (len > max_len)
			len = max_len;
2005-04-16 15:20:36 -07:00
	}
	return len;
}
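The arithmetic in max_io_len() can be checked with a small user-space sketch. `sketch_max_io_len()` is a hypothetical helper that takes the offset within the target directly instead of a `struct dm_target`, and it uses the `%` operator where the kernel uses sector_div(), which returns the remainder. Everything is in sectors.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest I/O that may start at target_offset without crossing either
 * the end of the target or the next max_io_len boundary.
 * A max_io_len of 0 means the target imposes no extra split.
 */
static uint64_t sketch_max_io_len(uint64_t target_offset, uint64_t target_len,
				  uint32_t max_io_len)
{
	uint64_t len, used, room;

	assert(target_offset < target_len);
	len = target_len - target_offset;   /* distance to the target boundary */

	if (max_io_len) {
		if (max_io_len & (max_io_len - 1))
			used = target_offset % max_io_len;       /* not a power of two */
		else
			used = target_offset & (max_io_len - 1); /* power of two: mask */
		room = max_io_len - used;       /* sectors left in this chunk */
		if (len > room)
			len = room;
	}
	return len;
}
```

For a 1000-sector target with a max_io_len of 8, an I/O starting at offset 10 is 2 sectors into its chunk, so at most 6 sectors fit before the next boundary. Near the end of the target the target boundary wins instead: at offset 992 with a max_io_len of 64 the chunk has 32 sectors of room but only 8 sectors remain in the target.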
2012-07-27 15:08:00 +01:00
int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
{
	if (len > UINT_MAX) {
		DMERR("Specified maximum size of target IO (%llu) exceeds limit (%u)",
		      (unsigned long long)len, UINT_MAX);
		ti->error = "Maximum size of target IO is too large";
		return -EINVAL;
	}
	ti->max_io_len = (uint32_t) len;
	return 0;
}
EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
2016-06-22 17:54:53 -06:00
static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
2016-07-28 17:22:07 -07:00
void * * kaddr , pfn_t * pfn , long size )
2016-06-22 17:54:53 -06:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
struct dm_table * map ;
struct dm_target * ti ;
int srcu_idx ;
long len , ret = - EIO ;
map = dm_get_live_table ( md , & srcu_idx ) ;
if ( ! map )
goto out ;
ti = dm_table_find_target ( map , sector ) ;
if ( ! dm_target_is_valid ( ti ) )
goto out ;
len = max_io_len ( sector , ti ) < < SECTOR_SHIFT ;
size = min ( len , size ) ;
if ( ti - > type - > direct_access )
ret = ti - > type - > direct_access ( ti , sector , kaddr , pfn , size ) ;
out :
dm_put_live_table ( md , srcu_idx ) ;
return min ( ret , size ) ;
}
2014-03-14 18:41:24 -04:00
/*
* A target may call dm_accept_partial_bio only from the map routine . It is
2016-06-05 14:32:25 -05:00
* allowed for all bio types except REQ_PREFLUSH .
2014-03-14 18:41:24 -04:00
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
* sent in a next bio .
*
* A diagram that explains the arithmetics :
* +--------------------+---------------+-------+
* |         1          |       2       |   3   |
* +--------------------+---------------+-------+
*
* <-------------- *tio->len_ptr --------------->
*                      <------- bi_size ------->
*                      <-- n_sectors -->
*
* Region 1 was already iterated over with bio_advance or similar function .
* ( it may be empty if the target doesn ' t use bio_advance )
* Region 2 is the remaining bio size that the target wants to process .
* ( it may be empty if region 1 is non - empty , although there is no reason
* to make it empty )
* The target requires that region 3 is to be sent in the next bio .
*
* If the target wants to receive multiple copies of the bio ( via num_ * bios , etc ) ,
* the partially processed part ( the sum of regions 1 + 2 ) must be the same for all
* copies of the bio .
*/
void dm_accept_partial_bio ( struct bio * bio , unsigned n_sectors )
{
struct dm_target_io * tio = container_of ( bio , struct dm_target_io , clone ) ;
unsigned bi_size = bio - > bi_iter . bi_size > > SECTOR_SHIFT ;
2016-06-05 14:32:25 -05:00
BUG_ON ( bio - > bi_rw & REQ_PREFLUSH ) ;
2014-03-14 18:41:24 -04:00
BUG_ON ( bi_size > * tio - > len_ptr ) ;
BUG_ON ( n_sectors > bi_size ) ;
* tio - > len_ptr - = bi_size - n_sectors ;
bio - > bi_iter . bi_size = n_sectors < < SECTOR_SHIFT ;
}
EXPORT_SYMBOL_GPL ( dm_accept_partial_bio ) ;
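The arithmetic in dm_accept_partial_bio() can be checked outside the kernel with a small model. This is only a sketch: struct partial_bio_model and model_accept_partial() are hypothetical names that stand in for bio->bi_iter.bi_size and *tio->len_ptr, and assert() stands in for BUG_ON().

```c
#include <assert.h>

#define SECTOR_SHIFT 9

/* Hypothetical stand-in for the two values the helper touches. */
struct partial_bio_model {
	unsigned bi_size_bytes;	/* models bio->bi_iter.bi_size (bytes) */
	unsigned len;		/* models *tio->len_ptr (sectors, regions 1+2+3) */
};

/* Same arithmetic as dm_accept_partial_bio(), on the model struct. */
static void model_accept_partial(struct partial_bio_model *m, unsigned n_sectors)
{
	unsigned bi_size = m->bi_size_bytes >> SECTOR_SHIFT;	/* regions 2+3 */

	assert(bi_size <= m->len);	/* regions 2+3 fit inside 1+2+3 */
	assert(n_sectors <= bi_size);	/* region 2 fits inside 2+3 */
	m->len -= bi_size - n_sectors;	/* drop region 3 from the total */
	m->bi_size_bytes = n_sectors << SECTOR_SHIFT;	/* keep only region 2 */
}
```

With a 16-sector total of which 4 sectors were already advanced over, accepting 8 of the remaining 12 sectors leaves a total of 12 (regions 1 + 2) and a bio size of 8 sectors.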
2013-03-01 22:45:46 +00:00
static void __map_bio ( struct dm_target_io * tio )
2005-04-16 15:20:36 -07:00
{
int r ;
2006-03-23 20:00:26 +01:00
sector_t sector ;
2012-10-12 21:02:15 +01:00
struct bio * clone = & tio - > clone ;
2013-03-01 22:45:46 +00:00
struct dm_target * ti = tio - > ti ;
2005-04-16 15:20:36 -07:00
clone - > bi_end_io = clone_endio ;
/*
* Map the clone . If r = = 0 we don ' t need to do
* anything , the target has assumed ownership of
* this io .
*/
atomic_inc ( & tio - > io - > io_count ) ;
2013-10-11 15:44:27 -07:00
sector = clone - > bi_iter . bi_sector ;
2012-12-21 20:23:41 +00:00
r = ti - > type - > map ( ti , clone ) ;
2006-12-08 02:41:05 -08:00
if ( r = = DM_MAPIO_REMAPPED ) {
2005-04-16 15:20:36 -07:00
/* the bio has been remapped so dispatch it */
2006-03-23 20:00:26 +01:00
2010-11-16 12:52:38 +01:00
trace_block_bio_remap ( bdev_get_queue ( clone - > bi_bdev ) , clone ,
tio - > io - > bio - > bi_bdev - > bd_dev , sector ) ;
2006-03-23 20:00:26 +01:00
2005-04-16 15:20:36 -07:00
generic_make_request ( clone ) ;
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but returned to applications
with -EIO.
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 02:41:09 -08:00
} else if ( r < 0 | | r = = DM_MAPIO_REQUEUE ) {
/* error the io and bail out, or requeue it if needed */
2006-10-03 01:15:41 -07:00
dec_pending ( tio - > io , r ) ;
2016-04-11 12:05:38 -04:00
free_tio ( tio ) ;
2015-07-01 17:30:36 -04:00
} else if ( r ! = DM_MAPIO_SUBMITTED ) {
2006-12-08 02:41:05 -08:00
DMWARN ( " unimplemented target map return value: %d " , r ) ;
BUG ( ) ;
2005-04-16 15:20:36 -07:00
}
}
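The branch structure of __map_bio() above can be summarised as a pure function of the target's return value. The sketch below is illustrative only: the MODEL_MAPIO_* values are assumptions chosen for the example (the real constants come from the device-mapper headers), and enum map_action and classify_map_result() are hypothetical names.

```c
/* Assumed values, for illustration only. */
#define MODEL_MAPIO_SUBMITTED	0
#define MODEL_MAPIO_REMAPPED	1
#define MODEL_MAPIO_REQUEUE	2

enum map_action {
	ACT_NONE,	/* target took ownership of the clone */
	ACT_DISPATCH,	/* core submits the remapped clone */
	ACT_END_IO,	/* core ends the io (error or requeue) and frees the tio */
	ACT_BUG		/* unimplemented return value */
};

/* Same ordering of checks as the if / else-if chain in __map_bio(). */
static enum map_action classify_map_result(int r)
{
	if (r == MODEL_MAPIO_REMAPPED)
		return ACT_DISPATCH;
	if (r < 0 || r == MODEL_MAPIO_REQUEUE)
		return ACT_END_IO;
	if (r != MODEL_MAPIO_SUBMITTED)
		return ACT_BUG;
	return ACT_NONE;
}
```

Note that a requeue request and a negative errno share one branch: both are handed to dec_pending(), which decides what to do with the original bio.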
struct clone_info {
struct mapped_device * md ;
struct dm_table * map ;
struct bio * bio ;
struct dm_io * io ;
sector_t sector ;
2014-03-14 18:40:39 -04:00
unsigned sector_count ;
2005-04-16 15:20:36 -07:00
} ;
2014-03-14 18:40:39 -04:00
static void bio_setup_sector ( struct bio * bio , sector_t sector , unsigned len )
2013-03-01 22:45:46 +00:00
{
2013-10-11 15:44:27 -07:00
bio - > bi_iter . bi_sector = sector ;
bio - > bi_iter . bi_size = to_bytes ( len ) ;
2005-04-16 15:20:36 -07:00
}
/*
* Creates a bio that consists of range of complete bvecs .
*/
2016-03-02 12:33:03 -05:00
static int clone_bio ( struct dm_target_io * tio , struct bio * bio ,
sector_t sector , unsigned len )
2005-04-16 15:20:36 -07:00
{
2012-10-12 21:02:15 +01:00
struct bio * clone = & tio - > clone ;
2005-04-16 15:20:36 -07:00
2013-10-29 17:17:49 -07:00
__bio_clone_fast ( clone , bio ) ;
2016-03-02 12:33:03 -05:00
if ( bio_integrity ( bio ) ) {
int r = bio_integrity_clone ( clone , bio , GFP_NOIO ) ;
if ( r < 0 )
return r ;
}
2013-03-01 22:45:46 +00:00
2013-10-29 17:17:49 -07:00
bio_advance ( clone , to_bytes ( sector - clone - > bi_iter . bi_sector ) ) ;
clone - > bi_iter . bi_size = to_bytes ( len ) ;
if ( bio_integrity ( bio ) )
bio_integrity_trim ( clone , 0 , len ) ;
2016-03-02 12:33:03 -05:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
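What clone_bio() does to the clone's iterator can be pictured as sliding and shrinking a window over the original bio. The following is a simplified sketch that ignores integrity payloads; struct bio_window and clone_window() are hypothetical names, and to_bytes() is redefined locally as a 512-byte-sector conversion.

```c
#include <assert.h>

typedef unsigned long long sector_t;

#define SECTOR_SHIFT 9
#define to_bytes(n) ((unsigned)((n) << SECTOR_SHIFT))

/* Hypothetical model of the iterator fields clone_bio() adjusts. */
struct bio_window {
	sector_t sector;	/* models bi_iter.bi_sector */
	unsigned size_bytes;	/* models bi_iter.bi_size */
};

static void clone_window(struct bio_window *clone, const struct bio_window *orig,
			 sector_t sector, unsigned len)
{
	sector_t skip;

	*clone = *orig;			/* like __bio_clone_fast(): copy the iterator */
	assert(sector >= clone->sector);
	skip = sector - clone->sector;
	assert(to_bytes(skip) <= clone->size_bytes);
	clone->sector += skip;		/* like bio_advance(): move the start forward */
	clone->size_bytes -= to_bytes(skip);
	assert(to_bytes(len) <= clone->size_bytes);
	clone->size_bytes = to_bytes(len);	/* trim the tail to len sectors */
}
```

The original window is left untouched, which is why several clones can be carved out of the same bio one after another.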
2009-06-22 10:12:21 +01:00
static struct dm_target_io * alloc_tio ( struct clone_info * ci ,
2014-10-03 11:55:16 +00:00
struct dm_target * ti ,
2013-03-01 22:45:47 +00:00
unsigned target_bio_nr )
2009-06-22 10:12:20 +01:00
{
2012-10-12 21:02:15 +01:00
struct dm_target_io * tio ;
struct bio * clone ;
2014-10-03 11:55:16 +00:00
clone = bio_alloc_bioset ( GFP_NOIO , 0 , ci - > md - > bs ) ;
2012-10-12 21:02:15 +01:00
tio = container_of ( clone , struct dm_target_io , clone ) ;
2009-06-22 10:12:20 +01:00
tio - > io = ci - > io ;
tio - > ti = ti ;
2013-03-01 22:45:47 +00:00
tio - > target_bio_nr = target_bio_nr ;
2009-06-22 10:12:21 +01:00
return tio ;
}
2013-03-01 22:45:47 +00:00
static void __clone_and_map_simple_bio ( struct clone_info * ci ,
struct dm_target * ti ,
2014-03-14 18:41:24 -04:00
unsigned target_bio_nr , unsigned * len )
2009-06-22 10:12:21 +01:00
{
2014-10-03 11:55:16 +00:00
struct dm_target_io * tio = alloc_tio ( ci , ti , target_bio_nr ) ;
2012-10-12 21:02:15 +01:00
struct bio * clone = & tio - > clone ;
2009-06-22 10:12:21 +01:00
2014-03-14 18:41:24 -04:00
tio - > len_ptr = len ;
2014-10-03 11:55:16 +00:00
__bio_clone_fast ( clone , ci - > bio ) ;
2013-03-01 22:45:46 +00:00
if ( len )
2014-03-14 18:41:24 -04:00
bio_setup_sector ( clone , ci - > sector , * len ) ;
2009-06-22 10:12:20 +01:00
2013-03-01 22:45:46 +00:00
__map_bio ( tio ) ;
2009-06-22 10:12:20 +01:00
}
2013-03-01 22:45:47 +00:00
static void __send_duplicate_bios ( struct clone_info * ci , struct dm_target * ti ,
2014-03-14 18:41:24 -04:00
unsigned num_bios , unsigned * len )
2010-08-12 04:14:09 +01:00
{
2013-03-01 22:45:47 +00:00
unsigned target_bio_nr ;
2010-08-12 04:14:09 +01:00
2013-03-01 22:45:47 +00:00
for ( target_bio_nr = 0 ; target_bio_nr < num_bios ; target_bio_nr + + )
2013-03-01 22:45:47 +00:00
__clone_and_map_simple_bio ( ci , ti , target_bio_nr , len ) ;
2010-08-12 04:14:09 +01:00
}
2013-03-01 22:45:47 +00:00
static int __send_empty_flush ( struct clone_info * ci )
2009-06-22 10:12:20 +01:00
{
2010-08-12 04:14:09 +01:00
unsigned target_nr = 0 ;
2009-06-22 10:12:20 +01:00
struct dm_target * ti ;
2010-09-08 18:07:01 +02:00
BUG_ON ( bio_has_data ( ci - > bio ) ) ;
2009-06-22 10:12:20 +01:00
while ( ( ti = dm_table_get_target ( ci - > map , target_nr + + ) ) )
2014-03-14 18:41:24 -04:00
__send_duplicate_bios ( ci , ti , ti - > num_flush_bios , NULL ) ;
2009-06-22 10:12:20 +01:00
return 0 ;
}
2016-03-02 12:33:03 -05:00
static int __clone_and_map_data_bio ( struct clone_info * ci , struct dm_target * ti ,
2014-03-14 18:41:24 -04:00
sector_t sector , unsigned * len )
2010-08-12 04:14:08 +01:00
{
2012-10-12 21:02:15 +01:00
struct bio * bio = ci - > bio ;
2010-08-12 04:14:08 +01:00
struct dm_target_io * tio ;
2013-03-01 22:45:49 +00:00
unsigned target_bio_nr ;
unsigned num_target_bios = 1 ;
2016-03-02 12:33:03 -05:00
int r = 0 ;
2010-08-12 04:14:08 +01:00
2013-03-01 22:45:49 +00:00
/*
* Does the target want to receive duplicate copies of the bio ?
*/
if ( bio_data_dir ( bio ) = = WRITE & & ti - > num_write_bios )
num_target_bios = ti - > num_write_bios ( ti , bio ) ;
2013-03-01 22:45:47 +00:00
2013-03-01 22:45:49 +00:00
for ( target_bio_nr = 0 ; target_bio_nr < num_target_bios ; target_bio_nr + + ) {
2014-10-03 11:55:16 +00:00
tio = alloc_tio ( ci , ti , target_bio_nr ) ;
2014-03-14 18:41:24 -04:00
tio - > len_ptr = len ;
2016-03-02 12:33:03 -05:00
r = clone_bio ( tio , bio , sector , * len ) ;
2016-04-09 12:48:18 -04:00
if ( r < 0 ) {
2016-04-11 12:05:38 -04:00
free_tio ( tio ) ;
2016-03-02 12:33:03 -05:00
break ;
2016-04-09 12:48:18 -04:00
}
2013-03-01 22:45:49 +00:00
__map_bio ( tio ) ;
}
2016-03-02 12:33:03 -05:00
return r ;
2010-08-12 04:14:08 +01:00
}
2013-03-01 22:45:47 +00:00
typedef unsigned ( * get_num_bios_fn ) ( struct dm_target * ti ) ;
2012-12-21 20:23:37 +00:00
2013-03-01 22:45:47 +00:00
static unsigned get_num_discard_bios ( struct dm_target * ti )
2012-12-21 20:23:37 +00:00
{
2013-03-01 22:45:47 +00:00
return ti - > num_discard_bios ;
2012-12-21 20:23:37 +00:00
}
2013-03-01 22:45:47 +00:00
static unsigned get_num_write_same_bios ( struct dm_target * ti )
2012-12-21 20:23:37 +00:00
{
2013-03-01 22:45:47 +00:00
return ti - > num_write_same_bios ;
2012-12-21 20:23:37 +00:00
}
typedef bool ( * is_split_required_fn ) ( struct dm_target * ti ) ;
2010-02-16 18:43:01 +00:00
2012-12-21 20:23:37 +00:00
static bool is_split_required_for_discard ( struct dm_target * ti )
{
2013-03-01 22:45:47 +00:00
return ti - > split_discard_bios ;
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in those cases.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker reaches
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
}
2013-03-01 22:45:47 +00:00
static int __send_changing_extent_only ( struct clone_info * ci ,
get_num_bios_fn get_num_bios ,
is_split_required_fn is_split_required )
2012-09-26 23:45:42 +01:00
{
2010-08-12 04:14:08 +01:00
struct dm_target * ti ;
2014-03-14 18:40:39 -04:00
unsigned len ;
2013-03-01 22:45:47 +00:00
unsigned num_bios ;
2012-09-26 23:45:42 +01:00
2010-08-12 04:14:24 +01:00
do {
ti = dm_table_find_target ( ci - > map , ci - > sector ) ;
if ( ! dm_target_is_valid ( ti ) )
return - EIO ;
2014-10-17 17:46:36 -06:00
2010-08-12 04:14:08 +01:00
/*
2012-12-21 20:23:37 +00:00
* Even though the device advertised support for this type of
* request , that does not mean every target supports it , and
2011-08-02 12:32:01 +01:00
* reconfiguration might also have changed that since the
2010-08-12 04:14:24 +01:00
* check was performed .
2010-08-12 04:14:08 +01:00
*/
2013-03-01 22:45:47 +00:00
num_bios = get_num_bios ? get_num_bios ( ti ) : 0 ;
if ( ! num_bios )
2010-08-12 04:14:24 +01:00
return - EOPNOTSUPP ;
2012-09-26 23:45:42 +01:00
2012-12-21 20:23:37 +00:00
if ( is_split_required & & ! is_split_required ( ti ) )
2014-03-14 18:40:39 -04:00
len = min ( ( sector_t ) ci - > sector_count , max_io_len_target_boundary ( ci - > sector , ti ) ) ;
2012-07-27 15:08:03 +01:00
else
2014-03-14 18:40:39 -04:00
len = min ( ( sector_t ) ci - > sector_count , max_io_len ( ci - > sector , ti ) ) ;
2015-02-24 21:58:21 -05:00
2014-03-14 18:41:24 -04:00
__send_duplicate_bios ( ci , ti , num_bios , & len ) ;
2015-06-09 17:22:49 -04:00
2010-08-12 04:14:24 +01:00
ci - > sector + = len ;
} while ( ci - > sector_count - = len ) ;
2010-08-12 04:14:08 +01:00
return 0 ;
2012-09-26 23:45:42 +01:00
}
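The do/while loop in __send_changing_extent_only() walks an extent target by target, sending one piece per iteration. A minimal user-space sketch of that loop shape follows. It assumes, purely for illustration, that every target covers the same number of sectors; sectors_to_boundary() and split_extent() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* Assumption: targets are laid out back to back, target_len sectors each. */
static sector_t sectors_to_boundary(sector_t sector, sector_t target_len)
{
	return target_len - (sector % target_len);
}

/* Record the length of each piece in pieces[]; return the piece count. */
static unsigned split_extent(sector_t sector, unsigned sector_count,
			     sector_t target_len, unsigned *pieces,
			     unsigned max_pieces)
{
	unsigned n = 0;
	unsigned len;

	assert(sector_count > 0 && target_len > 0);
	do {
		sector_t room = sectors_to_boundary(sector, target_len);

		/* len = min(sector_count, room), as in the kernel loop */
		len = sector_count < room ? sector_count : (unsigned)room;
		assert(n < max_pieces);
		pieces[n++] = len;	/* stands in for __send_duplicate_bios() */
		sector += len;
	} while (sector_count -= len);	/* stop once nothing is left */

	return n;
}
```

Starting at sector 6 with 20 sectors and 8-sector targets, the extent is cut into pieces of 2, 8, 8 and 2 sectors.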
2013-03-01 22:45:47 +00:00
static int __send_discard ( struct clone_info * ci )
2012-12-21 20:23:37 +00:00
{
2013-03-01 22:45:47 +00:00
return __send_changing_extent_only ( ci , get_num_discard_bios ,
is_split_required_for_discard ) ;
2012-12-21 20:23:37 +00:00
}
2015-02-26 00:50:28 -05:00
2013-03-01 22:45:47 +00:00
static int __send_write_same ( struct clone_info * ci )
2015-02-26 00:50:28 -05:00
{
2013-03-01 22:45:47 +00:00
return __send_changing_extent_only ( ci , get_num_write_same_bios , NULL ) ;
2015-02-26 00:50:28 -05:00
}
2013-03-01 22:45:47 +00:00
/*
* Select the correct strategy for processing a non - flush bio .
*/
2013-03-01 22:45:47 +00:00
static int __split_and_process_non_flush ( struct clone_info * ci )
2015-02-26 00:50:28 -05:00
{
2012-10-12 21:02:15 +01:00
struct bio * bio = ci - > bio ;
2007-12-13 14:15:25 +00:00
struct dm_target * ti ;
2013-10-29 17:17:49 -07:00
unsigned len ;
2016-03-02 12:33:03 -05:00
int r ;
2015-02-26 00:50:28 -05:00
2016-06-05 14:32:04 -05:00
if ( unlikely ( bio_op ( bio ) = = REQ_OP_DISCARD ) )
2013-03-01 22:45:47 +00:00
return __send_discard ( ci ) ;
2016-06-05 14:32:04 -05:00
else if ( unlikely ( bio_op ( bio ) = = REQ_OP_WRITE_SAME ) )
2013-03-01 22:45:47 +00:00
return __send_write_same ( ci ) ;
2015-02-26 00:50:28 -05:00
2007-12-13 14:15:25 +00:00
ti = dm_table_find_target ( ci - > map , ci - > sector ) ;
if ( ! dm_target_is_valid ( ti ) )
return - EIO ;
2013-10-29 17:17:49 -07:00
len = min_t ( sector_t , max_io_len ( ci - > sector , ti ) , ci - > sector_count ) ;
2015-02-26 00:50:28 -05:00
2016-03-02 12:33:03 -05:00
r = __clone_and_map_data_bio ( ci , ti , ci - > sector , & len ) ;
if ( r < 0 )
return r ;
2015-02-26 00:50:28 -05:00
2013-10-29 17:17:49 -07:00
ci - > sector + = len ;
ci - > sector_count - = len ;
2015-02-26 00:50:28 -05:00
2013-10-29 17:17:49 -07:00
return 0 ;
2015-02-26 00:50:28 -05:00
}
2005-04-16 15:20:36 -07:00
/*
2013-03-01 22:45:47 +00:00
* Entry point to split a bio into clones and submit them to the targets .
2005-04-16 15:20:36 -07:00
*/
2013-07-10 23:41:18 +01:00
static void __split_and_process_bio ( struct mapped_device * md ,
struct dm_table * map , struct bio * bio )
2015-02-26 00:50:28 -05:00
{
2005-04-16 15:20:36 -07:00
struct clone_info ci ;
2007-12-13 14:15:25 +00:00
int error = 0 ;
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
if ( unlikely ( ! map ) ) {
2010-09-08 18:07:00 +02:00
bio_io_error ( bio ) ;
2009-04-02 19:55:38 +01:00
return ;
}
2009-04-09 00:27:13 +01:00
2013-07-10 23:41:18 +01:00
ci . map = map ;
2005-04-16 15:20:36 -07:00
ci . md = md ;
ci . io = alloc_io ( md ) ;
ci . io - > error = 0 ;
atomic_set ( & ci . io - > io_count , 1 ) ;
ci . io - > bio = bio ;
ci . io - > md = md ;
2009-10-16 23:18:15 +01:00
spin_lock_init ( & ci . io - > endio_lock ) ;
2013-10-11 15:44:27 -07:00
ci . sector = bio - > bi_iter . bi_sector ;
2015-02-26 00:50:28 -05:00
2006-02-01 03:04:53 -08:00
start_io_acct ( ci . io ) ;
2015-02-26 00:50:28 -05:00
2016-06-05 14:32:25 -05:00
if ( bio - > bi_rw & REQ_PREFLUSH ) {
2010-09-08 18:07:01 +02:00
ci . bio = & ci . md - > flush_bio ;
ci . sector_count = 0 ;
2013-03-01 22:45:47 +00:00
error = __send_empty_flush ( & ci ) ;
2010-09-08 18:07:01 +02:00
/* dec_pending submits any data associated with flush */
} else {
2010-09-08 18:07:00 +02:00
ci . bio = bio ;
2010-09-03 11:56:19 +02:00
ci . sector_count = bio_sectors ( bio ) ;
2010-09-08 18:07:01 +02:00
while ( ci . sector_count & & ! error )
2013-03-01 22:45:47 +00:00
error = __split_and_process_non_flush ( & ci ) ;
2010-09-03 11:56:19 +02:00
}
2015-02-26 00:50:28 -05:00
2005-04-16 15:20:36 -07:00
/* drop the extra reference count */
2007-12-13 14:15:25 +00:00
dec_pending ( ci . io , error ) ;
2015-02-26 00:50:28 -05:00
}
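The "extra reference count" dropped at the end of __split_and_process_bio() keeps the io alive while clones are still being issued. The sketch below models that pattern with a plain counter. It is a simplification: struct io_model and its helpers are hypothetical, there is no locking or atomics, and the error handling only records the most recent non-zero error rather than reproducing dec_pending().

```c
#include <assert.h>

/* Hypothetical, single-threaded model of struct dm_io's counting. */
struct io_model {
	int io_count;	/* models io->io_count */
	int error;	/* models io->error (simplified) */
	int completed;	/* set when the original bio would be ended */
};

static void io_init(struct io_model *io)
{
	io->io_count = 1;	/* the submitter's extra reference */
	io->error = 0;
	io->completed = 0;
}

/* One clone is issued: like atomic_inc(&tio->io->io_count) in __map_bio(). */
static void io_get(struct io_model *io)
{
	io->io_count++;
}

/* One reference is dropped: a clone finished, or the submitter is done. */
static void io_put(struct io_model *io, int error)
{
	if (error)
		io->error = error;
	assert(io->io_count > 0);
	if (--io->io_count == 0)
		io->completed = 1;
}
```

Because the count starts at 1, a clone that completes before the next one is issued cannot end the original bio early; completion happens only after the submitter drops its own reference.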
2005-04-16 15:20:36 -07:00
/*-----------------------------------------------------------------
* CRUD END
*---------------------------------------------------------------*/
2015-02-26 00:50:28 -05:00
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
pref_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
On success, end_clone_bio() completes the original request by the size of
the original bio.
Even if that completes every bio in the original request, the original
request itself must not be completed yet, to preserve the ordering of
request completion for stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and hands error handling over to
end_clone_request().
end_clone_request(), which is called with the queue lock held, completes
the clone request and the original request in softirq context
(dm_softirq_done()), which runs without the queue lock, to avoid a deadlock
when another request is submitted during the completion:
- The submitted request may be mapped to the same device
- Request submission requires the queue lock, but the completing context
already holds it and the submitter doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in those cases.
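The two-stage completion described in this section can be modelled in user space. The following is only a minimal sketch under stated assumptions, not the kernel code: struct toy_request, toy_end_clone_bio() and toy_softirq_done() are hypothetical names invented for illustration, and the model tracks nothing but byte counts and the success/requeue decision from the flow summary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an original request; not a kernel type. */
struct toy_request {
	size_t total_bytes;     /* size of the whole request */
	size_t completed_bytes; /* bytes whose original bios were completed */
	bool finished;          /* set once the request itself is completed */
	bool requeued;          /* set when the request goes back to the queue */
};

/*
 * Models the per-bio hook. On success the original request advances by
 * the size of the bio but is never finished here. On error nothing is
 * completed, so the bio submitter does not see the failure.
 */
static void toy_end_clone_bio(struct toy_request *orig, size_t bio_bytes,
			      int error)
{
	assert(orig != NULL);
	if (error)
		return;
	orig->completed_bytes += bio_bytes;
}

/*
 * Models the deferred per-request step: finish the original request on
 * success, requeue it on error.
 */
static void toy_softirq_done(struct toy_request *orig, int error)
{
	assert(orig != NULL);
	if (error)
		orig->requeued = true;
	else
		orig->finished = true;
}
```

In this model a request whose bios have all completed still has finished == false until toy_softirq_done() runs, which mirrors the reason given above for using blk_update_request() rather than blk_end_request() in the bio hook.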
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker reaches
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
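The difference between flush and noflush suspend can be illustrated with a small user-space model. This is a sketch under assumptions, not the dm implementation: struct toy_queue, toy_suspend() and toy_resume() are hypothetical names, the marker request is represented only implicitly, and waiting for in-flight requests is modelled as instantaneous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of md->queue; these are not kernel fields. */
struct toy_queue {
	int pending;   /* requests still waiting on the queue */
	int in_flight; /* requests dispatched but not yet completed */
	bool stopped;  /* true while the device is suspended */
};

/* A flush suspend drains the queue first; a noflush suspend does not. */
static void toy_suspend(struct toy_queue *q, bool noflush)
{
	assert(q != NULL);
	if (!noflush) {
		/*
		 * The marker is queued behind every pending request, so
		 * all of them are dispatched before it reaches the front.
		 */
		q->in_flight += q->pending;
		q->pending = 0;
		/* Waiting for dispatched requests is modelled as instant. */
		q->in_flight = 0;
	}
	q->stopped = true;
}

/* Resume only restarts the queue. */
static void toy_resume(struct toy_queue *q)
{
	assert(q != NULL);
	q->stopped = false;
}
```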
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
/*
2005-04-16 15:20:36 -07:00
* The request function that just remaps the bio built up by
* dm_merge_bvec.
2009-06-22 10:12:35 +01:00
*/
2015-11-05 10:41:16 -07:00
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
2009-06-22 10:12:35 +01:00
{
2006-02-01 03:04:52 -08:00
int rw = bio_data_dir(bio);
2009-06-22 10:12:35 +01:00
struct mapped_device *md = q->queuedata;
2013-07-10 23:41:18 +01:00
int srcu_idx;
struct dm_table *map;
2009-06-22 10:12:35 +01:00
2013-07-10 23:41:18 +01:00
map = dm_get_live_table(md, &srcu_idx);
2010-09-08 18:07:00 +02:00
2014-11-24 11:05:26 +08:00
generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
dm: add request based barrier support
This patch adds barrier support for request-based dm.
CORE DESIGN
The design is basically the same as bio-based dm, which emulates a barrier
by mapping empty barrier bios before/after a barrier I/O.
But request-based dm has been using struct request_queue for I/O
queueing, so the block-layer's barrier mechanism can be used.
o Summary of the block-layer's behavior (which dm-core depends on)
Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
I/O barriers. It means that when an I/O requiring a barrier is found
in the request_queue, the block-layer makes a pre-flush request and
a post-flush request just before and just after the I/O respectively.
After the ordered sequence starts, the block-layer waits for all
in-flight I/Os to complete, then gives drivers the pre-flush request,
the barrier I/O and the post-flush request one by one.
It means that the request_queue is stopped automatically by
the block-layer until drivers complete each sequence.
o dm-core
The barrier I/O itself is treated as a normal I/O, so no additional
code is needed.
For the pre/post-flush request, caches are flushed as follows:
1. Make the number of empty barrier requests required by the target's
num_flush_requests, and map them (dm_rq_barrier()).
2. Wait for the mapped barriers to complete (dm_rq_barrier()).
If an error has occurred, save the error value to md->barrier_error
(dm_end_request()).
(*) Basically, the first reported error is taken.
But -EOPNOTSUPP supersedes any other error, and DM_ENDIO_REQUEUE
supersedes any error except -EOPNOTSUPP.
3. Requeue the pre/post-flush request if the error value is
DM_ENDIO_REQUEUE. Otherwise, complete it with the error value
(dm_rq_barrier_work()).
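The error precedence noted under step 2 can be written as a small pure function. This is a hedged sketch, not the code added by the patch: toy_store_barrier_error() and TOY_ENDIO_REQUEUE are hypothetical names, the constant's value is chosen for this example only, and the sketch reads the note as the order -EOPNOTSUPP, then requeue, then the first reported error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the requeue status; value is illustrative. */
#define TOY_ENDIO_REQUEUE 2

/*
 * Fold one reported result into the saved barrier error:
 *  - the first reported error is kept by default,
 *  - -EOPNOTSUPP replaces anything already saved,
 *  - a requeue replaces anything except -EOPNOTSUPP.
 */
static void toy_store_barrier_error(int *saved, int error)
{
	assert(saved != NULL);
	if (!*saved || error == -EOPNOTSUPP ||
	    (*saved != -EOPNOTSUPP && error == TOY_ENDIO_REQUEUE))
		*saved = error;
}
```

Starting from a saved value of 0, reporting -EIO and then -ENOMEM leaves -EIO saved, while a later -EOPNOTSUPP replaces it and is not displaced by a subsequent requeue. The real code would also need locking around the saved value, which this model omits.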
The pre/post-flush work above is done in the kernel thread (kdmflush)
context, since dm_rq_barrier() needs memory allocation that might sleep,
but sleeping is not allowed in dm_request_fn(), which is
an irq-disabled context.
Also, clones of the pre/post-flush request share an original, so
such clones can't be completed using the softirq context.
Instead, complete them in the context of underlying device drivers.
It should be safe since there is no I/O dispatching during
the completion of such clones.
For suspend, the workqueue of kdmflush needs to be flushed after
the request_queue has been stopped. Otherwise, the next flush work
can be kicked even after the suspend completes.
TARGET INTERFACE
No new interface is added.
Just use the existing num_flush_requests in struct target_type
in the same way as bio-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:18 +00:00
2010-09-08 18:07:00 +02:00
/* if we're suspended, we have to queue this io for later */
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
2013-07-10 23:41:18 +01:00
dm_put_live_table(md, srcu_idx);
2010-02-16 18:43:01 +00:00
2016-07-26 17:12:11 -07:00
if (!(bio->bi_rw & REQ_RAHEAD))
2010-09-08 18:07:00 +02:00
queue_io(md, bio);
else
2009-04-09 00:27:14 +01:00
bio_io_error(bio);
2015-11-05 10:41:16 -07:00
return BLK_QC_T_NONE;
2009-06-22 10:12:35 +01:00
}
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
__split_and_process_bio(md, map, bio);
dm_put_live_table(md, srcu_idx);
2015-11-05 10:41:16 -07:00
return BLK_QC_T_NONE;
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
pref_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with the queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which runs without the queue lock, to avoid a
deadlock on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires the queue lock, but the completing
context already holds it and the submitter doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
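The two-level completion described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct and function names (model_end_clone_bio(), model_softirq_done(), and so on) are hypothetical and are not the kernel's, and the error branch simply follows the summary above ("Error: Requeue the original request").

```c
#include <assert.h>

/* Hypothetical userspace model; these are not kernel types. */
enum orig_state { ORIG_PENDING, ORIG_COMPLETED, ORIG_REQUEUED };

struct orig_request {
	unsigned int bytes_left;	/* bytes not yet reported to the submitter */
	enum orig_state state;
};

struct clone_request {
	struct orig_request *orig;
	int error;			/* first error seen by a clone bio, 0 if none */
};

/* Models the end_clone_bio() step: runs once per clone bio. */
static void model_end_clone_bio(struct clone_request *clone,
				unsigned int bio_bytes, int error)
{
	if (clone->error)
		return;			/* an earlier bio failed: complete nothing more */
	if (error) {
		clone->error = error;	/* hold the error back from the submitter */
		return;
	}
	/* Partial completion only, like blk_update_request(). */
	clone->orig->bytes_left -= bio_bytes;
}

/* Models the dm_softirq_done() step reached through end_clone_request(). */
static void model_softirq_done(struct clone_request *clone)
{
	if (clone->error)
		clone->orig->state = ORIG_REQUEUED;
	else
		clone->orig->state = ORIG_COMPLETED;
}
```

In this model the original request only changes state in model_softirq_done(), never in the per-bio hook, which mirrors why blk_update_request() is used there instead of blk_end_request().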
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
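The marker-based flush suspend can be sketched with a toy queue. This is a minimal userspace model, assuming a fixed-size array in place of md->queue; every name in it (struct model_queue, MODEL_MARKER, the model_* functions) is hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 16
#define MODEL_MARKER (-1)	/* stands in for the marker request */

/* Hypothetical stand-in for md->queue. */
struct model_queue {
	int items[MODEL_QUEUE_MAX];	/* request ids, or MODEL_MARKER */
	size_t head, tail;
	bool stopped;
	unsigned int in_flight;		/* dispatched but not yet completed */
};

static bool model_enqueue(struct model_queue *q, int id)
{
	if (q->tail == MODEL_QUEUE_MAX)
		return false;
	q->items[q->tail++] = id;
	return true;
}

/* noflush suspend: stop the queue and leave queued requests in place. */
static void model_suspend_noflush(struct model_queue *q)
{
	q->stopped = true;
}

/*
 * flush suspend, step 1: append the marker, then dispatch until the
 * marker reaches the front. Returns the number of requests dispatched.
 */
static unsigned int model_suspend_flush_dispatch(struct model_queue *q)
{
	unsigned int dispatched = 0;

	if (!model_enqueue(q, MODEL_MARKER))
		return 0;
	while (q->items[q->head] != MODEL_MARKER) {
		q->head++;
		q->in_flight++;
		dispatched++;
	}
	return dispatched;
}

/*
 * flush suspend, step 2: called once per completed request. When the
 * last dispatched request finishes, complete the marker and stop the
 * queue (a real implementation would wake the waiter here).
 * Returns true once the queue is suspended.
 */
static bool model_complete_one(struct model_queue *q)
{
	if (q->in_flight > 0)
		q->in_flight--;
	if (q->in_flight == 0 && q->head < q->tail &&
	    q->items[q->head] == MODEL_MARKER) {
		q->head++;
		q->stopped = true;
		return true;
	}
	return false;
}
```

The marker gives the suspend path a fixed point in the queue: requests queued before it are drained, while the queue itself is stopped only after the last of them has completed.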
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
}
2005-04-16 15:20:36 -07:00
static int dm_any_congested ( void * congested_data , int bdi_bits )
{
2008-11-13 23:39:14 +00:00
int r = bdi_bits ;
struct mapped_device * md = congested_data ;
struct dm_table * map ;
2005-04-16 15:20:36 -07:00
2009-04-09 00:27:14 +01:00
if ( ! test_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ) {
2016-02-02 22:35:06 -05:00
if ( dm_request_based ( md ) ) {
2009-06-22 10:12:35 +01:00
/*
2016-02-02 22:35:06 -05:00
* With request - based DM we only need to check the
* top - level queue for congestion .
2009-06-22 10:12:35 +01:00
*/
2016-02-02 22:35:06 -05:00
r = md - > queue - > backing_dev_info . wb . state & bdi_bits ;
} else {
map = dm_get_live_table_fast ( md ) ;
if ( map )
2009-06-22 10:12:35 +01:00
r = dm_table_any_congested ( map , bdi_bits ) ;
2016-02-02 22:35:06 -05:00
dm_put_live_table_fast ( md ) ;
2008-11-13 23:39:14 +00:00
}
}
2005-04-16 15:20:36 -07:00
return r ;
}
/*-----------------------------------------------------------------
* An IDR is used to keep track of allocated minor numbers .
*---------------------------------------------------------------*/
2006-06-26 00:27:32 -07:00
static void free_minor ( int minor )
2005-04-16 15:20:36 -07:00
{
2006-06-26 00:27:22 -07:00
spin_lock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
idr_remove ( & _minor_idr , minor ) ;
2006-06-26 00:27:22 -07:00
spin_unlock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
}
/*
* See if the device with a specific minor # is free .
*/
2008-04-24 22:10:59 +01:00
static int specific_minor ( int minor )
2005-04-16 15:20:36 -07:00
{
2013-02-27 17:04:26 -08:00
int r ;
2005-04-16 15:20:36 -07:00
if ( minor > = ( 1 < < MINORBITS ) )
return - EINVAL ;
2013-02-27 17:04:26 -08:00
idr_preload ( GFP_KERNEL ) ;
2006-06-26 00:27:22 -07:00
spin_lock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
2013-02-27 17:04:26 -08:00
r = idr_alloc ( & _minor_idr , MINOR_ALLOCED , minor , minor + 1 , GFP_NOWAIT ) ;
2005-04-16 15:20:36 -07:00
2006-06-26 00:27:22 -07:00
spin_unlock ( & _minor_lock ) ;
2013-02-27 17:04:26 -08:00
idr_preload_end ( ) ;
if ( r < 0 )
return r = = - ENOSPC ? - EBUSY : r ;
return 0 ;
2005-04-16 15:20:36 -07:00
}
2008-04-24 22:10:59 +01:00
static int next_free_minor ( int * minor )
2005-04-16 15:20:36 -07:00
{
2013-02-27 17:04:26 -08:00
int r ;
2006-06-26 00:27:21 -07:00
2013-02-27 17:04:26 -08:00
idr_preload ( GFP_KERNEL ) ;
2006-06-26 00:27:22 -07:00
spin_lock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
2013-02-27 17:04:26 -08:00
r = idr_alloc ( & _minor_idr , MINOR_ALLOCED , 0 , 1 < < MINORBITS , GFP_NOWAIT ) ;
2005-04-16 15:20:36 -07:00
2006-06-26 00:27:22 -07:00
spin_unlock ( & _minor_lock ) ;
2013-02-27 17:04:26 -08:00
idr_preload_end ( ) ;
if ( r < 0 )
return r ;
* minor = r ;
return 0 ;
2005-04-16 15:20:36 -07:00
}
2009-09-21 17:01:13 -07:00
static const struct block_device_operations dm_blk_dops ;
2005-04-16 15:20:36 -07:00
2009-04-02 19:55:37 +01:00
static void dm_wq_work ( struct work_struct * work ) ;
2016-05-12 16:28:10 -04:00
void dm_init_md_queue ( struct mapped_device * md )
2010-08-12 04:14:02 +01:00
{
/*
* Request - based dm devices cannot be stacked on top of bio - based dm
2015-03-08 00:51:47 -05:00
* devices . The type of this dm device may not have been decided yet .
2010-08-12 04:14:02 +01:00
* The type is decided at the first table loading time .
* To prevent problematic device stacking , clear the queue flag
* for request stacking support until then .
*
* This queue is new , so no concurrency on the queue_flags .
*/
queue_flag_clear_unlocked ( QUEUE_FLAG_STACKABLE , md - > queue ) ;
2015-10-27 19:06:55 -04:00
/*
* Initialize data that will only be used by a non - blk - mq DM queue
* - must do so here ( in alloc_dev callchain ) before queue is used
*/
md - > queue - > queuedata = md ;
md - > queue - > backing_dev_info . congested_data = md ;
2015-03-08 00:51:47 -05:00
}
2010-08-12 04:14:02 +01:00
2016-05-12 16:28:10 -04:00
void dm_init_normal_md_queue ( struct mapped_device * md )
2015-03-08 00:51:47 -05:00
{
2015-03-11 15:01:09 -04:00
md - > use_blk_mq = false ;
2015-03-08 00:51:47 -05:00
dm_init_md_queue ( md ) ;
/*
* Initialize aspects of queue that aren ' t relevant for blk - mq
*/
2010-08-12 04:14:02 +01:00
md - > queue - > backing_dev_info . congested_fn = dm_any_congested ;
blk_queue_bounce_limit ( md - > queue , BLK_BOUNCE_ANY ) ;
}
2015-04-28 11:50:29 -04:00
static void cleanup_mapped_device ( struct mapped_device * md )
{
if ( md - > wq )
destroy_workqueue ( md - > wq ) ;
if ( md - > kworker_task )
kthread_stop ( md - > kworker_task ) ;
2015-09-13 14:15:05 +02:00
mempool_destroy ( md - > io_pool ) ;
mempool_destroy ( md - > rq_pool ) ;
2015-04-28 11:50:29 -04:00
if ( md - > bs )
bioset_free ( md - > bs ) ;
2015-07-10 17:21:43 -04:00
cleanup_srcu_struct ( & md - > io_barrier ) ;
2015-04-28 11:50:29 -04:00
if ( md - > disk ) {
spin_lock ( & _minor_lock ) ;
md - > disk - > private_data = NULL ;
spin_unlock ( & _minor_lock ) ;
del_gendisk ( md - > disk ) ;
put_disk ( md - > disk ) ;
}
if ( md - > queue )
blk_cleanup_queue ( md - > queue ) ;
if ( md - > bdev ) {
bdput ( md - > bdev ) ;
md - > bdev = NULL ;
}
2016-05-12 16:28:10 -04:00
dm_mq_cleanup_mapped_device ( md ) ;
2015-04-28 11:50:29 -04:00
}
2005-04-16 15:20:36 -07:00
/*
* Allocate and initialise a blank device with a given minor .
*/
2006-06-26 00:27:32 -07:00
static struct mapped_device * alloc_dev ( int minor )
2005-04-16 15:20:36 -07:00
{
2016-02-22 12:16:21 -05:00
int r , numa_node_id = dm_get_numa_node ( ) ;
struct mapped_device * md ;
2006-06-26 00:27:21 -07:00
void * old_md ;
2005-04-16 15:20:36 -07:00
2016-02-22 12:16:21 -05:00
md = kzalloc_node ( sizeof ( * md ) , GFP_KERNEL , numa_node_id ) ;
2005-04-16 15:20:36 -07:00
if ( ! md ) {
DMWARN ( " unable to allocate device, out of memory. " ) ;
return NULL ;
}
2006-06-26 00:27:25 -07:00
if ( ! try_module_get ( THIS_MODULE ) )
2008-02-08 02:10:19 +00:00
goto bad_module_get ;
2006-06-26 00:27:25 -07:00
2005-04-16 15:20:36 -07:00
/* get a minor number for the dev */
2006-06-26 00:27:32 -07:00
if ( minor = = DM_ANY_MINOR )
2008-04-24 22:10:59 +01:00
r = next_free_minor ( & minor ) ;
2006-06-26 00:27:32 -07:00
else
2008-04-24 22:10:59 +01:00
r = specific_minor ( minor ) ;
2005-04-16 15:20:36 -07:00
if ( r < 0 )
2008-02-08 02:10:19 +00:00
goto bad_minor ;
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
r = init_srcu_struct ( & md - > io_barrier ) ;
if ( r < 0 )
goto bad_io_barrier ;
2016-02-22 12:16:21 -05:00
md - > numa_node_id = numa_node_id ;
2016-05-12 16:28:10 -04:00
md - > use_blk_mq = dm_use_blk_mq_default ( ) ;
2016-01-31 12:05:42 -05:00
md - > init_tio_pdu = false ;
2010-08-12 04:14:01 +01:00
md - > type = DM_TYPE_NONE ;
2008-02-08 02:10:08 +00:00
mutex_init ( & md - > suspend_lock ) ;
2010-08-12 04:14:01 +01:00
mutex_init ( & md - > type_lock ) ;
2014-08-13 13:53:43 -05:00
mutex_init ( & md - > table_devices_lock ) ;
2009-04-02 19:55:39 +01:00
spin_lock_init ( & md - > deferred_lock ) ;
2005-04-16 15:20:36 -07:00
atomic_set ( & md - > holders , 1 ) ;
2006-06-26 00:27:34 -07:00
atomic_set ( & md - > open_count , 0 ) ;
2005-04-16 15:20:36 -07:00
atomic_set ( & md - > event_nr , 0 ) ;
2007-10-19 22:48:01 +01:00
atomic_set ( & md - > uevent_seq , 0 ) ;
INIT_LIST_HEAD ( & md - > uevent_list ) ;
2014-08-13 13:53:43 -05:00
INIT_LIST_HEAD ( & md - > table_devices ) ;
2007-10-19 22:48:01 +01:00
spin_lock_init ( & md - > uevent_lock ) ;
2005-04-16 15:20:36 -07:00
2016-02-22 12:16:21 -05:00
md - > queue = blk_alloc_queue_node ( GFP_KERNEL , numa_node_id ) ;
2005-04-16 15:20:36 -07:00
if ( ! md - > queue )
2015-04-28 11:50:29 -04:00
goto bad ;
2005-04-16 15:20:36 -07:00
2010-08-12 04:14:02 +01:00
dm_init_md_queue ( md ) ;
2006-10-03 01:15:41 -07:00
2016-02-22 12:16:21 -05:00
md - > disk = alloc_disk_node ( 1 , numa_node_id ) ;
2005-04-16 15:20:36 -07:00
if ( ! md - > disk )
2015-04-28 11:50:29 -04:00
goto bad ;
2005-04-16 15:20:36 -07:00
block: Seperate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests, and exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time was higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b.
The problem was in part_round_stats_single(); I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 20:16:55 +02:00
atomic_set ( & md - > pending [ 0 ] , 0 ) ;
atomic_set ( & md - > pending [ 1 ] , 0 ) ;
2006-06-26 00:27:25 -07:00
init_waitqueue_head ( & md - > wait ) ;
2009-04-02 19:55:37 +01:00
INIT_WORK ( & md - > work , dm_wq_work ) ;
2006-06-26 00:27:25 -07:00
init_waitqueue_head ( & md - > eventq ) ;
2014-01-13 19:37:54 -05:00
init_completion ( & md - > kobj_holder . completion ) ;
2014-10-17 17:46:36 -06:00
md - > kworker_task = NULL ;
2006-06-26 00:27:25 -07:00
2005-04-16 15:20:36 -07:00
md - > disk - > major = _major ;
md - > disk - > first_minor = minor ;
md - > disk - > fops = & dm_blk_dops ;
md - > disk - > queue = md - > queue ;
md - > disk - > private_data = md ;
sprintf ( md - > disk - > disk_name , " dm-%d " , minor ) ;
add_disk ( md - > disk ) ;
2006-03-27 01:17:52 -08:00
format_dev_t ( md - > name , MKDEV ( _major , minor ) ) ;
2005-04-16 15:20:36 -07:00
2013-07-30 08:40:21 -04:00
md - > wq = alloc_workqueue ( " kdmflush " , WQ_MEM_RECLAIM , 0 ) ;
2008-02-08 02:11:17 +00:00
if ( ! md - > wq )
2015-04-28 11:50:29 -04:00
goto bad ;
2008-02-08 02:11:17 +00:00
2009-06-22 10:12:17 +01:00
md - > bdev = bdget_disk ( md - > disk , 0 ) ;
if ( ! md - > bdev )
2015-04-28 11:50:29 -04:00
goto bad ;
2009-06-22 10:12:17 +01:00
2010-09-08 18:07:00 +02:00
bio_init ( & md - > flush_bio ) ;
md - > flush_bio . bi_bdev = md - > bdev ;
2016-06-05 14:32:04 -05:00
bio_set_op_attrs ( & md - > flush_bio , REQ_OP_WRITE , WRITE_FLUSH ) ;
2010-09-08 18:07:00 +02:00
2013-08-16 10:54:23 -04:00
dm_stats_init ( & md - > stats ) ;
2006-06-26 00:27:21 -07:00
/* Populate the mapping, nobody knows we exist yet */
2006-06-26 00:27:22 -07:00
spin_lock ( & _minor_lock ) ;
2006-06-26 00:27:21 -07:00
old_md = idr_replace ( & _minor_idr , md , minor ) ;
2006-06-26 00:27:22 -07:00
spin_unlock ( & _minor_lock ) ;
2006-06-26 00:27:21 -07:00
BUG_ON ( old_md ! = MINOR_ALLOCED ) ;
2005-04-16 15:20:36 -07:00
return md ;
2015-04-28 11:50:29 -04:00
bad :
cleanup_mapped_device ( md ) ;
2013-07-10 23:41:18 +01:00
bad_io_barrier :
2005-04-16 15:20:36 -07:00
free_minor ( minor ) ;
2008-02-08 02:10:19 +00:00
bad_minor :
2006-06-26 00:27:25 -07:00
module_put ( THIS_MODULE ) ;
2008-02-08 02:10:19 +00:00
bad_module_get :
2005-04-16 15:20:36 -07:00
kfree ( md ) ;
return NULL ;
}
2007-10-19 22:38:43 +01:00
static void unlock_fs ( struct mapped_device * md ) ;
2005-04-16 15:20:36 -07:00
static void free_dev ( struct mapped_device * md )
{
2008-09-03 09:01:48 +02:00
int minor = MINOR ( disk_devt ( md - > disk ) ) ;
2006-02-24 13:04:25 -08:00
2009-06-22 10:12:17 +01:00
unlock_fs ( md ) ;
2014-10-17 17:46:36 -06:00
2015-04-28 11:50:29 -04:00
cleanup_mapped_device ( md ) ;
2015-03-23 17:01:43 -04:00
2014-08-13 13:53:43 -05:00
free_table_devices ( & md - > table_devices ) ;
2015-03-23 17:01:43 -04:00
dm_stats_cleanup ( & md - > stats ) ;
free_minor ( minor ) ;
2006-06-26 00:27:25 -07:00
module_put ( THIS_MODULE ) ;
2005-04-16 15:20:36 -07:00
kfree ( md ) ;
}
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since some target
drivers are better suited to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since a bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based devices.
                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK               OK
    request-based     --               OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, it can't be changed.
o Mempool allocations are deferred until table loading time, since
mempools for request-based dm are different from those for bio-based
dm and the required mempool type is determined by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
requires all target drivers to support the merge() function.
Both will take time.
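The stacking table and the "type is decided once" rule above can be sketched as two small predicates. This is a userspace illustration only; the enum and the function names model_can_stack() and model_bind_table() are hypothetical and do not exist in dm.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device types for this sketch. */
enum model_md_type {
	MODEL_TYPE_NONE,		/* no table bound yet */
	MODEL_TYPE_BIO_BASED,
	MODEL_TYPE_REQUEST_BASED,
};

/*
 * Stacking rule from the table: only a request-based device on top of
 * a bio-based underlying device is rejected, because the bio-based
 * device can't receive struct request.
 */
static bool model_can_stack(enum model_md_type upper,
			    enum model_md_type underlying)
{
	return !(upper == MODEL_TYPE_REQUEST_BASED &&
		 underlying == MODEL_TYPE_BIO_BASED);
}

/*
 * The type is fixed by the first table bound to the device; a later
 * table of a different type is refused. Returns 0 on success, -1 on a
 * type mismatch.
 */
static int model_bind_table(enum model_md_type *md_type,
			    enum model_md_type table_type)
{
	if (*md_type == MODEL_TYPE_NONE) {
		*md_type = table_type;
		return 0;
	}
	return *md_type == table_type ? 0 : -1;
}
```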
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
static void __bind_mempools ( struct mapped_device * md , struct dm_table * t )
{
2012-12-21 20:23:38 +00:00
struct dm_md_mempools * p = dm_table_get_md_mempools ( t ) ;
2009-06-22 10:12:36 +01:00
2015-06-26 09:42:57 -04:00
if ( md - > bs ) {
/* The md already has necessary mempools. */
2016-06-22 17:54:53 -06:00
if ( dm_table_bio_based ( t ) ) {
2013-03-01 22:45:44 +00:00
/*
* Reload bioset because front_pad may have changed
* because a different table was loaded .
*/
bioset_free ( md - > bs ) ;
md - > bs = p - > bs ;
p - > bs = NULL ;
}
2015-06-26 09:42:57 -04:00
/*
* There ' s no need to reload with request - based dm
* because the size of front_pad doesn ' t change .
* Note for future : If you are to reload bioset ,
* prep - ed requests in the queue may refer
* to bio from the old bioset , so you must walk
* through the queue to unprep .
*/
goto out ;
2012-12-21 20:23:38 +00:00
}
2009-06-22 10:12:36 +01:00
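The stacking rule from the commit message above can be written as a single predicate. This is a minimal userspace sketch, not kernel code: the enum and the function name `demo_can_stack` are hypothetical and do not correspond to the kernel's DM_TYPE_* constants or to any dm API.

```c
#include <stdbool.h>

/* Hypothetical device kinds for this sketch; not the kernel's DM_TYPE_* values. */
enum demo_dev_kind {
	DEMO_BIO_BASED,
	DEMO_REQUEST_BASED,
};

/*
 * Return true if a dm device of kind 'upper' may be stacked on an
 * underlying device of kind 'lower'. Per the commit message, the only
 * refused combination is request-based on top of bio-based, because a
 * bio-based device cannot receive struct request.
 */
static bool demo_can_stack(enum demo_dev_kind upper, enum demo_dev_kind lower)
{
	return !(upper == DEMO_REQUEST_BASED && lower == DEMO_BIO_BASED);
}
```

The kernel itself recognizes the device type from a queue flag, as the message notes; the sketch only encodes the resulting compatibility matrix.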
2015-04-27 16:37:50 -04:00
BUG_ON ( ! p | | md - > io_pool | | md - > rq_pool | | md - > bs ) ;
2009-06-22 10:12:36 +01:00
md - > io_pool = p - > io_pool ;
p - > io_pool = NULL ;
2014-12-05 17:11:05 -05:00
md - > rq_pool = p - > rq_pool ;
p - > rq_pool = NULL ;
2009-06-22 10:12:36 +01:00
md - > bs = p - > bs ;
p - > bs = NULL ;
2015-06-26 09:42:57 -04:00
2009-06-22 10:12:36 +01:00
out :
2015-03-10 23:49:26 -04:00
/* mempool bind completed, no longer need any mempools in the table */
2009-06-22 10:12:36 +01:00
dm_table_free_md_mempools ( t ) ;
}
2005-04-16 15:20:36 -07:00
/*
* Bind a table to the device .
*/
static void event_callback ( void * context )
{
2007-10-19 22:48:01 +01:00
unsigned long flags ;
LIST_HEAD ( uevents ) ;
2005-04-16 15:20:36 -07:00
struct mapped_device * md = ( struct mapped_device * ) context ;
2007-10-19 22:48:01 +01:00
spin_lock_irqsave ( & md - > uevent_lock , flags ) ;
list_splice_init ( & md - > uevent_list , & uevents ) ;
spin_unlock_irqrestore ( & md - > uevent_lock , flags ) ;
2008-08-25 19:56:05 +09:00
dm_send_uevents ( & uevents , & disk_to_dev ( md - > disk ) - > kobj ) ;
2007-10-19 22:48:01 +01:00
2005-04-16 15:20:36 -07:00
atomic_inc ( & md - > event_nr ) ;
wake_up ( & md - > eventq ) ;
}
2011-01-13 19:53:46 +00:00
/*
* Protected by md - > suspend_lock obtained by dm_swap_table ( ) .
*/
2005-07-28 21:15:59 -07:00
static void __set_size ( struct mapped_device * md , sector_t size )
2005-04-16 15:20:36 -07:00
{
2005-07-28 21:15:59 -07:00
set_capacity ( md - > disk , size ) ;
2005-04-16 15:20:36 -07:00
2009-06-22 10:12:15 +01:00
i_size_write ( md - > bdev - > bd_inode , ( loff_t ) size < < SECTOR_SHIFT ) ;
2005-04-16 15:20:36 -07:00
}
2009-12-10 23:52:24 +00:00
/*
* Returns old map , which caller must destroy .
*/
static struct dm_table * __bind ( struct mapped_device * md , struct dm_table * t ,
struct queue_limits * limits )
2005-04-16 15:20:36 -07:00
{
2009-12-10 23:52:24 +00:00
struct dm_table * old_map ;
2007-07-24 09:28:11 +02:00
struct request_queue * q = md - > queue ;
2005-04-16 15:20:36 -07:00
sector_t size ;
size = dm_table_get_size ( t ) ;
2006-03-27 01:17:54 -08:00
/*
* Wipe any geometry if the size of the table changed .
*/
2013-08-16 10:54:23 -04:00
if ( size ! = dm_get_size ( md ) )
2006-03-27 01:17:54 -08:00
memset ( & md - > geometry , 0 , sizeof ( md - > geometry ) ) ;
2009-06-22 10:12:17 +01:00
__set_size ( md , size ) ;
dm table: rework reference counting
Rework table reference counting.
The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.
This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
same process.
* stale reference from some other kernel code keeps the table
constructed, which keeps some devices open, even after successful
return from "dmsetup remove". This can confuse lvm and prevent closing
of underlying devices or reusing device minor numbers.
The patch changes reference counting so that the table destructor can be
called only at predetermined places.
The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders. A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.
Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.
When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor. We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't performance-critical task: the user doesn't
care if it takes one tick more or not.
This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.
Finally remove the temporary protection added to dm_any_congested().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:10 +00:00
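The reference-counting scheme described in the commit message above can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct demo_table` and every `demo_*` function are hypothetical names, a plain `int` stands in for the kernel's atomic counter, and the `relax` callback stands in for `msleep(1)` so the example can run single-threaded.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct dm_table; only what the sketch needs. */
struct demo_table {
	int holders;     /* temporary references only; the owner is not counted */
	bool destroyed;  /* set once the destructor has run */
};

/* Take a temporary reference, as an upcall from other code would. */
static void demo_table_get(struct demo_table *t)
{
	t->holders++;
}

/* Drop a temporary reference. This never runs the destructor. */
static void demo_table_put(struct demo_table *t)
{
	t->holders--;
}

/*
 * Destroy the table at a predetermined point: poll until no temporary
 * references remain, then run the destructor. 'relax' models one
 * msleep(1) interval. Returns the number of polls that were needed.
 */
static int demo_table_destroy(struct demo_table *t,
			      void (*relax)(struct demo_table *))
{
	int polls = 0;

	while (t->holders > 0) {
		relax(t);
		polls++;
	}
	t->destroyed = true;
	return polls;
}

/* Models a concurrent holder finishing during one sleep interval. */
static void demo_holder_finishes(struct demo_table *t)
{
	demo_table_put(t);
}
```

The point of the design is visible in `demo_table_put()`: because dropping a reference cannot trigger destruction, it is safe to call from contexts that must not sleep, and the destructor runs only where `demo_table_destroy()` is called.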
2005-07-28 21:16:00 -07:00
dm_table_event_callback ( t , event_callback , md ) ;
2009-06-22 10:12:36 +01:00
/*
* The queue hasn ' t been stopped yet , if the old table type wasn ' t
* for request - based during suspension . So stop it to prevent
* I / O mapping before resume .
* This must be done before setting the queue restrictions ,
* because request - based dm may be run just after the setting .
*/
2016-01-31 17:22:27 -05:00
if ( dm_table_request_based ( t ) ) {
2016-02-20 13:45:38 -05:00
dm_stop_queue ( q ) ;
2016-01-31 17:22:27 -05:00
/*
* Leverage the fact that request - based DM targets are
* immutable singletons and establish md - > immutable_target
* - used to optimize both dm_request_fn and dm_mq_queue_rq
*/
md - > immutable_target = dm_table_get_immutable_target ( t ) ;
}
2009-06-22 10:12:36 +01:00
__bind_mempools ( md , t ) ;
2014-11-23 09:34:29 -08:00
old_map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2016-02-22 14:14:24 -05:00
rcu_assign_pointer ( md - > map , ( void * ) t ) ;
2011-10-31 20:19:04 +00:00
md - > immutable_target_type = dm_table_get_immutable_target_type ( t ) ;
2009-06-22 10:12:34 +01:00
dm_table_set_restrictions ( t , q , limits ) ;
2014-11-05 14:35:50 +01:00
if ( old_map )
dm_sync_table ( md ) ;
2005-04-16 15:20:36 -07:00
2009-12-10 23:52:24 +00:00
return old_map ;
2005-04-16 15:20:36 -07:00
}
2009-12-10 23:52:23 +00:00
/*
* Returns unbound table for the caller to free .
*/
static struct dm_table * __unbind ( struct mapped_device * md )
2005-04-16 15:20:36 -07:00
{
2014-11-23 09:34:29 -08:00
struct dm_table * map = rcu_dereference_protected ( md - > map , 1 ) ;
2005-04-16 15:20:36 -07:00
if ( ! map )
2009-12-10 23:52:23 +00:00
return NULL ;
2005-04-16 15:20:36 -07:00
dm_table_event_callback ( map , NULL , NULL ) ;
2014-03-23 23:58:27 +05:30
RCU_INIT_POINTER ( md - > map , NULL ) ;
2013-07-10 23:41:18 +01:00
dm_sync_table ( md ) ;
2009-12-10 23:52:23 +00:00
return map ;
2005-04-16 15:20:36 -07:00
}
/*
* Constructor for a new device .
*/
2006-06-26 00:27:32 -07:00
int dm_create ( int minor , struct mapped_device * * result )
2005-04-16 15:20:36 -07:00
{
struct mapped_device * md ;
2006-06-26 00:27:32 -07:00
md = alloc_dev ( minor ) ;
2005-04-16 15:20:36 -07:00
if ( ! md )
return - ENXIO ;
2009-01-06 03:05:12 +00:00
dm_sysfs_init ( md ) ;
2005-04-16 15:20:36 -07:00
* result = md ;
return 0 ;
}
2010-08-12 04:14:01 +01:00
/*
* Functions to manage md - > type .
* All are required to hold md - > type_lock .
*/
void dm_lock_md_type ( struct mapped_device * md )
{
mutex_lock ( & md - > type_lock ) ;
}
void dm_unlock_md_type ( struct mapped_device * md )
{
mutex_unlock ( & md - > type_lock ) ;
}
void dm_set_md_type ( struct mapped_device * md , unsigned type )
{
2013-08-27 18:57:03 -04:00
BUG_ON ( ! mutex_is_locked ( & md - > type_lock ) ) ;
2010-08-12 04:14:01 +01:00
md - > type = type ;
}
unsigned dm_get_md_type ( struct mapped_device * md )
{
return md - > type ;
}
2011-10-31 20:19:04 +00:00
struct target_type * dm_get_immutable_target_type ( struct mapped_device * md )
{
return md - > immutable_target_type ;
}
dm mpath: disable WRITE SAME if it fails
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.
The WRITE SAME heuristics, with both the original commit 5db44863b6eb
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).
When a device is stacked on top of such a SCSI device, any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled. This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.
This fix doesn't help configurations that have additional devices
stacked on top of the mpath device (e.g. LVM created linear DM devices
on top). A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.
Before this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920
After this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0
It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # 3.10+
2013-09-19 12:13:58 -04:00
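The idea in the commit message above, clearing the stacked device's WRITE SAME limit once the underlying device has cleared its own, can be modelled in a few lines. This is a hedged userspace sketch: `struct demo_limits` is a hypothetical one-field reduction of `struct queue_limits`, and `demo_sync_write_same` is an invented name, not the function the patch adds.

```c
/* Hypothetical, reduced version of struct queue_limits. */
struct demo_limits {
	unsigned int max_write_same_sectors;
};

/*
 * Called after a WRITE SAME failure: if the underlying device has
 * already disabled WRITE SAME (limit == 0), disable it on the stacked
 * device as well. Returns 1 if the stacked limit changed, else 0.
 */
static int demo_sync_write_same(struct demo_limits *stacked,
				const struct demo_limits *underlying)
{
	if (underlying->max_write_same_sectors == 0 &&
	    stacked->max_write_same_sectors != 0) {
		stacked->max_write_same_sectors = 0;
		return 1;
	}
	return 0;
}
```

As the message itself points out, this handles only one level of stacking; devices layered above the stacked device would keep their old limit.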
/*
* The queue_limits are only valid as long as you have a reference
* count on ' md ' .
*/
struct queue_limits * dm_get_queue_limits ( struct mapped_device * md )
{
BUG_ON ( ! atomic_read ( & md - > holders ) ) ;
return & md - > queue - > limits ;
}
EXPORT_SYMBOL_GPL ( dm_get_queue_limits ) ;
2010-08-12 04:14:02 +01:00
/*
* Setup the DM device ' s queue based on md ' s type
*/
2016-01-31 12:05:42 -05:00
int dm_setup_md_queue ( struct mapped_device * md , struct dm_table * t )
2010-08-12 04:14:02 +01:00
{
2015-03-08 00:51:47 -05:00
int r ;
2016-06-22 17:54:53 -06:00
unsigned type = dm_get_md_type ( md ) ;
2015-03-08 00:51:47 -05:00
2016-06-22 17:54:53 -06:00
switch ( type ) {
2015-03-08 00:51:47 -05:00
case DM_TYPE_REQUEST_BASED :
2016-02-20 13:45:38 -05:00
r = dm_old_init_request_queue ( md ) ;
2015-03-08 00:51:47 -05:00
if ( r ) {
2016-02-20 13:45:38 -05:00
DMERR ( " Cannot initialize queue for request-based mapped device " ) ;
2015-03-08 00:51:47 -05:00
return r ;
2015-02-23 17:56:37 -05:00
}
2015-03-08 00:51:47 -05:00
break ;
case DM_TYPE_MQ_REQUEST_BASED :
2016-05-24 21:16:51 -04:00
r = dm_mq_init_request_queue ( md , t ) ;
2015-03-08 00:51:47 -05:00
if ( r ) {
2016-02-20 13:45:38 -05:00
DMERR ( " Cannot initialize queue for request-based dm-mq mapped device " ) ;
2015-03-08 00:51:47 -05:00
return r ;
}
break ;
case DM_TYPE_BIO_BASED :
2016-06-22 17:54:53 -06:00
case DM_TYPE_DAX_BIO_BASED :
2016-02-20 13:45:38 -05:00
dm_init_normal_md_queue ( md ) ;
2015-02-23 17:56:37 -05:00
blk_queue_make_request ( md - > queue , dm_make_request ) ;
2015-10-21 16:34:20 -04:00
/*
* DM handles splitting bios as needed . Free the bio_split bioset
* since it won ' t be used ( saves 1 process per bio - based DM device ) .
*/
bioset_free ( md - > queue - > bio_split ) ;
md - > queue - > bio_split = NULL ;
2016-06-22 17:54:53 -06:00
if ( type = = DM_TYPE_DAX_BIO_BASED )
queue_flag_set_unlocked ( QUEUE_FLAG_DAX , md - > queue ) ;
2015-03-08 00:51:47 -05:00
break ;
2010-08-12 04:14:02 +01:00
}
return 0 ;
}
dm: fix a race condition in dm_get_md
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-17 14:30:53 -05:00
struct mapped_device * dm_get_md ( dev_t dev )
2005-04-16 15:20:36 -07:00
{
struct mapped_device * md ;
unsigned minor = MINOR ( dev ) ;
if ( MAJOR ( dev ) ! = _major | | minor > = ( 1 < < MINORBITS ) )
return NULL ;
2006-06-26 00:27:22 -07:00
spin_lock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
md = idr_find ( & _minor_idr , minor ) ;
2015-02-17 14:30:53 -05:00
if ( md ) {
if ( ( md = = MINOR_ALLOCED | |
( MINOR ( disk_devt ( dm_disk ( md ) ) ) ! = minor ) | |
dm_deleting_md ( md ) | |
test_bit ( DMF_FREEING , & md - > flags ) ) ) {
md = NULL ;
goto out ;
}
dm_get ( md ) ;
2006-06-26 00:27:23 -07:00
}
2005-04-16 15:20:36 -07:00
2006-06-26 00:27:23 -07:00
out :
2006-06-26 00:27:22 -07:00
spin_unlock ( & _minor_lock ) ;
2005-04-16 15:20:36 -07:00
2006-01-06 00:20:00 -08:00
return md ;
}
2011-10-31 20:19:06 +00:00
EXPORT_SYMBOL_GPL ( dm_get_md ) ;
2006-01-06 00:20:01 -08:00
2006-03-27 01:17:53 -08:00
void * dm_get_mdptr ( struct mapped_device * md )
2006-01-06 00:20:00 -08:00
{
2006-03-27 01:17:53 -08:00
return md - > interface_ptr ;
2005-04-16 15:20:36 -07:00
}
void dm_set_mdptr ( struct mapped_device * md , void * ptr )
{
md - > interface_ptr = ptr ;
}
void dm_get ( struct mapped_device * md )
{
atomic_inc ( & md - > holders ) ;
2010-08-12 04:13:56 +01:00
BUG_ON ( test_bit ( DMF_FREEING , & md - > flags ) ) ;
2005-04-16 15:20:36 -07:00
}
dm snapshot: suspend merging snapshot when doing exception handover
The "dm snapshot: suspend origin when doing exception handover" commit
fixed an exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.
However, a similar problem exists in snapshot merging. When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin". Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.
To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.
Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.
In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear). Then we release _origins_lock, suspend the
device and grab _origins_lock again.
NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-26 11:41:28 -05:00
int dm_hold ( struct mapped_device * md )
{
spin_lock ( & _minor_lock ) ;
if ( test_bit ( DMF_FREEING , & md - > flags ) ) {
spin_unlock ( & _minor_lock ) ;
return - EBUSY ;
}
dm_get ( md ) ;
spin_unlock ( & _minor_lock ) ;
return 0 ;
}
EXPORT_SYMBOL_GPL ( dm_hold ) ;
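The difference between dm_get() and dm_hold() is easiest to see in a reduced model. The following is a single-threaded userspace sketch, not the kernel implementation: `struct demo_device` and `demo_hold` are hypothetical, the `freeing` field stands in for the DMF_FREEING flag, and the `_minor_lock` that the real function takes around the check is omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mapped_device. */
struct demo_device {
	int holders;   /* reference count */
	bool freeing;  /* models the DMF_FREEING flag */
};

/*
 * Models dm_hold(): take a reference unless the device is being freed,
 * in which case report -EBUSY instead of crashing the way dm_get()'s
 * BUG_ON would. The kernel performs the check and the increment under
 * _minor_lock; this sketch has no concurrency, so no lock is taken.
 */
static int demo_hold(struct demo_device *d)
{
	if (d->freeing)
		return -EBUSY;
	d->holders++;
	return 0;
}
```

A caller that cannot rule out a concurrent teardown checks the return value and backs off on -EBUSY, leaving the reference count untouched.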
2006-06-26 00:27:35 -07:00
const char * dm_device_name ( struct mapped_device * md )
{
return md - > name ;
}
EXPORT_SYMBOL_GPL ( dm_device_name ) ;
dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed there is a case to crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
CPU0 CPU1
=================================================================
<<INTERRUPT>>
blk_end_request_all(clone_rq)
blk_update_request(clone_rq)
bio_endio(clone_bio) == end_clone_bio
blk_update_request(orig_rq)
bio_endio(orig_bio)
<<I/O completed>>
dm_blk_close()
dev_remove()
dm_put(md)
<<Free md>>
blk_finish_request(clone_rq)
....
dm_end_request(clone_rq)
free_rq_clone(clone_rq)
blk_end_request_all(orig_rq)
rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that actually delete
the mapped_device in ioctl (process) context, and changes dm_put()
to just decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't a performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.
For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:56 +01:00
static void __dm_destroy ( struct mapped_device * md , bool wait )
2005-04-16 15:20:36 -07:00
{
2006-03-27 01:17:54 -08:00
struct dm_table * map ;
2013-07-10 23:41:18 +01:00
int srcu_idx ;
2005-04-16 15:20:36 -07:00
2010-08-12 04:13:56 +01:00
might_sleep ( ) ;
2006-06-26 00:27:23 -07:00
2015-03-23 17:01:43 -04:00
spin_lock ( & _minor_lock ) ;
2010-08-12 04:13:56 +01:00
idr_replace ( & _minor_idr , MINOR_ALLOCED , MINOR ( disk_devt ( dm_disk ( md ) ) ) ) ;
set_bit ( DMF_FREEING , & md - > flags ) ;
spin_unlock ( & _minor_lock ) ;
2015-03-10 23:49:26 -04:00
if ( dm_request_based ( md ) & & md - > kworker_task )
2014-10-17 17:46:36 -06:00
flush_kthread_worker ( & md - > kworker ) ;
2015-02-27 14:04:27 -05:00
/*
* Take suspend_lock so that presuspend and postsuspend methods
* do not race with internal suspend .
*/
mutex_lock ( & md - > suspend_lock ) ;
2015-10-01 08:31:51 +00:00
map = dm_get_live_table ( md , & srcu_idx ) ;
2010-08-12 04:13:56 +01:00
if ( ! dm_suspended_md ( md ) ) {
dm_table_presuspend_targets ( map ) ;
dm_table_postsuspend_targets ( map ) ;
2005-04-16 15:20:36 -07:00
}
2013-07-10 23:41:18 +01:00
/* dm_put_live_table must be before msleep, otherwise deadlock is possible */
dm_put_live_table ( md , srcu_idx ) ;
2015-10-01 08:31:51 +00:00
mutex_unlock ( & md - > suspend_lock ) ;
2013-07-10 23:41:18 +01:00
2010-08-12 04:13:56 +01:00
/*
* Rare , but there may be I / O requests still going to complete ,
* for example . Wait for all references to disappear .
* No one should increment the reference count of the mapped_device ,
* after the mapped_device state becomes DMF_FREEING .
*/
if ( wait )
while ( atomic_read ( & md - > holders ) )
msleep ( 1 ) ;
else if ( atomic_read ( & md - > holders ) )
DMWARN ( " %s: Forcibly removing mapped_device still in use! (%d users) " ,
dm_device_name ( md ) , atomic_read ( & md - > holders ) ) ;
dm_sysfs_exit ( md ) ;
dm_table_destroy ( __unbind ( md ) ) ;
free_dev ( md ) ;
}
void dm_destroy ( struct mapped_device * md )
{
__dm_destroy ( md , true ) ;
}
void dm_destroy_immediate ( struct mapped_device * md )
{
__dm_destroy ( md , false ) ;
}
void dm_put ( struct mapped_device * md )
{
atomic_dec ( & md - > holders ) ;
2005-04-16 15:20:36 -07:00
}
2007-05-09 02:32:56 -07:00
EXPORT_SYMBOL_GPL ( dm_put ) ;
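The create/destroy/get/put model described in the "dm: separate device deletion from dm_put" commit message can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: C11 atomics stand in for the kernel's atomic_t, nanosleep() stands in for msleep(1), and every toy_ name is hypothetical rather than part of device-mapper. The kernel additionally takes _minor_lock around the DMF_FREEING check, which this single-threaded sketch omits.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for struct mapped_device. */
struct toy_dev {
    atomic_int holders;    /* outstanding references */
    atomic_bool freeing;   /* plays the role of DMF_FREEING */
};

struct toy_dev *toy_create(void)
{
    struct toy_dev *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    atomic_init(&d->holders, 0);
    atomic_init(&d->freeing, false);
    return d;
}

/* Take a reference unless teardown has started (compare dm_hold()). */
int toy_hold(struct toy_dev *d)
{
    if (atomic_load(&d->freeing))
        return -1;
    atomic_fetch_add(&d->holders, 1);
    return 0;
}

/* Only decrements; it never frees, so it cannot sleep. */
void toy_put(struct toy_dev *d)
{
    atomic_fetch_sub(&d->holders, 1);
}

/*
 * Mark the device as freeing, optionally poll until all references
 * are gone, then free it. Returns the number of references that were
 * still held at the moment of freeing (always 0 when wait is true).
 */
int toy_destroy(struct toy_dev *d, bool wait)
{
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
    int left;

    assert(d != NULL);
    atomic_store(&d->freeing, true);
    if (wait)
        while (atomic_load(&d->holders) != 0)
            nanosleep(&one_ms, NULL);

    left = atomic_load(&d->holders);
    if (left != 0)
        fprintf(stderr, "forcibly removing device still in use (%d users)\n", left);
    free(d);
    return left;
}
```

The point of the split is that toy_put() is safe to call from a context that must not sleep, while all sleeping and freeing is confined to toy_destroy().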
2005-04-16 15:20:36 -07:00
2009-04-02 19:55:38 +01:00
static int dm_wait_for_completion ( struct mapped_device * md , int interruptible )
2008-02-08 02:10:30 +00:00
{
int r = 0 ;
2009-04-02 19:55:39 +01:00
DECLARE_WAITQUEUE ( wait , current ) ;
add_wait_queue ( & md - > wait , & wait ) ;
2008-02-08 02:10:30 +00:00
while ( 1 ) {
2009-04-02 19:55:38 +01:00
set_current_state ( interruptible ) ;
2008-02-08 02:10:30 +00:00
2009-12-10 23:52:16 +00:00
if ( ! md_in_flight ( md ) )
2008-02-08 02:10:30 +00:00
break ;
2009-04-02 19:55:38 +01:00
if ( interruptible = = TASK_INTERRUPTIBLE & &
signal_pending ( current ) ) {
2008-02-08 02:10:30 +00:00
r = - EINTR ;
break ;
}
io_schedule ( ) ;
}
set_current_state ( TASK_RUNNING ) ;
2009-04-02 19:55:39 +01:00
remove_wait_queue ( & md - > wait , & wait ) ;
2008-02-08 02:10:30 +00:00
return r ;
}
2005-04-16 15:20:36 -07:00
/*
* Process the deferred bios
*/
2009-04-02 19:55:38 +01:00
static void dm_wq_work ( struct work_struct * work )
2005-04-16 15:20:36 -07:00
{
2009-04-02 19:55:38 +01:00
struct mapped_device * md = container_of ( work , struct mapped_device ,
work ) ;
2008-02-08 02:10:22 +00:00
struct bio * c ;
2013-07-10 23:41:18 +01:00
int srcu_idx ;
struct dm_table * map ;
2005-04-16 15:20:36 -07:00
2013-07-10 23:41:18 +01:00
map = dm_get_live_table ( md , & srcu_idx ) ;
2009-04-02 19:55:38 +01:00
2009-04-09 00:27:15 +01:00
while ( ! test_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ) {
2009-04-09 00:27:13 +01:00
spin_lock_irq ( & md - > deferred_lock ) ;
c = bio_list_pop ( & md - > deferred ) ;
spin_unlock_irq ( & md - > deferred_lock ) ;
2010-09-08 18:07:00 +02:00
if ( ! c )
2009-04-09 00:27:13 +01:00
break ;
2009-04-02 19:55:39 +01:00
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.
                   type of underlying device
                   bio-based      request-based
----------------------------------------------
bio-based             OK                OK
request-based         --                OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, the type can't be changed.
o Mempool allocations are deferred to at the table loading time, since
mempools for request-based dm are different from those for bio-based
dm and needed mempool type is fixed by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
requires all target drivers to support the merge() function.
Both will take time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
if ( dm_request_based ( md ) )
generic_make_request ( c ) ;
2010-09-08 18:07:00 +02:00
else
2013-07-10 23:41:18 +01:00
__split_and_process_bio ( md , map , c ) ;
2009-04-02 19:55:39 +01:00
}
2008-02-08 02:10:25 +00:00
2013-07-10 23:41:18 +01:00
dm_put_live_table ( md , srcu_idx ) ;
2005-04-16 15:20:36 -07:00
}
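The stacking table in the "dm: enable request based option" commit message reduces to a single rule: a request-based dm device cannot sit on top of a bio-based device, and every other combination is allowed. A minimal sketch of that rule, using hypothetical toy_ names that do not exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device types; not the kernel's own constants. */
enum toy_dm_type { TOY_BIO_BASED, TOY_REQUEST_BASED };

/*
 * Return true if a dm device of type 'upper' may be stacked on an
 * underlying device of type 'lower'. A bio-based device cannot
 * receive a struct request, so request-on-bio is the one bad cell.
 */
bool toy_can_stack(enum toy_dm_type upper, enum toy_dm_type lower)
{
    return !(upper == TOY_REQUEST_BASED && lower == TOY_BIO_BASED);
}

/* Checks all four cells of the table from the commit message. */
void toy_check_stacking_table(void)
{
    assert(toy_can_stack(TOY_BIO_BASED, TOY_BIO_BASED));
    assert(toy_can_stack(TOY_BIO_BASED, TOY_REQUEST_BASED));
    assert(!toy_can_stack(TOY_REQUEST_BASED, TOY_BIO_BASED));
    assert(toy_can_stack(TOY_REQUEST_BASED, TOY_REQUEST_BASED));
}
```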
2009-04-02 19:55:36 +01:00
static void dm_queue_flush ( struct mapped_device * md )
2008-02-08 02:11:17 +00:00
{
2009-04-09 00:27:15 +01:00
clear_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ;
2014-03-17 18:06:10 +01:00
smp_mb__after_atomic ( ) ;
2009-04-02 19:55:37 +01:00
queue_work ( md - > wq , & md - > work ) ;
2008-02-08 02:11:17 +00:00
}
2005-04-16 15:20:36 -07:00
/*
2009-12-10 23:52:24 +00:00
* Swap in a new table , returning the old one for the caller to destroy .
2005-04-16 15:20:36 -07:00
*/
2009-12-10 23:52:24 +00:00
struct dm_table * dm_swap_table ( struct mapped_device * md , struct dm_table * table )
2005-04-16 15:20:36 -07:00
{
2013-03-01 22:45:48 +00:00
struct dm_table * live_map = NULL , * map = ERR_PTR ( - EINVAL ) ;
2009-06-22 10:12:34 +01:00
struct queue_limits limits ;
2009-12-10 23:52:24 +00:00
int r ;
2005-04-16 15:20:36 -07:00
2008-02-08 02:10:08 +00:00
mutex_lock ( & md - > suspend_lock ) ;
2005-04-16 15:20:36 -07:00
/* device must be suspended */
2009-12-10 23:52:26 +00:00
if ( ! dm_suspended_md ( md ) )
2005-07-12 15:53:05 -07:00
goto out ;
2005-04-16 15:20:36 -07:00
2012-09-26 23:45:45 +01:00
/*
* If the new table has no data devices , retain the existing limits .
* This helps multipath with queue_if_no_path if all paths disappear ,
* then new I / O is queued based on these limits , and then some paths
* reappear .
*/
if ( dm_table_has_no_data_devices ( table ) ) {
2013-07-10 23:41:18 +01:00
live_map = dm_get_live_table_fast ( md ) ;
2012-09-26 23:45:45 +01:00
if ( live_map )
limits = md - > queue - > limits ;
2013-07-10 23:41:18 +01:00
dm_put_live_table_fast ( md ) ;
2012-09-26 23:45:45 +01:00
}
2013-03-01 22:45:48 +00:00
if ( ! live_map ) {
r = dm_calculate_queue_limits ( table , & limits ) ;
if ( r ) {
map = ERR_PTR ( r ) ;
goto out ;
}
2009-12-10 23:52:24 +00:00
}
2009-06-22 10:12:34 +01:00
2009-12-10 23:52:24 +00:00
map = __bind ( md , table , & limits ) ;
2005-04-16 15:20:36 -07:00
2005-07-12 15:53:05 -07:00
out :
2008-02-08 02:10:08 +00:00
mutex_unlock ( & md - > suspend_lock ) ;
2009-12-10 23:52:24 +00:00
return map ;
2005-04-16 15:20:36 -07:00
}
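The limit selection in dm_swap_table() above can be summarized as: reuse the live queue limits only when the new table has no data devices and a live table exists, otherwise use limits calculated from the new table. A simplified sketch of that decision, assuming hypothetical toy_ types and a single illustrative limit field (the real struct queue_limits has many more):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct toy_limits { unsigned max_sectors; };

struct toy_table {
    bool has_data_devices;
    struct toy_limits computed;   /* what a limits calculation would yield */
};

/*
 * Pick the limits to bind with the new table. 'live' is NULL when
 * there is no live table, mirroring the live_map check above.
 */
struct toy_limits toy_choose_limits(const struct toy_table *new_table,
                                    const struct toy_limits *live)
{
    assert(new_table != NULL);
    if (!new_table->has_data_devices && live != NULL)
        return *live;              /* retain existing limits */
    return new_table->computed;    /* otherwise recalculate */
}
```

Retaining the old limits matters for the multipath case named in the comment: I/O queued while all paths are gone is still sized against limits that will be valid when paths reappear.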
/*
* Functions to lock and unlock any filesystem running on the
* device .
*/
2005-07-28 21:16:00 -07:00
static int lock_fs ( struct mapped_device * md )
2005-04-16 15:20:36 -07:00
{
2006-01-06 00:20:05 -08:00
int r ;
2005-04-16 15:20:36 -07:00
WARN_ON ( md - > frozen_sb ) ;
2005-05-05 16:16:04 -07:00
2009-06-22 10:12:15 +01:00
md - > frozen_sb = freeze_bdev ( md - > bdev ) ;
2005-05-05 16:16:04 -07:00
if ( IS_ERR ( md - > frozen_sb ) ) {
2005-07-28 21:15:57 -07:00
r = PTR_ERR ( md - > frozen_sb ) ;
2006-01-06 00:20:05 -08:00
md - > frozen_sb = NULL ;
return r ;
2005-05-05 16:16:04 -07:00
}
2006-01-06 00:20:06 -08:00
set_bit ( DMF_FROZEN , & md - > flags ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2005-07-28 21:16:00 -07:00
static void unlock_fs ( struct mapped_device * md )
2005-04-16 15:20:36 -07:00
{
2006-01-06 00:20:06 -08:00
if ( ! test_bit ( DMF_FROZEN , & md - > flags ) )
return ;
2009-06-22 10:12:15 +01:00
thaw_bdev ( md - > bdev , md - > frozen_sb ) ;
2005-04-16 15:20:36 -07:00
md - > frozen_sb = NULL ;
2006-01-06 00:20:06 -08:00
clear_bit ( DMF_FROZEN , & md - > flags ) ;
2005-04-16 15:20:36 -07:00
}
/*
2014-10-28 18:34:52 -04:00
* If __dm_suspend returns 0 , the device is completely quiescent
* now . There is no request - processing activity . All new requests
* are being added to md - > deferred list .
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but by then all bios
in the request will have been completed, even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
  => blk_update_request()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => blk_finish_request()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
*
2014-10-28 18:34:52 -04:00
* Caller must hold md - > suspend_lock
2009-06-22 10:12:35 +01:00
*/
2014-10-28 18:34:52 -04:00
static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
unsigned suspend_flags, int interruptible)
2005-04-16 15:20:36 -07:00
{
2014-10-28 18:34:52 -04:00
bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG;
bool noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG;
int r;
2005-04-16 15:20:36 -07:00
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
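The flag lifetime described above can be sketched in user space roughly as follows. This is only an illustration, not the kernel code: `struct fake_md`, `fake_suspend()`, `fake_noflush_suspending()` and the two stand-in callbacks are hypothetical names, and a C11 atomic stands in for the bit operations on md->flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the DMF_NOFLUSH_SUSPENDING bit in md->flags. */
#define FAKE_NOFLUSH_SUSPENDING_BIT (1u << 0)

/* Hypothetical, heavily reduced stand-in for struct mapped_device. */
struct fake_md {
    atomic_uint flags;
};

/* What a target would call to ask "is a noflush suspend in progress?". */
static bool fake_noflush_suspending(struct fake_md *md)
{
    return (atomic_load(&md->flags) & FAKE_NOFLUSH_SUSPENDING_BIT) != 0;
}

/*
 * Set the flag before presuspend (where the flush starts) and clear it
 * after the wait for pending I/O (where the flush ends), so that it is
 * always cleared again before the function returns.
 */
static int fake_suspend(struct fake_md *md, bool noflush,
                        void (*presuspend)(struct fake_md *),
                        int (*wait_for_pending)(struct fake_md *))
{
    int r;

    if (noflush)
        atomic_fetch_or(&md->flags, FAKE_NOFLUSH_SUSPENDING_BIT);

    presuspend(md);
    r = wait_for_pending(md);

    if (noflush)
        atomic_fetch_and(&md->flags, ~FAKE_NOFLUSH_SUSPENDING_BIT);

    return r;
}

/* Stand-in callbacks used to observe the flag from "inside" a target. */
static bool fake_seen_in_presuspend;

static void fake_record_presuspend(struct fake_md *md)
{
    fake_seen_in_presuspend = fake_noflush_suspending(md);
}

static int fake_wait_ok(struct fake_md *md)
{
    (void)md;
    return 0;
}
```

Passing `fake_record_presuspend` as the presuspend callback shows the property the text relies on: the flag is visible while the target flushes, and is gone once the suspend call has returned.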
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but is returned to applications
with -EIO.
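The "recheck the flag under the lock before queueing" rule can be sketched in user space as below. It is an illustration under assumptions, not the patch itself: a pthread mutex stands in for the irq-disabling spinlock md->pushback_lock, a plain bool stands in for DMF_NOFLUSH_SUSPENDING, and all `fake_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal bio: just a list link and a completion status. */
struct fake_bio {
    struct fake_bio *next;
    int error;
};

/* Hypothetical reduced mapped_device holding only the pushback state. */
struct fake_md {
    pthread_mutex_t pushback_lock;
    bool noflush_suspending;     /* stands in for DMF_NOFLUSH_SUSPENDING */
    struct fake_bio *pushback;   /* singly linked pushback list */
};

/*
 * Queue the bio only if the flag is still set while the lock is held.
 * Otherwise the suspend has already been aborted and the bio is failed
 * with -EIO instead of being stranded on the list.
 */
static bool fake_push_back(struct fake_md *md, struct fake_bio *bio)
{
    bool queued = false;

    pthread_mutex_lock(&md->pushback_lock);
    if (md->noflush_suspending) {
        bio->next = md->pushback;
        md->pushback = bio;
        queued = true;
    }
    pthread_mutex_unlock(&md->pushback_lock);

    if (!queued)
        bio->error = -EIO;
    return queued;
}

/*
 * Abort path: clear the flag and take the whole list in one critical
 * section, so no bio can be queued after the list has been emptied.
 */
static struct fake_bio *fake_abort_suspend(struct fake_md *md)
{
    struct fake_bio *flushed;

    pthread_mutex_lock(&md->pushback_lock);
    md->noflush_suspending = false;
    flushed = md->pushback;
    md->pushback = NULL;
    pthread_mutex_unlock(&md->pushback_lock);

    return flushed;
}
```

Because the check and the list insertion share one critical section with the abort path, a bio arriving after the abort sees the cleared flag and is completed with -EIO, which is the behaviour the text describes.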
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 02:41:09 -08:00
/*
* DMF_NOFLUSH_SUSPENDING must be set before presuspend.
* This flag is cleared before dm_suspend returns.
*/
if (noflush)
set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
2014-10-28 20:13:31 -04:00
/*
* This gets reverted if there's an error later and the targets
* provide the .presuspend_undo hook.
*/
2005-07-28 21:15:57 -07:00
dm_table_presuspend_targets(map);
2009-06-22 10:12:17 +01:00
/*
2009-12-10 23:52:16 +00:00
* Flush I/O to the device.
* Any I/O submitted after lock_fs() may not be flushed.
* noflush takes precedence over do_lockfs.
* (lock_fs() flushes I/Os and waits for them to complete.)
2009-06-22 10:12:17 +01:00
*/
if (!noflush && do_lockfs) {
r = lock_fs(md);
2014-10-28 20:13:31 -04:00
if (r) {
dm_table_presuspend_undo_targets(map);
2014-10-28 18:34:52 -04:00
return r;
2014-10-28 20:13:31 -04:00
}
2006-01-06 00:20:06 -08:00
}
2005-04-16 15:20:36 -07:00
/*
2009-04-09 00:27:15 +01:00
* Here we must make sure that no processes are submitting requests
* to target drivers i.e. no one may be executing
* __split_and_process_bio. This is called from dm_request and
* dm_wq_work.
*
* To get all processes out of __split_and_process_bio in dm_request,
* we take the write lock. To prevent any process from reentering
2010-09-08 18:07:00 +02:00
* __split_and_process_bio from dm_request and quiesce the thread
* (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
* flush_workqueue(md->wq).
2005-04-16 15:20:36 -07:00
*/
2009-04-09 00:27:14 +01:00
set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
2014-11-05 14:35:50 +01:00
if (map)
synchronize_srcu(&md->io_barrier);
2005-04-16 15:20:36 -07:00
dm: add request based barrier support
This patch adds barrier support for request-based dm.
CORE DESIGN
The design is basically same as bio-based dm, which emulates barrier
by mapping empty barrier bios before/after a barrier I/O.
But request-based dm has been using struct request_queue for I/O
queueing, so the block-layer's barrier mechanism can be used.
o Summary of the block-layer's behavior (on which dm-core depends)
Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
I/O barrier. It means that when an I/O requiring barrier is found
in the request_queue, the block-layer makes pre-flush request and
post-flush request just before and just after the I/O respectively.
After the ordered sequence starts, the block-layer waits for all
in-flight I/Os to complete, then gives drivers the pre-flush request,
the barrier I/O and the post-flush request one by one.
It means that the request_queue is stopped automatically by
the block-layer until drivers complete each sequence.
o dm-core
For the barrier I/O, treats it as a normal I/O, so no additional
code is needed.
For the pre/post-flush request, flushes caches as follows:
1. Make the number of empty barrier requests required by target's
num_flush_requests, and map them (dm_rq_barrier()).
2. Waits for the mapped barriers to complete (dm_rq_barrier()).
If error has occurred, save the error value to md->barrier_error
(dm_end_request()).
(*) Basically, the first reported error is taken.
But -EOPNOTSUPP supersedes any error and DM_ENDIO_REQUEUE
follows.
3. Requeue the pre/post-flush request if the error value is
DM_ENDIO_REQUEUE. Otherwise, completes with the error value
(dm_rq_barrier_work()).
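The error precedence in step 2 (first error wins, except that -EOPNOTSUPP overrides everything and DM_ENDIO_REQUEUE overrides anything but -EOPNOTSUPP) can be written as a small pure function. This is a sketch under assumptions: `fake_merge_barrier_error()` is a hypothetical helper, and `FAKE_ENDIO_REQUEUE` is a placeholder whose value is chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder for DM_ENDIO_REQUEUE; the value here is an assumption. */
#define FAKE_ENDIO_REQUEUE 2

/*
 * Combine the error saved so far with a newly reported one.
 * 'saved' is 0 while no error has been recorded; 'error' is expected to
 * be a non-zero error value (this helper is only called on failure).
 *
 *   - no error saved yet          -> take the new error
 *   - new error is -EOPNOTSUPP    -> it supersedes whatever was saved
 *   - new error is a requeue      -> it supersedes anything except
 *                                    a saved -EOPNOTSUPP
 *   - otherwise                   -> keep the first reported error
 */
static int fake_merge_barrier_error(int saved, int error)
{
    if (!saved || error == -EOPNOTSUPP ||
        (saved != -EOPNOTSUPP && error == FAKE_ENDIO_REQUEUE))
        return error;
    return saved;
}
```

Keeping the rule in one function makes the ordering explicit: the result depends only on the pair of values, not on which clone reported first, apart from the "first ordinary error is kept" case.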
The pre/post-flush work above is done in the kernel thread (kdmflush)
context, since memory allocation which might sleep is needed in
dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is
an irq-disabled context.
Also, clones of the pre/post-flush request share an original, so
such clones can't be completed using the softirq context.
Instead, complete them in the context of underlying device drivers.
It should be safe since there is no I/O dispatching during
the completion of such clones.
For suspend, the workqueue of kdmflush needs to be flushed after
the request_queue has been stopped. Otherwise, the next flush work
can be kicked even after the suspend completes.
TARGET INTERFACE
No new interface is added.
Just use the existing num_flush_requests in struct target_type,
the same as bio-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-10 23:52:18 +00:00
/*
2010-09-08 18:07:00 +02:00
* Stop md->queue before flushing md->wq in case request-based
* dm defers requests to md->wq from md->queue.
2009-12-10 23:52:18 +00:00
*/
2014-10-17 17:46:36 -06:00
if (dm_request_based(md)) {
2016-02-20 13:45:38 -05:00
dm_stop_queue(md->queue);
2015-03-10 23:49:26 -04:00
if (md->kworker_task)
flush_kthread_worker(&md->kworker);
2014-10-17 17:46:36 -06:00
}
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_rq_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but by then all bios
in the request will have been completed, even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
  => blk_update_request()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => blk_finish_request()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in those cases.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
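The difference between the two suspend modes can be illustrated with a toy queue. This is a user-space sketch, not the request_queue code: `struct fake_queue`, `FAKE_MARKER` and the `fake_*` functions are hypothetical, request ids are assumed to be non-negative so they never collide with the marker, and the wait for in-flight requests is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_MARKER    (-1)   /* hypothetical marker id; real ids are >= 0 */
#define FAKE_QUEUE_MAX 16

/* Toy FIFO: items[head..tail) are still queued. */
struct fake_queue {
    int items[FAKE_QUEUE_MAX];
    size_t head;
    size_t tail;
    size_t dispatched;   /* number of requests handed to lower devices */
    bool stopped;
};

static bool fake_enqueue(struct fake_queue *q, int id)
{
    if (q->tail == FAKE_QUEUE_MAX)
        return false;
    q->items[q->tail++] = id;
    return true;
}

/* noflush suspend: leave queued requests where they are, just stop. */
static void fake_noflush_suspend(struct fake_queue *q)
{
    q->stopped = true;
}

/*
 * flush suspend: append a marker, dispatch everything queued ahead of
 * it, then complete the marker and stop the queue.
 */
static bool fake_flush_suspend(struct fake_queue *q)
{
    if (!fake_enqueue(q, FAKE_MARKER))
        return false;

    /* Dispatch until the marker reaches the front of the queue. */
    while (q->items[q->head] != FAKE_MARKER) {
        q->head++;
        q->dispatched++;
    }

    /* A real implementation would wait here for dispatched requests. */

    q->head++;           /* complete the marker request */
    q->stopped = true;   /* stop the queue; a waiter would be woken here */
    return true;
}
```

With three requests queued, a flush suspend dispatches all three before stopping, while a noflush suspend stops immediately and leaves them queued for resume.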
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
2009-12-10 23:52:18 +00:00
flush_workqueue(md->wq);
2005-04-16 15:20:36 -07:00
/*
2009-04-09 00:27:15 +01:00
* At this point no more requests are entering target request routines.
* We call dm_wait_for_completion to wait for all existing requests
* to finish.
2005-04-16 15:20:36 -07:00
*/
2014-10-28 18:34:52 -04:00
r = dm_wait_for_completion(md, interruptible);
2005-04-16 15:20:36 -07:00
2008-02-08 02:10:22 +00:00
if (noflush)
2009-04-02 19:55:39 +01:00
clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
2014-11-05 14:35:50 +01:00
if (map)
synchronize_srcu(&md->io_barrier);
2006-12-08 02:41:09 -08:00
2005-04-16 15:20:36 -07:00
/* were we interrupted? */
2008-02-08 02:10:30 +00:00
if (r < 0) {
2009-04-02 19:55:36 +01:00
dm_queue_flush(md);
2008-02-08 02:10:25 +00:00
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
pref_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request() is called with the queue lock held, so it defers
completion of the clone request and the original request to a softirq
context (dm_softirq_done()), which runs without the queue lock. This
avoids a deadlock if another request is submitted during the completion:
- The submitted request may be mapped to the same device
- Request submission requires the queue lock, but the completing context
already holds it and the submitter has no way of knowing that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
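The two-level completion described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the struct layouts and every sketch_* name are hypothetical stand-ins, not kernel types or functions, and the "softirq" step is modelled as an ordinary function call made after the hook returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the block-layer objects; not kernel types. */
struct sketch_orig_request {
	unsigned int bytes_left;  /* bytes not yet reported as completed */
	bool completed;           /* set only by the deferred step */
	bool requeued;            /* set only by the deferred step */
};

struct sketch_clone_request {
	struct sketch_orig_request *orig;
	int error;                /* 0 on success, negative value on error */
	bool done_pending;        /* completion handed to the deferred step */
};

/* Analogue of end_clone_bio(): called once per clone bio. */
static void sketch_end_clone_bio(struct sketch_clone_request *clone,
				 unsigned int bio_bytes, int error)
{
	if (error) {
		/* Error: keep the first error, leave the original bio alone. */
		if (!clone->error)
			clone->error = error;
		return;
	}
	/*
	 * Success: account the bytes against the original request
	 * (the blk_update_request() role), but never finish the
	 * original request from here.
	 */
	assert(clone->orig->bytes_left >= bio_bytes);
	clone->orig->bytes_left -= bio_bytes;
}

/* Analogue of end_clone_request(): runs "under the queue lock", so it only defers. */
static void sketch_end_clone_request(struct sketch_clone_request *clone)
{
	clone->done_pending = true;
}

/* Analogue of dm_softirq_done(): runs later, without the queue lock. */
static void sketch_softirq_done(struct sketch_clone_request *clone)
{
	if (!clone->done_pending)
		return;
	clone->done_pending = false;
	if (clone->error)
		clone->orig->requeued = true;   /* Error: requeue the original */
	else
		clone->orig->completed = true;  /* Success: complete the original */
}
```

In this sketch the original request changes state only in sketch_softirq_done(), which mirrors the point made above: the bio hook may account bytes, but the final decision between completing and requeueing is taken outside the locked context.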
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker reaches
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
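The marker-based flush suspend can be illustrated with a small userspace model. It is a sketch under assumptions, not the dm implementation: the fixed-size array queue, the SKETCH_MARKER value and all sketch_* names are hypothetical, and "waiting for dispatched requests" is collapsed into resetting a counter because nothing here runs asynchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_QUEUE_MAX 16
#define SKETCH_MARKER    (-1)  /* hypothetical marker id, not a kernel constant */

struct sketch_queue {
	int items[SKETCH_QUEUE_MAX]; /* request ids; SKETCH_MARKER is the flush point */
	size_t head, tail;           /* valid entries are items[head..tail) */
	bool stopped;
	unsigned int in_flight;
};

static bool sketch_enqueue(struct sketch_queue *q, int id)
{
	if (q->tail == SKETCH_QUEUE_MAX)
		return false;
	q->items[q->tail++] = id;
	return true;
}

/*
 * Flush suspend: returns the number of requests dispatched before the
 * marker reached the front, or -1 if the marker could not be queued.
 */
static int sketch_flush_suspend(struct sketch_queue *q)
{
	int dispatched = 0;

	/* Insert the marker at the tail. */
	if (!sketch_enqueue(q, SKETCH_MARKER))
		return -1;

	/* Dispatch everything queued ahead of the marker. */
	while (q->items[q->head] != SKETCH_MARKER) {
		q->head++;
		q->in_flight++;
		dispatched++;
	}

	/* Wait for dispatched requests; they finish instantly in this model. */
	q->in_flight = 0;

	/* Complete the marker, then stop the queue. */
	q->head++;
	q->stopped = true;
	return dispatched;
}

/* Noflush suspend: just stop the queue and leave queued requests in place. */
static void sketch_noflush_suspend(struct sketch_queue *q)
{
	q->stopped = true;
}

/* Resume: start the queue again. */
static void sketch_resume(struct sketch_queue *q)
{
	q->stopped = false;
}
```

The difference between the two suspend flavours shows up in what is left on the queue afterwards: the flush variant drains it, while the noflush variant keeps every queued request for later.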
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
if ( dm_request_based ( md ) )
2016-02-20 13:45:38 -05:00
dm_start_queue ( md - > queue ) ;
2009-06-22 10:12:35 +01:00
2005-07-28 21:16:00 -07:00
unlock_fs ( md ) ;
2014-10-28 20:13:31 -04:00
dm_table_presuspend_undo_targets ( map ) ;
2014-10-28 18:34:52 -04:00
/* pushback list is already flushed, so skip flush */
2005-07-28 21:16:00 -07:00
}
2005-04-16 15:20:36 -07:00
2014-10-28 18:34:52 -04:00
return r ;
}
/*
* We need to be able to change a mapping table under a mounted
* filesystem . For example we might want to move some data in
* the background . Before the table can be swapped with
* dm_bind_table , dm_suspend must be called to flush any in
* flight bios and ensure that any further io gets deferred .
*/
/*
* Suspend mechanism in request - based dm .
*
* 1. Flush all I / Os by lock_fs ( ) if needed .
* 2. Stop dispatching any I / O by stopping the request_queue .
* 3. Wait for all in - flight I / Os to be completed or requeued .
*
* To abort suspend , start the request_queue .
*/
int dm_suspend ( struct mapped_device * md , unsigned suspend_flags )
{
struct dm_table * map = NULL ;
int r = 0 ;
retry :
mutex_lock_nested ( & md - > suspend_lock , SINGLE_DEPTH_NESTING ) ;
if ( dm_suspended_md ( md ) ) {
r = - EINVAL ;
goto out_unlock ;
}
if ( dm_suspended_internally_md ( md ) ) {
/* already internally suspended, wait for internal resume */
mutex_unlock ( & md - > suspend_lock ) ;
r = wait_on_bit ( & md - > flags , DMF_SUSPENDED_INTERNALLY , TASK_INTERRUPTIBLE ) ;
if ( r )
return r ;
goto retry ;
}
2014-11-23 09:34:29 -08:00
map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2014-10-28 18:34:52 -04:00
r = __dm_suspend ( md , map , suspend_flags , TASK_INTERRUPTIBLE ) ;
if ( r )
goto out_unlock ;
2009-04-09 00:27:15 +01:00
2005-07-28 21:16:00 -07:00
set_bit ( DMF_SUSPENDED , & md - > flags ) ;
2005-05-05 16:16:06 -07:00
2009-12-10 23:52:26 +00:00
dm_table_postsuspend_targets ( map ) ;
2006-11-08 17:44:43 -08:00
out_unlock :
2008-02-08 02:10:08 +00:00
mutex_unlock ( & md - > suspend_lock ) ;
2005-07-28 21:15:57 -07:00
return r ;
2005-04-16 15:20:36 -07:00
}
2014-10-28 18:34:52 -04:00
static int __dm_resume ( struct mapped_device * md , struct dm_table * map )
{
if ( map ) {
int r = dm_table_resume_targets ( map ) ;
if ( r )
return r ;
}
dm_queue_flush ( md ) ;
/*
* Flushing deferred I / Os must be done after targets are resumed
* so that mapping of targets can work correctly .
* Request - based dm is queueing the deferred I / Os in its request_queue .
*/
if ( dm_request_based ( md ) )
2016-02-20 13:45:38 -05:00
dm_start_queue ( md - > queue ) ;
2014-10-28 18:34:52 -04:00
unlock_fs ( md ) ;
return 0 ;
}
2005-04-16 15:20:36 -07:00
int dm_resume ( struct mapped_device * md )
{
2005-07-28 21:15:57 -07:00
int r = - EINVAL ;
struct dm_table * map = NULL ;
2005-04-16 15:20:36 -07:00
2014-10-28 18:34:52 -04:00
retry :
mutex_lock_nested ( & md - > suspend_lock , SINGLE_DEPTH_NESTING ) ;
2009-12-10 23:52:26 +00:00
if ( ! dm_suspended_md ( md ) )
2005-07-28 21:15:57 -07:00
goto out ;
2014-10-28 18:34:52 -04:00
if ( dm_suspended_internally_md ( md ) ) {
/* already internally suspended, wait for internal resume */
mutex_unlock ( & md - > suspend_lock ) ;
r = wait_on_bit ( & md - > flags , DMF_SUSPENDED_INTERNALLY , TASK_INTERRUPTIBLE ) ;
if ( r )
return r ;
goto retry ;
}
2014-11-23 09:34:29 -08:00
map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2005-07-28 21:16:00 -07:00
if ( ! map | | ! dm_table_get_size ( map ) )
2005-07-28 21:15:57 -07:00
goto out ;
2005-04-16 15:20:36 -07:00
2014-10-28 18:34:52 -04:00
r = __dm_resume ( md , map ) ;
2006-10-03 01:15:36 -07:00
if ( r )
goto out ;
2005-07-28 21:16:00 -07:00
clear_bit ( DMF_SUSPENDED , & md - > flags ) ;
2005-07-28 21:15:57 -07:00
r = 0 ;
out :
2008-02-08 02:10:08 +00:00
mutex_unlock ( & md - > suspend_lock ) ;
2005-07-28 21:16:00 -07:00
2005-07-28 21:15:57 -07:00
return r ;
2005-04-16 15:20:36 -07:00
}
2013-08-16 10:54:23 -04:00
/*
* Internal suspend / resume works like userspace - driven suspend . It waits
* until all bios finish and prevents issuing new bios to the target drivers .
* It may be used only from the kernel .
*/
2014-10-28 18:34:52 -04:00
static void __dm_internal_suspend ( struct mapped_device * md , unsigned suspend_flags )
2013-08-16 10:54:23 -04:00
{
2014-10-28 18:34:52 -04:00
struct dm_table * map = NULL ;
2015-01-08 18:52:26 -05:00
if ( md - > internal_suspend_count + + )
2014-10-28 18:34:52 -04:00
return ; /* nested internal suspend */
if ( dm_suspended_md ( md ) ) {
set_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
return ; /* nest suspend */
}
2014-11-23 09:34:29 -08:00
map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2014-10-28 18:34:52 -04:00
/*
* Using TASK_UNINTERRUPTIBLE because only NOFLUSH internal suspend is
* supported . Properly supporting a TASK_INTERRUPTIBLE internal suspend
* would require changing . presuspend to return an error - - avoid this
* until there is a need for more elaborate variants of internal suspend .
*/
( void ) __dm_suspend ( md , map , suspend_flags , TASK_UNINTERRUPTIBLE ) ;
set_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
dm_table_postsuspend_targets ( map ) ;
}
static void __dm_internal_resume ( struct mapped_device * md )
{
2015-01-08 18:52:26 -05:00
BUG_ON ( ! md - > internal_suspend_count ) ;
if ( - - md - > internal_suspend_count )
2014-10-28 18:34:52 -04:00
return ; /* resume from nested internal suspend */
2013-08-16 10:54:23 -04:00
if ( dm_suspended_md ( md ) )
2014-10-28 18:34:52 -04:00
goto done ; /* resume from nested suspend */
/*
* NOTE : existing callers don ' t need to call dm_table_resume_targets
* ( which may fail - - so best to avoid it for now by passing NULL map )
*/
( void ) __dm_resume ( md , NULL ) ;
done :
clear_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
smp_mb__after_atomic ( ) ;
wake_up_bit ( & md - > flags , DMF_SUSPENDED_INTERNALLY ) ;
}
void dm_internal_suspend_noflush ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
__dm_internal_suspend ( md , DM_SUSPEND_NOFLUSH_FLAG ) ;
mutex_unlock ( & md - > suspend_lock ) ;
}
EXPORT_SYMBOL_GPL ( dm_internal_suspend_noflush ) ;
void dm_internal_resume ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
__dm_internal_resume ( md ) ;
mutex_unlock ( & md - > suspend_lock ) ;
}
EXPORT_SYMBOL_GPL ( dm_internal_resume ) ;
/*
* Fast variants of internal suspend / resume hold md - > suspend_lock ,
* which prevents interaction with userspace - driven suspend .
*/
void dm_internal_suspend_fast ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
if ( dm_suspended_md ( md ) | | dm_suspended_internally_md ( md ) )
2013-08-16 10:54:23 -04:00
return ;
set_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ;
synchronize_srcu ( & md - > io_barrier ) ;
flush_workqueue ( md - > wq ) ;
dm_wait_for_completion ( md , TASK_UNINTERRUPTIBLE ) ;
}
2015-02-26 11:40:35 -05:00
EXPORT_SYMBOL_GPL ( dm_internal_suspend_fast ) ;
2013-08-16 10:54:23 -04:00
2014-10-28 18:34:52 -04:00
void dm_internal_resume_fast ( struct mapped_device * md )
2013-08-16 10:54:23 -04:00
{
2014-10-28 18:34:52 -04:00
if ( dm_suspended_md ( md ) | | dm_suspended_internally_md ( md ) )
2013-08-16 10:54:23 -04:00
goto done ;
dm_queue_flush ( md ) ;
done :
mutex_unlock ( & md - > suspend_lock ) ;
}
2015-02-26 11:40:35 -05:00
EXPORT_SYMBOL_GPL ( dm_internal_resume_fast ) ;
2013-08-16 10:54:23 -04:00
2005-04-16 15:20:36 -07:00
/*-----------------------------------------------------------------
* Event notification .
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
2010-03-06 02:32:31 +00:00
int dm_kobject_uevent ( struct mapped_device * md , enum kobject_action action ,
2009-06-22 10:12:30 +01:00
unsigned cookie )
2007-12-13 14:15:57 +00:00
{
2009-06-22 10:12:30 +01:00
char udev_cookie [ DM_COOKIE_LENGTH ] ;
char * envp [ ] = { udev_cookie , NULL } ;
if ( ! cookie )
2010-03-06 02:32:31 +00:00
return kobject_uevent ( & disk_to_dev ( md - > disk ) - > kobj , action ) ;
2009-06-22 10:12:30 +01:00
else {
snprintf ( udev_cookie , DM_COOKIE_LENGTH , " %s=%u " ,
DM_COOKIE_ENV_VAR_NAME , cookie ) ;
2010-03-06 02:32:31 +00:00
return kobject_uevent_env ( & disk_to_dev ( md - > disk ) - > kobj ,
action , envp ) ;
2009-06-22 10:12:30 +01:00
}
2007-12-13 14:15:57 +00:00
}
2007-10-19 22:48:01 +01:00
uint32_t dm_next_uevent_seq ( struct mapped_device * md )
{
return atomic_add_return ( 1 , & md - > uevent_seq ) ;
}
2005-04-16 15:20:36 -07:00
uint32_t dm_get_event_nr ( struct mapped_device * md )
{
return atomic_read ( & md - > event_nr ) ;
}
int dm_wait_event ( struct mapped_device * md , int event_nr )
{
return wait_event_interruptible ( md - > eventq ,
( event_nr ! = atomic_read ( & md - > event_nr ) ) ) ;
}
2007-10-19 22:48:01 +01:00
void dm_uevent_add ( struct mapped_device * md , struct list_head * elist )
{
unsigned long flags ;
spin_lock_irqsave ( & md - > uevent_lock , flags ) ;
list_add ( elist , & md - > uevent_list ) ;
spin_unlock_irqrestore ( & md - > uevent_lock , flags ) ;
}
2005-04-16 15:20:36 -07:00
/*
* The gendisk is only valid as long as you have a reference
* count on ' md ' .
*/
struct gendisk * dm_disk ( struct mapped_device * md )
{
return md - > disk ;
}
2015-03-18 15:52:14 +00:00
EXPORT_SYMBOL_GPL ( dm_disk ) ;
2005-04-16 15:20:36 -07:00
2009-01-06 03:05:12 +00:00
struct kobject * dm_kobject ( struct mapped_device * md )
{
2014-01-13 19:37:54 -05:00
return & md - > kobj_holder . kobj ;
2009-01-06 03:05:12 +00:00
}
struct mapped_device * dm_get_from_kobject ( struct kobject * kobj )
{
struct mapped_device * md ;
2014-01-13 19:37:54 -05:00
md = container_of ( kobj , struct mapped_device , kobj_holder . kobj ) ;
2009-01-06 03:05:12 +00:00
2009-06-22 10:12:11 +01:00
if ( test_bit ( DMF_FREEING , & md - > flags ) | |
2009-12-10 23:52:20 +00:00
dm_deleting_md ( md ) )
2009-06-22 10:12:11 +01:00
return NULL ;
2009-01-06 03:05:12 +00:00
dm_get ( md ) ;
return md ;
}
2009-12-10 23:52:26 +00:00
int dm_suspended_md ( struct mapped_device * md )
2005-04-16 15:20:36 -07:00
{
return test_bit ( DMF_SUSPENDED , & md - > flags ) ;
}
2014-10-28 18:34:52 -04:00
int dm_suspended_internally_md ( struct mapped_device * md )
{
return test_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
}
2013-11-01 18:27:41 -04:00
int dm_test_deferred_remove_flag ( struct mapped_device * md )
{
return test_bit ( DMF_DEFERRED_REMOVE , & md - > flags ) ;
}
2009-12-10 23:52:27 +00:00
int dm_suspended ( struct dm_target * ti )
{
dm table: remove dm_get from dm_table_get_md
Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
be called from presuspend/postsuspend, which are called while
mapped_device is in DMF_FREEING state, where dm_get() is not allowed.
The justification is the lifetime of both objects: in the current dm
design and implementation, a mapped_device is never freed while targets
are still doing something, because dm core waits for targets to become
quiet in dm_put() using presuspend/postsuspend. So targets should be
able to touch the mapped_device without holding a reference count on
it, and we should allow targets to touch the mapped_device even
if it is in DMF_FREEING state.
Background:
I'm trying to remove the multipath internal queue, since dm core now has
a generic queue for request-based dm. In the patch-set, the multipath
target wants to ask dm core to start/stop the queue. One such
start/stop request can happen during postsuspend() while the target
waits for pg-init to complete, because the target stops the queue when
starting pg-init and tries to restart it when completing pg-init. Since
the queue belongs to the mapped_device, it involves calling dm_table_get_md()
and dm_put(). On the other hand, postsuspend() is called in dm_put()
for a mapped_device which is in DMF_FREEING state, and that triggers
BUG_ON(DMF_FREEING) in the 2nd dm_put().
I tried to solve this problem by changing only multipath so that it does
not touch a mapped_device which is in DMF_FREEING state, but I couldn't,
which led me to question why we need dm_get() in dm_table_get_md().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-03-06 02:29:52 +00:00
return dm_suspended_md ( dm_table_get_md ( ti - > table ) ) ;
2009-12-10 23:52:27 +00:00
}
EXPORT_SYMBOL_GPL ( dm_suspended ) ;
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it with -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to ask the
device-mapper core to queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. in the first part of
dec_pending() or in target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but is returned to applications
with -EIO.
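The locking rule in this section (recheck the flag under the same lock that protects the list) can be sketched in userspace C11. Everything here is an assumption made for illustration: the sketch_* names are hypothetical, a C11 atomic_flag spinlock stands in for md->pushback_lock, an atomic_bool stands in for the DMF_NOFLUSH_SUSPENDING bit, and interrupt disabling has no userspace equivalent so it is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct bio or struct mapped_device. */
struct sketch_bio {
	struct sketch_bio *next;
};

struct sketch_md {
	atomic_flag pushback_lock;      /* toy spinlock for md->pushback_lock */
	atomic_bool noflush_suspending; /* stands in for DMF_NOFLUSH_SUSPENDING */
	struct sketch_bio *pushback;    /* list protected by pushback_lock */
};

static void sketch_md_init(struct sketch_md *md)
{
	atomic_flag_clear(&md->pushback_lock);
	atomic_init(&md->noflush_suspending, false);
	md->pushback = NULL;
}

static void sketch_lock(struct sketch_md *md)
{
	while (atomic_flag_test_and_set_explicit(&md->pushback_lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
}

static void sketch_unlock(struct sketch_md *md)
{
	atomic_flag_clear_explicit(&md->pushback_lock, memory_order_release);
}

/*
 * Completion side, for a bio the target asked to requeue.
 * Returns 0 if the bio was parked on the pushback list, or -EIO if the
 * noflush suspend was no longer in progress when the lock was taken.
 */
static int sketch_end_requeued_bio(struct sketch_md *md, struct sketch_bio *bio)
{
	int r = -EIO;

	sketch_lock(md);
	/* The flag check and the list insertion happen under one lock. */
	if (atomic_load(&md->noflush_suspending)) {
		bio->next = md->pushback;
		md->pushback = bio;
		r = 0;
	}
	sketch_unlock(md);
	return r;
}

/* Suspend side: clear the flag and take the whole list under the lock. */
static struct sketch_bio *sketch_flush_pushback(struct sketch_md *md)
{
	struct sketch_bio *list;

	sketch_lock(md);
	atomic_store(&md->noflush_suspending, false);
	list = md->pushback;
	md->pushback = NULL;
	sketch_unlock(md);
	return list;
}
```

Because the flag is cleared and the list emptied in one critical section, a completion that arrives afterwards sees the cleared flag and gets -EIO instead of leaving a bio stranded on the list, which is the race the text describes.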
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 02:41:09 -08:00
int dm_noflush_suspending ( struct dm_target * ti )
{
2010-03-06 02:29:52 +00:00
return __noflush_suspending ( dm_table_get_md ( ti - > table ) ) ;
2006-12-08 02:41:09 -08:00
}
EXPORT_SYMBOL_GPL ( dm_noflush_suspending ) ;
2015-06-26 10:01:13 -04:00
struct dm_md_mempools * dm_alloc_md_mempools ( struct mapped_device * md , unsigned type ,
2016-01-31 13:28:26 -05:00
unsigned integrity , unsigned per_io_data_size )
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since some target
drivers are better suited to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since a bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based devices.
                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, it can't be changed.
o Mempool allocations are deferred until table loading time, since
mempools for request-based dm are different from those for bio-based
dm and the needed mempool type is fixed by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
requires all target drivers to support a merge() function.
Both will take time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
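The stacking table in this commit message can be read as a small predicate. The following is a minimal sketch, assuming the rows of the table are the upper (dm) device and the columns are the underlying device. `enum dev_kind` and `stacking_allowed()` are hypothetical names for illustration only; as the message says, the kernel recognizes the device type through a queue flag rather than an enum like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device kinds; not kernel identifiers. */
enum dev_kind { KIND_BIO_BASED, KIND_REQUEST_BASED };

/*
 * Returns true when a device of kind `upper` may be stacked on a device
 * of kind `lower`. A bio-based device cannot receive struct request, so
 * request-based on top of bio-based is the only rejected combination.
 */
static bool stacking_allowed(enum dev_kind upper, enum dev_kind lower)
{
	return !(upper == KIND_REQUEST_BASED && lower == KIND_BIO_BASED);
}
```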
2009-06-22 10:12:36 +01:00
{
2016-02-22 12:16:21 -05:00
struct dm_md_mempools * pools = kzalloc_node ( sizeof ( * pools ) , GFP_KERNEL , md - > numa_node_id ) ;
2015-06-26 10:01:13 -04:00
struct kmem_cache * cachep = NULL ;
unsigned int pool_size = 0 ;
2013-03-01 22:45:48 +00:00
unsigned int front_pad ;
2009-06-22 10:12:36 +01:00
if ( ! pools )
2015-06-26 09:42:57 -04:00
return NULL ;
2009-06-22 10:12:36 +01:00
2015-06-26 10:01:13 -04:00
switch ( type ) {
case DM_TYPE_BIO_BASED :
2016-06-22 17:54:53 -06:00
case DM_TYPE_DAX_BIO_BASED :
2015-06-26 10:01:13 -04:00
cachep = _io_cache ;
pool_size = dm_get_reserved_bio_based_ios ( ) ;
2016-01-31 13:28:26 -05:00
front_pad = roundup ( per_io_data_size , __alignof__ ( struct dm_target_io ) ) + offsetof ( struct dm_target_io , clone ) ;
2015-06-26 10:01:13 -04:00
break ;
case DM_TYPE_REQUEST_BASED :
cachep = _rq_tio_cache ;
pool_size = dm_get_reserved_rq_based_ios ( ) ;
pools - > rq_pool = mempool_create_slab_pool ( pool_size , _rq_cache ) ;
if ( ! pools - > rq_pool )
goto out ;
/* fall through to setup remaining rq-based pools */
case DM_TYPE_MQ_REQUEST_BASED :
if ( ! pool_size )
pool_size = dm_get_reserved_rq_based_ios ( ) ;
front_pad = offsetof ( struct dm_rq_clone_bio_info , clone ) ;
2016-01-31 12:05:42 -05:00
/* per_io_data_size is used for blk-mq pdu at queue allocation */
2015-06-26 10:01:13 -04:00
break ;
default :
BUG ( ) ;
}
if ( cachep ) {
pools - > io_pool = mempool_create_slab_pool ( pool_size , cachep ) ;
if ( ! pools - > io_pool )
goto out ;
}
2009-06-22 10:12:36 +01:00
2014-10-03 11:55:26 +00:00
pools - > bs = bioset_create_nobvec ( pool_size , front_pad ) ;
2009-06-22 10:12:36 +01:00
if ( ! pools - > bs )
2013-03-01 22:45:48 +00:00
goto out ;
2009-06-22 10:12:36 +01:00
2011-03-17 11:11:05 +01:00
if ( integrity & & bioset_integrity_create ( pools - > bs , pool_size ) )
2013-03-01 22:45:48 +00:00
goto out ;
2011-03-17 11:11:05 +01:00
2009-06-22 10:12:36 +01:00
return pools ;
2015-05-22 09:14:04 -04:00
out :
dm_free_md_mempools ( pools ) ;
2015-06-26 10:01:13 -04:00
2015-06-26 09:42:57 -04:00
return NULL ;
2009-06-22 10:12:36 +01:00
}
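The bio-based `front_pad` computation above rounds the per-io data size up to the alignment of the target-io structure, then adds the offset of the embedded clone bio. The following is a user-space sketch of that arithmetic only. `struct fake_bio`, `struct fake_target_io`, `round_up_to()` and `bio_based_front_pad()` are hypothetical stand-ins, and the real kernel structures have different fields and sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct fake_bio {
	unsigned long flags;
	void *private_data;
};

struct fake_target_io {
	void *io;
	unsigned int target_bio_nr;
	struct fake_bio clone;	/* embedded clone, placed last in this sketch */
};

/* Round x up to the next multiple of align (align must be non-zero). */
static size_t round_up_to(size_t x, size_t align)
{
	return ((x + align - 1) / align) * align;
}

/*
 * Space reserved in front of each bio in the bio-based case: the
 * target's per-io data, padded to the header's alignment, followed by
 * the part of the header that precedes the embedded clone.
 */
static size_t bio_based_front_pad(size_t per_io_data_size)
{
	return round_up_to(per_io_data_size, _Alignof(struct fake_target_io))
	       + offsetof(struct fake_target_io, clone);
}
```

With no per-io data the pad is just the header offset; any non-zero size is first padded so that the header stays aligned.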
void dm_free_md_mempools ( struct dm_md_mempools * pools )
{
if ( ! pools )
return ;
2015-09-13 14:15:05 +02:00
mempool_destroy ( pools - > io_pool ) ;
mempool_destroy ( pools - > rq_pool ) ;
2014-12-05 17:11:05 -05:00
2009-06-22 10:12:36 +01:00
if ( pools - > bs )
bioset_free ( pools - > bs ) ;
kfree ( pools ) ;
}
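The allocate/free pair above relies on a zero-initialized container, a single `out:` label, and a free routine that tolerates partially built state. The following is a user-space sketch of that pattern, assuming plain `malloc`/`calloc` in place of mempools and biosets. `struct fake_pools`, `fake_pools_alloc()`, `fake_pools_free()` and the `fail_step` parameter are hypothetical and exist only to make the failure paths visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container; members start out NULL thanks to calloc(). */
struct fake_pools {
	void *io_pool;
	void *rq_pool;
	void *bs;
};

/* Safe on NULL and on partially initialized containers. */
static void fake_pools_free(struct fake_pools *pools)
{
	if (!pools)
		return;
	free(pools->io_pool);	/* free(NULL) is a no-op */
	free(pools->rq_pool);
	free(pools->bs);
	free(pools);
}

/* fail_step simulates an allocation failure at step 1 or 2 (0 = none). */
static struct fake_pools *fake_pools_alloc(int fail_step)
{
	struct fake_pools *pools = calloc(1, sizeof(*pools));

	if (!pools)
		return NULL;

	pools->io_pool = (fail_step == 1) ? NULL : malloc(16);
	if (!pools->io_pool)
		goto out;

	pools->bs = (fail_step == 2) ? NULL : malloc(16);
	if (!pools->bs)
		goto out;

	return pools;
out:
	/* One exit path releases whatever was set up so far. */
	fake_pools_free(pools);
	return NULL;
}
```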
2016-07-08 21:23:51 +09:00
struct dm_pr {
u64 old_key ;
u64 new_key ;
u32 flags ;
bool fail_early ;
} ;
static int dm_call_pr ( struct block_device * bdev , iterate_devices_callout_fn fn ,
void * data )
2015-10-15 14:10:51 +02:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
2016-07-08 21:23:51 +09:00
struct dm_table * table ;
struct dm_target * ti ;
int ret = - ENOTTY , srcu_idx ;
2015-10-15 14:10:51 +02:00
2016-07-08 21:23:51 +09:00
table = dm_get_live_table ( md , & srcu_idx ) ;
if ( ! table | | ! dm_table_get_size ( table ) )
goto out ;
2015-10-15 14:10:51 +02:00
2016-07-08 21:23:51 +09:00
/* We only support devices that have a single target */
if ( dm_table_get_num_targets ( table ) ! = 1 )
goto out ;
ti = dm_table_get_target ( table , 0 ) ;
2015-10-15 14:10:51 +02:00
2016-07-08 21:23:51 +09:00
ret = - EINVAL ;
if ( ! ti - > type - > iterate_devices )
goto out ;
ret = ti - > type - > iterate_devices ( ti , fn , data ) ;
out :
dm_put_live_table ( md , srcu_idx ) ;
return ret ;
}
/*
* For register / unregister we need to manually call out to every path .
*/
static int __dm_pr_register ( struct dm_target * ti , struct dm_dev * dev ,
sector_t start , sector_t len , void * data )
{
struct dm_pr * pr = data ;
const struct pr_ops * ops = dev - > bdev - > bd_disk - > fops - > pr_ops ;
if ( ! ops | | ! ops - > pr_register )
return - EOPNOTSUPP ;
return ops - > pr_register ( dev - > bdev , pr - > old_key , pr - > new_key , pr - > flags ) ;
}
static int dm_pr_register ( struct block_device * bdev , u64 old_key , u64 new_key ,
u32 flags )
{
struct dm_pr pr = {
. old_key = old_key ,
. new_key = new_key ,
. flags = flags ,
. fail_early = true ,
} ;
int ret ;
ret = dm_call_pr ( bdev , __dm_pr_register , & pr ) ;
if ( ret & & new_key ) {
/* unregister all paths if we failed to register any path */
pr . old_key = new_key ;
pr . new_key = 0 ;
pr . flags = 0 ;
pr . fail_early = false ;
dm_call_pr ( bdev , __dm_pr_register , & pr ) ;
}
return ret ;
2015-10-15 14:10:51 +02:00
}
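dm_pr_register() above registers the new key on every path and, if that fails while a new key was supplied, makes a second pass that unregisters it. The following is a user-space sketch of that register-then-roll-back flow. `struct fake_path`, `path_register()`, `for_each_path()` and `register_all()` are hypothetical. In particular, the way `fail_early` controls the loop here is an assumption of this sketch: the excerpt sets the flag but does not show where it is consumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one underlying path. */
struct fake_path {
	uint64_t key;	/* 0 means no key is registered */
	int broken;	/* non-zero: every operation on this path fails */
};

/* Replace old_key with new_key on one path. */
static int path_register(struct fake_path *p, uint64_t old_key, uint64_t new_key)
{
	if (p->broken)
		return -EIO;
	if (p->key != old_key)
		return -EINVAL;
	p->key = new_key;
	return 0;
}

/* Visit every path; stop at the first error only when fail_early is set. */
static int for_each_path(struct fake_path *paths, size_t n,
			 uint64_t old_key, uint64_t new_key, int fail_early)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int r = path_register(&paths[i], old_key, new_key);

		if (r && !ret)
			ret = r;	/* remember the first error */
		if (r && fail_early)
			break;
	}
	return ret;
}

/* Register on all paths; on failure, unregister new_key as best effort. */
static int register_all(struct fake_path *paths, size_t n,
			uint64_t old_key, uint64_t new_key)
{
	int ret = for_each_path(paths, n, old_key, new_key, 1);

	if (ret && new_key)
		for_each_path(paths, n, new_key, 0, 0);
	return ret;
}
```

The rollback pass ignores its own errors and returns the original failure, matching the shape of the function above.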
static int dm_pr_reserve ( struct block_device * bdev , u64 key , enum pr_type type ,
2016-02-18 16:13:51 -05:00
u32 flags )
2015-10-15 14:10:51 +02:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
fmode_t mode ;
2016-02-18 16:13:51 -05:00
int r ;
2015-10-15 14:10:51 +02:00
2016-02-18 16:13:51 -05:00
r = dm_grab_bdev_for_ioctl ( md , & bdev , & mode ) ;
2015-10-15 14:10:51 +02:00
if ( r < 0 )
return r ;
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_reserve )
r = ops - > pr_reserve ( bdev , key , type , flags ) ;
else
r = - EOPNOTSUPP ;
2016-02-18 16:13:51 -05:00
bdput ( bdev ) ;
2015-10-15 14:10:51 +02:00
return r ;
}
static int dm_pr_release ( struct block_device * bdev , u64 key , enum pr_type type )
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
fmode_t mode ;
2016-02-18 16:13:51 -05:00
int r ;
2015-10-15 14:10:51 +02:00
2016-02-18 16:13:51 -05:00
r = dm_grab_bdev_for_ioctl ( md , & bdev , & mode ) ;
2015-10-15 14:10:51 +02:00
if ( r < 0 )
return r ;
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_release )
r = ops - > pr_release ( bdev , key , type ) ;
else
r = - EOPNOTSUPP ;
2016-02-18 16:13:51 -05:00
bdput ( bdev ) ;
2015-10-15 14:10:51 +02:00
return r ;
}
static int dm_pr_preempt ( struct block_device * bdev , u64 old_key , u64 new_key ,
2016-02-18 16:13:51 -05:00
enum pr_type type , bool abort )
2015-10-15 14:10:51 +02:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
fmode_t mode ;
2016-02-18 16:13:51 -05:00
int r ;
2015-10-15 14:10:51 +02:00
2016-02-18 16:13:51 -05:00
r = dm_grab_bdev_for_ioctl ( md , & bdev , & mode ) ;
2015-10-15 14:10:51 +02:00
if ( r < 0 )
return r ;
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_preempt )
r = ops - > pr_preempt ( bdev , old_key , new_key , type , abort ) ;
else
r = - EOPNOTSUPP ;
2016-02-18 16:13:51 -05:00
bdput ( bdev ) ;
2015-10-15 14:10:51 +02:00
return r ;
}
static int dm_pr_clear ( struct block_device * bdev , u64 key )
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
fmode_t mode ;
2016-02-18 16:13:51 -05:00
int r ;
2015-10-15 14:10:51 +02:00
2016-02-18 16:13:51 -05:00
r = dm_grab_bdev_for_ioctl ( md , & bdev , & mode ) ;
2015-10-15 14:10:51 +02:00
if ( r < 0 )
return r ;
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_clear )
r = ops - > pr_clear ( bdev , key ) ;
else
r = - EOPNOTSUPP ;
2016-02-18 16:13:51 -05:00
bdput ( bdev ) ;
2015-10-15 14:10:51 +02:00
return r ;
}
static const struct pr_ops dm_pr_ops = {
. pr_register = dm_pr_register ,
. pr_reserve = dm_pr_reserve ,
. pr_release = dm_pr_release ,
. pr_preempt = dm_pr_preempt ,
. pr_clear = dm_pr_clear ,
} ;
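The reserve, release, preempt and clear handlers above share one shape: look up the underlying device's operations table, call the matching callback if both the table and the callback exist, and otherwise return -EOPNOTSUPP. The following is a user-space sketch of that forwarding check. `struct fake_pr_ops`, `struct fake_dev`, `fake_clear()` and `forward_pr_clear()` are hypothetical, and the sketch leaves out the device lookup and reference handling done by the real handlers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced operations table; the callback may be NULL. */
struct fake_pr_ops {
	int (*pr_clear)(void *dev, uint64_t key);
};

struct fake_dev {
	const struct fake_pr_ops *ops;	/* may be NULL */
	uint64_t key;
};

/* Example callback: clears the key if it matches. */
static int fake_clear(void *dev, uint64_t key)
{
	struct fake_dev *d = dev;

	if (d->key != key)
		return -EINVAL;
	d->key = 0;
	return 0;
}

static const struct fake_pr_ops full_ops = { .pr_clear = fake_clear };
static const struct fake_pr_ops empty_ops = { .pr_clear = NULL };

/* Forward the call only when the lower device implements it. */
static int forward_pr_clear(struct fake_dev *dev, uint64_t key)
{
	const struct fake_pr_ops *ops = dev->ops;

	if (ops && ops->pr_clear)
		return ops->pr_clear(dev, key);
	return -EOPNOTSUPP;
}
```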
2009-09-21 17:01:13 -07:00
static const struct block_device_operations dm_blk_dops = {
2005-04-16 15:20:36 -07:00
. open = dm_blk_open ,
. release = dm_blk_close ,
2006-10-03 01:15:15 -07:00
. ioctl = dm_blk_ioctl ,
2016-06-22 17:54:53 -06:00
. direct_access = dm_blk_direct_access ,
2006-03-27 01:17:54 -08:00
. getgeo = dm_blk_getgeo ,
2015-10-15 14:10:51 +02:00
. pr_ops = & dm_pr_ops ,
2005-04-16 15:20:36 -07:00
. owner = THIS_MODULE
} ;
/*
* module hooks
*/
module_init ( dm_init ) ;
module_exit ( dm_exit ) ;
module_param ( major , uint , 0 ) ;
MODULE_PARM_DESC ( major , " The major number of the device mapper " ) ;
2013-09-12 18:06:12 -04:00
2013-09-12 18:06:12 -04:00
module_param ( reserved_bio_based_ios , uint , S_IRUGO | S_IWUSR ) ;
MODULE_PARM_DESC ( reserved_bio_based_ios , " Reserved IOs in bio-based mempools " ) ;
2016-02-22 12:16:21 -05:00
module_param ( dm_numa_node , int , S_IRUGO | S_IWUSR ) ;
MODULE_PARM_DESC ( dm_numa_node , " NUMA node for DM device memory allocations " ) ;
2005-04-16 15:20:36 -07:00
MODULE_DESCRIPTION ( DM_NAME " driver " ) ;
MODULE_AUTHOR ( " Joe Thornber <dm-devel@redhat.com> " ) ;
MODULE_LICENSE ( " GPL " ) ;