2005-04-17 02:20:36 +04:00
/*
* Copyright (C) 2001, 2002 Sistina Software (UK) Limited.
2009-01-06 06:05:12 +03:00
* Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
2005-04-17 02:20:36 +04:00
*
* This file is released under the GPL.
*/
2016-05-12 23:28:10 +03:00
# include "dm-core.h"
# include "dm-rq.h"
2007-10-20 01:48:00 +04:00
# include "dm-uevent.h"
dm ima: measure data on table load
DM configures a block device with various target specific attributes
passed to it as a table. DM loads the table, and calls each target’s
respective constructors with the attributes as input parameters.
Some of these attributes are critical to ensure the device meets
a certain security bar. Thus, IMA should measure these attributes to
ensure they are not tampered with during the lifetime of the device,
so that external services can have high confidence in the
configuration of the block devices on a given system.
Some devices may have large tables, and a given device may change its
state (table-load, suspend, resume, rename, remove, table-clear etc.)
many times. Measuring these attributes each time the device
changes its state would significantly increase the size of the IMA logs.
Further, once configured, these attributes are not expected to change
unless a new table is loaded, or a device is removed and recreated.
Therefore the clear-text of the attributes should only be measured
during table load, and the hash of the active/inactive table should be
measured for the remaining device state changes.
Export IMA function ima_measure_critical_data() to allow measurement
of DM device parameters, as well as target specific attributes, during
table load. Compute the hash of the inactive table and store it for
measurements during future state changes. If a load is called multiple
times, update the inactive table hash with the hash of the latest
populated table, so that the correct inactive table hash is measured
when the device transitions to different states like resume, remove,
rename, etc.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-07-13 03:48:58 +03:00
# include "dm-ima.h"
2005-04-17 02:20:36 +04:00
# include <linux/init.h>
# include <linux/module.h>
2006-03-27 13:18:20 +04:00
# include <linux/mutex.h>
2020-07-08 19:25:20 +03:00
# include <linux/sched/mm.h>
2017-02-02 21:15:33 +03:00
# include <linux/sched/signal.h>
2005-04-17 02:20:36 +04:00
# include <linux/blkpg.h>
# include <linux/bio.h>
# include <linux/mempool.h>
2017-04-12 22:35:44 +03:00
# include <linux/dax.h>
2005-04-17 02:20:36 +04:00
# include <linux/slab.h>
# include <linux/idr.h>
2017-05-29 22:57:56 +03:00
# include <linux/uio.h>
2006-03-27 13:17:54 +04:00
# include <linux/hdreg.h>
dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed that the following path can crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more background and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
    CPU0                                          CPU1
    =================================================================
    <<INTERRUPT>>
    blk_end_request_all(clone_rq)
      blk_update_request(clone_rq)
        bio_endio(clone_bio) == end_clone_bio
          blk_update_request(orig_rq)
            bio_endio(orig_bio)
                                                  <<I/O completed>>
                                                  dm_blk_close()
                                                  dev_remove()
                                                  dm_put(md)
                                                    <<Free md>>
      blk_finish_request(clone_rq)
        ....
        dm_end_request(clone_rq)
          free_rq_clone(clone_rq)
          blk_end_request_all(orig_rq)
          rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code, which
must not be run in interrupt context and may cause a kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that actually delete
the mapped_device in ioctl (process) context, and changes dm_put()
to just decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't a performance-critical task.
Since at this point nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.
For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
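The symmetric create/destroy and get/put model described in this commit message can be sketched in user space. This is only an illustration under stated assumptions: `struct toy_dev` and the `toy_*` functions are hypothetical stand-ins, C11 atomics replace the kernel's atomic_t, and nanosleep() stands in for msleep(1). It is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for struct mapped_device; only the counter matters. */
struct toy_dev {
	atomic_int holders;
};

/* Counterpart of dm_create(): allocate a device with no references held. */
struct toy_dev *toy_create(void)
{
	struct toy_dev *dev = malloc(sizeof(*dev));

	if (dev)
		atomic_init(&dev->holders, 0);
	return dev;
}

/* Counterpart of dm_get(): take one reference. */
void toy_get(struct toy_dev *dev)
{
	atomic_fetch_add(&dev->holders, 1);
}

/* Counterpart of dm_put(): only drops a reference, so it never frees. */
void toy_put(struct toy_dev *dev)
{
	int before = atomic_fetch_sub(&dev->holders, 1);

	assert(before > 0);
	(void)before;
}

int toy_holders(struct toy_dev *dev)
{
	return atomic_load(&dev->holders);
}

/* Counterpart of dm_destroy(): poll until all references are gone, then free. */
void toy_destroy(struct toy_dev *dev)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };

	while (atomic_load(&dev->holders) > 0)
		nanosleep(&one_ms, NULL);
	free(dev);
}
```

Because toy_put() never frees anything, it would be safe to call from a context that must not sleep. Only toy_destroy() may block, which mirrors the reason the patch confines deletion to process context.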
2010-08-12 07:13:56 +04:00
# include <linux/delay.h>
2014-10-29 01:34:52 +03:00
# include <linux/wait.h>
2015-10-15 15:10:51 +03:00
# include <linux/pr.h>
2017-10-20 10:37:39 +03:00
# include <linux/refcount.h>
2020-03-25 18:48:42 +03:00
# include <linux/part_stat.h>
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
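The contiguity requirement on data unit numbers (DUNs) described in this commit message can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `toy_*` types and functions are hypothetical, a key is reduced to an integer id, and a DUN is a single 64-bit value. The real blk-crypto structures differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified encryption context: a key id and the first DUN. */
struct toy_crypt_ctx {
	int key_id;
	uint64_t first_dun;
};

/* Hypothetical bio: an encryption context plus a payload size in bytes. */
struct toy_bio {
	struct toy_crypt_ctx ctx;
	uint32_t bytes;
};

/* DUN that the data unit right after this bio's payload would use. */
uint64_t toy_next_dun(const struct toy_bio *bio, uint32_t data_unit_size)
{
	assert(data_unit_size != 0 && bio->bytes % data_unit_size == 0);
	return bio->ctx.first_dun + bio->bytes / data_unit_size;
}

/*
 * A bio may be appended to a request only if it uses the same key and its
 * first DUN continues exactly where the previous bio ended. With that
 * guarantee, keeping the first DUN of the request is enough to derive the
 * DUN of every later data unit.
 */
bool toy_can_back_merge(const struct toy_bio *head, const struct toy_bio *next,
			uint32_t data_unit_size)
{
	if (head->ctx.key_id != next->ctx.key_id)
		return false;
	return next->ctx.first_dun == toy_next_dun(head, data_unit_size);
}
```

For example, an 8192-byte bio starting at DUN 100 with 4096-byte data units ends at DUN 101, so only a bio that starts at DUN 102 with the same key could be merged behind it.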
2020-05-14 03:37:18 +03:00
# include <linux/blk-crypto.h>
2021-10-18 21:04:51 +03:00
# include <linux/blk-crypto-profile.h>
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print.
While blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      dd                   dd + ioctl blktrace   dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s      7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 09:43:05 +04:00
2006-06-26 11:27:35 +04:00
# define DM_MSG_PREFIX "core"
2009-06-22 13:12:30 +04:00
/*
* Cookies are numeric values sent with CHANGE and REMOVE
* uevents while resuming, removing or renaming the device.
*/
# define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
# define DM_COOKIE_LENGTH 24
2022-03-05 05:08:04 +03:00
/*
* For REQ_POLLED fs bio, this flag is set if we link mapped underlying
* dm_io into one list, and reuse bio->bi_private as the list head. Before
* ending this fs bio, we will recover its ->bi_private.
*/
# define REQ_DM_POLL_LIST REQ_DRV
2005-04-17 02:20:36 +04:00
static const char * _name = DM_NAME ;
static unsigned int major = 0 ;
static unsigned int _major = 0 ;
2011-08-02 15:32:01 +04:00
static DEFINE_IDR ( _minor_idr ) ;
2006-06-26 11:27:22 +04:00
static DEFINE_SPINLOCK ( _minor_lock ) ;
2013-11-02 02:27:41 +04:00
static void do_deferred_remove ( struct work_struct * w ) ;
static DECLARE_WORK ( deferred_remove_work , do_deferred_remove ) ;
2014-06-14 21:44:31 +04:00
static struct workqueue_struct * deferred_remove_workqueue ;
2017-01-17 00:05:59 +03:00
atomic_t dm_global_event_nr = ATOMIC_INIT ( 0 ) ;
DECLARE_WAIT_QUEUE_HEAD ( dm_global_eventq ) ;
2017-09-20 14:29:49 +03:00
void dm_issue_global_event ( void )
{
atomic_inc ( & dm_global_event_nr ) ;
wake_up ( & dm_global_eventq ) ;
}
2022-03-26 21:14:00 +03:00
DEFINE_STATIC_KEY_FALSE ( stats_enabled ) ;
DEFINE_STATIC_KEY_FALSE ( swap_bios_enabled ) ;
DEFINE_STATIC_KEY_FALSE ( zoned_enabled ) ;
2005-04-17 02:20:36 +04:00
/*
2017-12-12 07:17:47 +03:00
* One of these is allocated (on-stack) per original bio.
2005-04-17 02:20:36 +04:00
*/
2017-12-12 07:17:47 +03:00
struct clone_info {
struct dm_table * map ;
struct bio * bio ;
struct dm_io * io ;
sector_t sector ;
unsigned sector_count ;
2022-04-17 20:00:15 +03:00
bool is_abnormal_io : 1 ;
bool submit_as_polled : 1 ;
2017-12-12 07:17:47 +03:00
} ;
2021-01-12 08:52:00 +03:00
# define DM_TARGET_IO_BIO_OFFSET (offsetof(struct dm_target_io, clone))
# define DM_IO_BIO_OFFSET \
( offsetof ( struct dm_target_io , clone ) + offsetof ( struct dm_io , tio ) )
2022-02-02 19:00:58 +03:00
static inline struct dm_target_io * clone_to_tio ( struct bio * clone )
{
return container_of ( clone , struct dm_target_io , clone ) ;
}
2017-12-12 07:17:47 +03:00
void * dm_per_bio_data ( struct bio * bio , size_t data_size )
{
2022-03-20 01:04:20 +03:00
if ( ! dm_tio_flagged ( clone_to_tio ( bio ) , DM_TIO_INSIDE_DM_IO ) )
2021-01-12 08:52:00 +03:00
return ( char * ) bio - DM_TARGET_IO_BIO_OFFSET - data_size ;
return ( char * ) bio - DM_IO_BIO_OFFSET - data_size ;
2017-12-12 07:17:47 +03:00
}
EXPORT_SYMBOL_GPL ( dm_per_bio_data ) ;
struct bio * dm_bio_from_per_bio_data ( void * data , size_t data_size )
{
struct dm_io * io = ( struct dm_io * ) ( ( char * ) data + data_size ) ;
if ( io->magic == DM_IO_MAGIC )
2021-01-12 08:52:00 +03:00
return ( struct bio * ) ( ( char * ) io + DM_IO_BIO_OFFSET ) ;
2017-12-12 07:17:47 +03:00
BUG_ON ( io->magic != DM_TIO_MAGIC ) ;
2021-01-12 08:52:00 +03:00
return ( struct bio * ) ( ( char * ) io + DM_TARGET_IO_BIO_OFFSET ) ;
2017-12-12 07:17:47 +03:00
}
EXPORT_SYMBOL_GPL ( dm_bio_from_per_bio_data ) ;
unsigned dm_bio_get_target_bio_nr ( const struct bio * bio )
{
return container_of ( bio , struct dm_target_io , clone )->target_bio_nr ;
}
EXPORT_SYMBOL_GPL ( dm_bio_get_target_bio_nr ) ;
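The pointer arithmetic used by clone_to_tio() and dm_per_bio_data() above relies on the clone bio being embedded in a larger structure, with the per-bio data placed immediately before that structure. A minimal user-space sketch, assuming hypothetical `toy_*` types and a local `toy_container_of` macro rather than the kernel's own definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Local equivalent of the container_of idiom: member pointer to owner. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded object, standing in for struct bio. */
struct toy_bio {
	unsigned long long sector;
};

/* Hypothetical wrapper, standing in for struct dm_target_io. */
struct toy_tio {
	unsigned int target_bio_nr;
	struct toy_bio clone;
};

/* Recover the wrapper from a pointer to its embedded clone. */
struct toy_tio *toy_clone_to_tio(struct toy_bio *clone)
{
	assert(clone != NULL);
	return toy_container_of(clone, struct toy_tio, clone);
}

/*
 * Per-bio data is assumed to sit directly in front of the wrapper, so it is
 * found by stepping back over the wrapper header and then over the data.
 */
void *toy_per_bio_data(struct toy_bio *clone, size_t data_size)
{
	return (char *)clone - offsetof(struct toy_tio, clone) - data_size;
}
```

The layout is purely a convention between the allocator and these accessors. Nothing in the embedded object records where its container or its private data live, which is why the offsets must be compile-time constants.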
2006-06-26 11:27:21 +04:00
# define MINOR_ALLOCED ((void *)-1)
2016-02-22 20:16:21 +03:00
# define DM_NUMA_NODE NUMA_NO_NODE
static int dm_numa_node = DM_NUMA_NODE ;
2016-01-29 00:52:56 +03:00
2021-02-10 23:26:23 +03:00
# define DEFAULT_SWAP_BIOS (8 * 1048576 / PAGE_SIZE)
static int swap_bios = DEFAULT_SWAP_BIOS ;
static int get_swap_bios ( void )
{
int latch = READ_ONCE ( swap_bios ) ;
if ( unlikely ( latch <= 0 ) )
latch = DEFAULT_SWAP_BIOS ;
return latch ;
}
2014-08-13 22:53:43 +04:00
struct table_device {
struct list_head list ;
2017-10-20 10:37:39 +03:00
refcount_t count ;
2014-08-13 22:53:43 +04:00
struct dm_dev dm_dev ;
} ;
2013-09-13 02:06:12 +04:00
/*
* Bio-based DM's mempools' reserved IOs set by the user.
*/
2016-05-12 23:28:10 +03:00
# define RESERVED_BIO_BASED_IOS 16
2013-09-13 02:06:12 +04:00
static unsigned reserved_bio_based_ios = RESERVED_BIO_BASED_IOS ;
2016-02-22 20:16:21 +03:00
static int __dm_get_module_param_int ( int * module_param , int min , int max )
{
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 00:07:29 +03:00
int param = READ_ONCE ( * module_param ) ;
2016-02-22 20:16:21 +03:00
int modified_param = 0 ;
bool modified = true ;
if ( param < min )
modified_param = min ;
else if ( param > max )
modified_param = max ;
else
modified = false ;
if ( modified ) {
( void ) cmpxchg ( module_param , param , modified_param ) ;
param = modified_param ;
}
return param ;
}
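The snapshot-then-cmpxchg pattern used by __dm_get_module_param_int() can be approximated in user space with C11 atomics. This is a sketch under assumptions: `toy_clamp_param` is a hypothetical name, atomic_load_explicit() stands in for READ_ONCE(), and atomic_compare_exchange_strong() stands in for cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Read the parameter once, clamp the snapshot into [min, max], and write the
 * clamped value back only if nobody changed the parameter in the meantime.
 * The clamped snapshot is returned either way.
 */
int toy_clamp_param(atomic_int *param, int min, int max)
{
	int seen = atomic_load_explicit(param, memory_order_relaxed);
	int fixed = seen;

	assert(min <= max);

	if (seen < min)
		fixed = min;
	else if (seen > max)
		fixed = max;

	if (fixed != seen) {
		/* A lost race is fine: a concurrent writer's value is kept. */
		(void)atomic_compare_exchange_strong(param, &seen, fixed);
	}
	return fixed;
}
```

Ignoring the compare-and-exchange result is deliberate in this sketch: if a concurrent writer stored a new value, that value is left in place and is validated on the next call.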
2016-05-12 23:28:10 +03:00
unsigned __dm_get_module_param ( unsigned * module_param ,
unsigned def , unsigned max )
2013-09-13 02:06:12 +04:00
{
2017-10-24 00:07:29 +03:00
unsigned param = READ_ONCE ( * module_param ) ;
2015-02-28 06:25:26 +03:00
unsigned modified_param = 0 ;
2013-09-13 02:06:12 +04:00
2015-02-28 06:25:26 +03:00
if ( ! param )
modified_param = def ;
else if ( param > max )
modified_param = max ;
2013-09-13 02:06:12 +04:00
2015-02-28 06:25:26 +03:00
if ( modified_param ) {
( void ) cmpxchg ( module_param , param , modified_param ) ;
param = modified_param ;
2013-09-13 02:06:12 +04:00
}
2015-02-28 06:25:26 +03:00
return param ;
2013-09-13 02:06:12 +04:00
}
2013-09-13 02:06:12 +04:00
unsigned dm_get_reserved_bio_based_ios ( void )
{
2015-02-28 06:25:26 +03:00
return __dm_get_module_param ( & reserved_bio_based_ios ,
2016-05-12 23:28:10 +03:00
RESERVED_BIO_BASED_IOS , DM_RESERVED_MAX_IOS ) ;
2013-09-13 02:06:12 +04:00
}
EXPORT_SYMBOL_GPL ( dm_get_reserved_bio_based_ios ) ;
2016-02-22 20:16:21 +03:00
static unsigned dm_get_numa_node ( void )
{
return __dm_get_module_param_int ( & dm_numa_node ,
DM_NUMA_NODE , num_online_nodes ( ) - 1 ) ;
}
2005-04-17 02:20:36 +04:00
static int __init local_init ( void )
{
2019-02-20 23:37:44 +03:00
int r ;
2014-12-06 01:11:05 +03:00
2007-10-20 01:48:00 +04:00
r = dm_uevent_init ( ) ;
2008-10-21 20:45:08 +04:00
if ( r )
2019-02-20 23:37:44 +03:00
return r ;
2007-10-20 01:48:00 +04:00
2014-06-14 21:44:31 +04:00
deferred_remove_workqueue = alloc_workqueue ( "kdmremove" , WQ_UNBOUND , 1 ) ;
if ( ! deferred_remove_workqueue ) {
r = - ENOMEM ;
goto out_uevent_exit ;
}
2005-04-17 02:20:36 +04:00
_major = major ;
r = register_blkdev ( _major , _name ) ;
2008-10-21 20:45:08 +04:00
if ( r < 0 )
2014-06-14 21:44:31 +04:00
goto out_free_workqueue ;
2005-04-17 02:20:36 +04:00
if ( ! _major )
_major = r ;
return 0 ;
2008-10-21 20:45:08 +04:00
2014-06-14 21:44:31 +04:00
out_free_workqueue :
destroy_workqueue ( deferred_remove_workqueue ) ;
2008-10-21 20:45:08 +04:00
out_uevent_exit :
dm_uevent_exit ( ) ;
return r ;
2005-04-17 02:20:36 +04:00
}
static void local_exit ( void )
{
2013-11-02 02:27:41 +04:00
flush_scheduled_work ( ) ;
2014-06-14 21:44:31 +04:00
destroy_workqueue ( deferred_remove_workqueue ) ;
2013-11-02 02:27:41 +04:00
2007-07-17 15:03:46 +04:00
unregister_blkdev ( _major , _name ) ;
2007-10-20 01:48:00 +04:00
dm_uevent_exit ( ) ;
2005-04-17 02:20:36 +04:00
_major = 0 ;
DMINFO ( "cleaned up" ) ;
}
2008-02-08 05:09:51 +03:00
static int ( * _inits [ ] ) ( void ) __initdata = {
2005-04-17 02:20:36 +04:00
local_init ,
dm_target_init ,
dm_linear_init ,
dm_stripe_init ,
2009-12-11 02:51:57 +03:00
dm_io_init ,
2008-04-25 00:43:49 +04:00
dm_kcopyd_init ,
2005-04-17 02:20:36 +04:00
dm_interface_init ,
2013-08-16 18:54:23 +04:00
dm_statistics_init ,
2005-04-17 02:20:36 +04:00
} ;
2008-02-08 05:09:51 +03:00
static void ( * _exits [ ] ) ( void ) = {
2005-04-17 02:20:36 +04:00
local_exit ,
dm_target_exit ,
dm_linear_exit ,
dm_stripe_exit ,
2009-12-11 02:51:57 +03:00
dm_io_exit ,
2008-04-25 00:43:49 +04:00
dm_kcopyd_exit ,
2005-04-17 02:20:36 +04:00
dm_interface_exit ,
2013-08-16 18:54:23 +04:00
dm_statistics_exit ,
2005-04-17 02:20:36 +04:00
} ;
static int __init dm_init ( void )
{
const int count = ARRAY_SIZE ( _inits ) ;
int r , i ;
2021-08-14 00:37:59 +03:00
# if (IS_ENABLED(CONFIG_IMA) && !IS_ENABLED(CONFIG_IMA_DISABLE_HTABLE))
DMWARN ( "CONFIG_IMA_DISABLE_HTABLE is disabled."
" Duplicate IMA measurements will not be recorded in the IMA log." ) ;
# endif
2005-04-17 02:20:36 +04:00
for ( i = 0 ; i < count ; i++ ) {
r = _inits [ i ] ( ) ;
if ( r )
goto bad ;
}
return 0 ;
2021-08-14 00:37:59 +03:00
bad :
2005-04-17 02:20:36 +04:00
while ( i-- )
_exits [ i ] ( ) ;
return r ;
}
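The init/exit table pattern used by dm_init() and dm_exit() can be shown in isolation: stages are started in order, and on failure only the stages that already succeeded are undone, in reverse. The sketch below is hypothetical throughout; the `toy_*` names, the three stages, and the trace buffer exist only to make the ordering observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical test hooks: which stage should fail, and a call trace. */
static int fail_stage = -1;
static char trace[16];
static size_t trace_len;

static void note(char c)
{
	assert(trace_len + 1 < sizeof(trace));
	trace[trace_len++] = c;
	trace[trace_len] = '\0';
}

/* Clear the trace and choose the stage index that fails (-1 for none). */
void toy_reset(int stage_to_fail)
{
	fail_stage = stage_to_fail;
	memset(trace, 0, sizeof(trace));
	trace_len = 0;
}

const char *toy_trace(void)
{
	return trace;
}

static int init_a(void) { note('A'); return fail_stage == 0 ? -1 : 0; }
static int init_b(void) { note('B'); return fail_stage == 1 ? -1 : 0; }
static int init_c(void) { note('C'); return fail_stage == 2 ? -1 : 0; }
static void exit_a(void) { note('a'); }
static void exit_b(void) { note('b'); }
static void exit_c(void) { note('c'); }

/* The two tables must list matching stages in the same order. */
static int (*toy_inits[])(void) = { init_a, init_b, init_c };
static void (*toy_exits[])(void) = { exit_a, exit_b, exit_c };

/* Run every init in order; on failure, unwind the completed stages. */
int toy_bring_up(void)
{
	size_t i;
	int r = 0;

	for (i = 0; i < TOY_ARRAY_SIZE(toy_inits); i++) {
		r = toy_inits[i]();
		if (r)
			goto bad;
	}
	return 0;
bad:
	/* i indexes the failed stage; its own exit handler is not called. */
	while (i--)
		toy_exits[i]();
	return r;
}

/* Tear everything down in reverse order of initialization. */
void toy_tear_down(void)
{
	size_t i = TOY_ARRAY_SIZE(toy_exits);

	while (i--)
		toy_exits[i]();
}
```

Keeping the two tables index-aligned is what makes the `while (i--)` unwind correct, so adding a stage means adding an entry at the same position in both arrays.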
static void __exit dm_exit ( void )
{
int i = ARRAY_SIZE ( _exits ) ;
while ( i-- )
_exits [ i ] ( ) ;
2011-08-02 15:32:01 +04:00
/*
* Should be empty by this point.
*/
idr_destroy ( & _minor_idr ) ;
2005-04-17 02:20:36 +04:00
}
/*
* Block device functions
*/
2009-12-11 02:52:20 +03:00
int dm_deleting_md ( struct mapped_device * md )
{
return test_bit ( DMF_DELETING , & md->flags ) ;
}
2008-03-02 18:29:31 +03:00
static int dm_blk_open ( struct block_device * bdev , fmode_t mode )
2005-04-17 02:20:36 +04:00
{
struct mapped_device * md ;
2006-06-26 11:27:23 +04:00
spin_lock ( & _minor_lock ) ;
2008-03-02 18:29:31 +03:00
md = bdev->bd_disk->private_data ;
2006-06-26 11:27:23 +04:00
if ( ! md )
goto out ;
2006-06-26 11:27:34 +04:00
if ( test_bit ( DMF_FREEING , & md->flags ) ||
2009-12-11 02:52:20 +03:00
dm_deleting_md ( md ) ) {
2006-06-26 11:27:23 +04:00
md = NULL ;
goto out ;
}
2005-04-17 02:20:36 +04:00
dm_get ( md ) ;
2006-06-26 11:27:34 +04:00
atomic_inc ( & md->open_count ) ;
2006-06-26 11:27:23 +04:00
out :
spin_unlock ( & _minor_lock ) ;
return md ? 0 : - ENXIO ;
2005-04-17 02:20:36 +04:00
}
2013-05-06 05:52:57 +04:00
static void dm_blk_close ( struct gendisk * disk , fmode_t mode )
2005-04-17 02:20:36 +04:00
{
2015-03-24 00:01:43 +03:00
struct mapped_device * md ;
2010-08-07 20:25:34 +04:00
2011-01-13 22:59:48 +03:00
spin_lock ( & _minor_lock ) ;
2015-03-24 00:01:43 +03:00
md = disk->private_data ;
if ( WARN_ON ( ! md ) )
goto out ;
2013-11-02 02:27:41 +04:00
if ( atomic_dec_and_test ( & md->open_count ) &&
( test_bit ( DMF_DEFERRED_REMOVE , & md->flags ) ) )
2014-06-14 21:44:31 +04:00
queue_work ( deferred_remove_workqueue , & deferred_remove_work ) ;
2013-11-02 02:27:41 +04:00
2005-04-17 02:20:36 +04:00
dm_put ( md ) ;
2015-03-24 00:01:43 +03:00
out :
2011-01-13 22:59:48 +03:00
spin_unlock ( & _minor_lock ) ;
2005-04-17 02:20:36 +04:00
}
2006-06-26 11:27:34 +04:00
int dm_open_count ( struct mapped_device * md )
{
return atomic_read ( & md->open_count ) ;
}
/*
* Guarantees nothing is using the device before it's deleted.
*/
2013-11-02 02:27:41 +04:00
int dm_lock_for_deletion ( struct mapped_device * md , bool mark_deferred , bool only_deferred )
2006-06-26 11:27:34 +04:00
{
int r = 0 ;
spin_lock ( & _minor_lock ) ;
2013-11-02 02:27:41 +04:00
if ( dm_open_count ( md ) ) {
2006-06-26 11:27:34 +04:00
r = - EBUSY ;
2013-11-02 02:27:41 +04:00
if ( mark_deferred )
set_bit ( DMF_DEFERRED_REMOVE , & md->flags ) ;
} else if ( only_deferred && ! test_bit ( DMF_DEFERRED_REMOVE , & md->flags ) )
r = - EEXIST ;
2006-06-26 11:27:34 +04:00
else
set_bit ( DMF_DELETING , & md->flags ) ;
spin_unlock ( & _minor_lock ) ;
return r ;
}
2013-11-02 02:27:41 +04:00
int dm_cancel_deferred_remove ( struct mapped_device * md )
{
int r = 0 ;
spin_lock ( & _minor_lock ) ;
if ( test_bit ( DMF_DELETING , & md->flags ) )
r = - EBUSY ;
else
clear_bit ( DMF_DEFERRED_REMOVE , & md->flags ) ;
spin_unlock ( & _minor_lock ) ;
return r ;
}
static void do_deferred_remove ( struct work_struct * w )
{
dm_deferred_remove ( ) ;
}
2006-03-27 13:17:54 +04:00
static int dm_blk_getgeo ( struct block_device * bdev , struct hd_geometry * geo )
{
struct mapped_device * md = bdev->bd_disk->private_data ;
return dm_get_geometry ( md , geo ) ;
}
2018-04-03 22:05:12 +03:00
static int dm_prepare_ioctl ( struct mapped_device * md , int * srcu_idx ,
2018-04-03 23:54:10 +03:00
struct block_device * * bdev )
2006-10-03 12:15:15 +04:00
{
2016-02-18 23:44:39 +03:00
struct dm_target * tgt ;
2013-07-11 02:41:15 +04:00
struct dm_table * map ;
2018-04-03 22:05:12 +03:00
int r ;
2006-10-03 12:15:15 +04:00
2013-07-11 02:41:15 +04:00
retry :
2015-10-15 15:10:50 +03:00
r = - ENOTTY ;
2018-04-03 22:05:12 +03:00
map = dm_get_live_table ( md , srcu_idx ) ;
2006-10-03 12:15:15 +04:00
if ( ! map || ! dm_table_get_size ( map ) )
2018-04-03 22:05:12 +03:00
return r ;
2006-10-03 12:15:15 +04:00
/* We only support devices that have a single target */
if ( dm_table_get_num_targets ( map ) != 1 )
2018-04-03 22:05:12 +03:00
return r ;
2006-10-03 12:15:15 +04:00
2016-02-18 23:44:39 +03:00
tgt = dm_table_get_target ( map , 0 ) ;
if ( ! tgt->type->prepare_ioctl )
2018-04-03 22:05:12 +03:00
return r ;
2018-02-22 21:31:20 +03:00
2018-04-03 22:05:12 +03:00
if ( dm_suspended_md ( md ) )
return - EAGAIN ;
2006-10-03 12:15:15 +04:00
2018-04-03 23:54:10 +03:00
r = tgt->type->prepare_ioctl ( tgt , bdev ) ;
2015-11-17 12:39:26 +03:00
if ( r == - ENOTCONN && ! fatal_signal_pending ( current ) ) {
2018-04-03 22:05:12 +03:00
dm_put_live_table ( md , * srcu_idx ) ;
2013-07-11 02:41:15 +04:00
msleep ( 10 ) ;
goto retry ;
}
2018-04-03 22:05:12 +03:00
2015-10-15 15:10:50 +03:00
return r ;
}
2018-04-03 22:05:12 +03:00
static void dm_unprepare_ioctl ( struct mapped_device * md , int srcu_idx )
{
dm_put_live_table ( md , srcu_idx ) ;
}
2015-10-15 15:10:50 +03:00
static int dm_blk_ioctl ( struct block_device * bdev , fmode_t mode ,
unsigned int cmd , unsigned long arg )
{
struct mapped_device * md = bdev->bd_disk->private_data ;
2018-04-03 22:05:12 +03:00
int r , srcu_idx ;
2015-10-15 15:10:50 +03:00
2018-04-03 23:54:10 +03:00
r = dm_prepare_ioctl ( md , & srcu_idx , & bdev ) ;
2015-10-15 15:10:50 +03:00
if ( r < 0 )
2018-04-03 22:05:12 +03:00
goto out ;
2013-07-11 02:41:15 +04:00
2015-10-15 15:10:50 +03:00
if ( r > 0 ) {
/*
2017-02-04 12:45:03 +03:00
* Target determined this ioctl is being issued against a
* subset of the parent bdev; require extra privileges.
2015-10-15 15:10:50 +03:00
*/
2017-02-04 12:45:03 +03:00
if ( ! capable ( CAP_SYS_RAWIO ) ) {
2021-01-07 02:19:05 +03:00
DMDEBUG_LIMIT (
2017-02-04 12:45:03 +03:00
"%s: sending ioctl %x to DM device without required privilege." ,
current->comm , cmd ) ;
r = - ENOIOCTLCMD ;
2015-10-15 15:10:50 +03:00
goto out ;
2017-02-04 12:45:03 +03:00
}
2015-10-15 15:10:50 +03:00
}
2013-07-11 02:41:15 +04:00
2020-11-03 13:00:18 +03:00
if ( ! bdev->bd_disk->fops->ioctl )
r = - ENOTTY ;
else
r = bdev->bd_disk->fops->ioctl ( bdev , mode , cmd , arg ) ;
2015-10-15 15:10:50 +03:00
out :
2018-04-03 22:05:12 +03:00
dm_unprepare_ioctl ( md , srcu_idx ) ;
2006-10-03 12:15:15 +04:00
return r ;
}
2020-09-17 19:59:36 +03:00
u64 dm_start_time_ns_from_clone ( struct bio * bio )
{
2022-02-02 19:00:58 +03:00
return jiffies_to_nsecs ( clone_to_tio ( bio )->io->start_time ) ;
2020-09-17 19:59:36 +03:00
}
EXPORT_SYMBOL_GPL ( dm_start_time_ns_from_clone ) ;
2022-02-18 07:39:57 +03:00
static bool bio_is_flush_with_data ( struct bio * bio )
2020-09-17 19:59:36 +03:00
{
2022-02-18 07:39:57 +03:00
return ( ( bio->bi_opf & REQ_PREFLUSH ) && bio->bi_iter.bi_size ) ;
}
2022-04-12 11:56:11 +03:00
static void dm_io_acct ( struct dm_io * io , bool end )
2022-02-18 07:39:57 +03:00
{
2022-04-12 11:56:11 +03:00
struct dm_stats_aux * stats_aux = & io->stats_aux ;
unsigned long start_time = io->start_time ;
struct mapped_device * md = io->md ;
struct bio * bio = io->orig_bio ;
2022-04-12 11:56:12 +03:00
unsigned int sectors ;
/*
* If REQ_PREFLUSH set, don't account payload, it will be
* submitted (and accounted) after this flush completes.
*/
if ( bio_is_flush_with_data ( bio ) )
sectors = 0 ;
2022-04-12 11:56:13 +03:00
else if ( likely ( ! ( dm_io_flagged ( io , DM_IO_WAS_SPLIT ) ) ) )
2022-04-12 11:56:12 +03:00
sectors = bio_sectors ( bio ) ;
2022-04-12 11:56:13 +03:00
else
sectors = io - > sectors ;
2022-02-18 07:39:57 +03:00
if ( ! end )
2022-04-12 11:56:12 +03:00
bdev_start_io_acct ( bio - > bi_bdev , sectors , bio_op ( bio ) ,
start_time ) ;
2022-02-18 07:39:57 +03:00
else
2022-04-12 11:56:12 +03:00
bdev_end_io_acct ( bio - > bi_bdev , bio_op ( bio ) , start_time ) ;
2020-09-17 19:59:36 +03:00
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & stats_enabled ) & &
2022-04-12 11:56:13 +03:00
unlikely ( dm_stats_used ( & md - > stats ) ) ) {
sector_t sector ;
if ( likely ( ! dm_io_flagged ( io , DM_IO_WAS_SPLIT ) ) )
sector = bio - > bi_iter . bi_sector ;
else
sector = bio_end_sector ( bio ) - io - > sector_offset ;
2020-09-17 19:59:36 +03:00
dm_stats_account_io ( & md - > stats , bio_data_dir ( bio ) ,
2022-04-12 11:56:13 +03:00
sector , sectors ,
2022-02-18 07:39:57 +03:00
end , start_time , stats_aux ) ;
2022-04-12 11:56:13 +03:00
}
2022-02-18 07:39:57 +03:00
}
2022-04-12 11:56:10 +03:00
static void __dm_start_io_acct ( struct dm_io * io )
2022-02-18 07:39:57 +03:00
{
2022-04-12 11:56:11 +03:00
dm_io_acct ( io , false ) ;
2020-09-17 19:59:36 +03:00
}
2022-02-18 07:40:32 +03:00
static void dm_start_io_acct ( struct dm_io * io , struct bio * clone )
2020-09-17 19:59:36 +03:00
{
2022-02-18 07:40:32 +03:00
/*
* Ensure IO accounting is only ever started once .
*/
2022-03-25 21:12:47 +03:00
if ( dm_io_flagged ( io , DM_IO_ACCOUNTED ) )
return ;
/* Expect no possibility for race unless DM_TIO_IS_DUPLICATE_BIO. */
if ( ! clone | | likely ( dm_tio_is_normal ( clone_to_tio ( clone ) ) ) ) {
2022-03-18 07:15:28 +03:00
dm_io_set_flag ( io , DM_IO_ACCOUNTED ) ;
} else {
unsigned long flags ;
2022-03-20 01:04:20 +03:00
/* Can afford locking given DM_TIO_IS_DUPLICATE_BIO */
2022-03-20 01:41:16 +03:00
spin_lock_irqsave ( & io - > lock , flags ) ;
2022-03-18 07:15:28 +03:00
dm_io_set_flag ( io , DM_IO_ACCOUNTED ) ;
2022-03-20 01:41:16 +03:00
spin_unlock_irqrestore ( & io - > lock , flags ) ;
2022-03-18 07:15:28 +03:00
}
2020-09-17 19:59:36 +03:00
2022-04-12 11:56:10 +03:00
__dm_start_io_acct ( io ) ;
2022-02-18 07:40:32 +03:00
}
2020-09-17 19:59:36 +03:00
2022-04-12 11:56:10 +03:00
static void dm_end_io_acct ( struct dm_io * io )
2022-02-18 07:40:32 +03:00
{
2022-04-12 11:56:11 +03:00
dm_io_acct ( io , true ) ;
2020-09-17 19:59:36 +03:00
}
2017-12-09 23:16:42 +03:00
static struct dm_io * alloc_io ( struct mapped_device * md , struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2017-12-12 07:17:47 +03:00
struct dm_io * io ;
struct dm_target_io * tio ;
struct bio * clone ;
2022-06-08 09:34:06 +03:00
clone = bio_alloc_clone ( NULL , bio , GFP_NOIO , & md - > mempools - > io_bs ) ;
2022-05-11 16:38:38 +03:00
/* Set default bdev, but target must bio_set_dev() before issuing IO */
clone - > bi_bdev = md - > disk - > part0 ;
2017-12-12 07:17:47 +03:00
2022-02-02 19:00:58 +03:00
tio = clone_to_tio ( clone ) ;
2022-03-20 01:04:20 +03:00
tio - > flags = 0 ;
dm_tio_set_flag ( tio , DM_TIO_INSIDE_DM_IO ) ;
2017-12-12 07:17:47 +03:00
tio - > io = NULL ;
io = container_of ( tio , struct dm_io , tio ) ;
io - > magic = DM_IO_MAGIC ;
2022-03-17 20:52:06 +03:00
io - > status = BLK_STS_OK ;
2022-04-12 11:56:15 +03:00
/* one ref is for submission, the other is for completion */
atomic_set ( & io - > io_count , 2 ) ;
2022-02-18 07:40:02 +03:00
this_cpu_inc ( * md - > pending_io ) ;
2022-04-12 11:56:13 +03:00
io - > orig_bio = bio ;
2017-12-09 23:16:42 +03:00
io - > md = md ;
2022-03-20 01:41:16 +03:00
spin_lock_init ( & io - > lock ) ;
2022-01-28 18:58:41 +03:00
io - > start_time = jiffies ;
2022-03-18 07:15:28 +03:00
io - > flags = 0 ;
2017-12-12 07:17:47 +03:00
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & stats_enabled ) )
dm_stats_record_start ( & md - > stats , & io - > stats_aux ) ;
2017-12-12 07:17:47 +03:00
return io ;
2005-04-17 02:20:36 +04:00
}
2022-02-18 07:40:18 +03:00
static void free_io ( struct dm_io * io )
2005-04-17 02:20:36 +04:00
{
2017-12-12 07:17:47 +03:00
bio_put ( & io - > tio . clone ) ;
}
2022-02-02 19:01:03 +03:00
static struct bio * alloc_tio ( struct clone_info * ci , struct dm_target * ti ,
2022-05-11 16:38:38 +03:00
unsigned target_bio_nr , unsigned * len , gfp_t gfp_mask )
2017-12-12 07:17:47 +03:00
{
struct dm_target_io * tio ;
2022-02-18 07:40:25 +03:00
struct bio * clone ;
2017-12-12 07:17:47 +03:00
if ( ! ci - > io - > tio . io ) {
/* the dm_target_io embedded in ci->io is available */
tio = & ci - > io - > tio ;
2022-02-18 07:40:25 +03:00
/* alloc_io() already initialized embedded clone */
clone = & tio - > clone ;
2017-12-12 07:17:47 +03:00
} else {
2022-05-11 16:38:38 +03:00
struct mapped_device * md = ci - > io - > md ;
2022-06-08 09:34:06 +03:00
clone = bio_alloc_clone ( NULL , ci - > bio , gfp_mask ,
& md - > mempools - > bs ) ;
2017-12-12 07:17:47 +03:00
if ( ! clone )
return NULL ;
2022-05-11 16:38:38 +03:00
/* Set default bdev, but target must bio_set_dev() before issuing IO */
clone - > bi_bdev = md - > disk - > part0 ;
2017-12-12 07:17:47 +03:00
2022-03-05 05:08:04 +03:00
/* REQ_DM_POLL_LIST shouldn't be inherited */
clone - > bi_opf & = ~ REQ_DM_POLL_LIST ;
2022-02-02 19:00:58 +03:00
tio = clone_to_tio ( clone ) ;
2022-03-20 01:04:20 +03:00
tio - > flags = 0 ; /* also clears DM_TIO_INSIDE_DM_IO */
2017-12-12 07:17:47 +03:00
}
tio - > magic = DM_TIO_MAGIC ;
tio - > io = ci - > io ;
tio - > ti = ti ;
tio - > target_bio_nr = target_bio_nr ;
2022-02-02 19:01:01 +03:00
tio - > len_ptr = len ;
2022-02-18 07:40:23 +03:00
tio - > old_sector = 0 ;
2017-12-12 07:17:47 +03:00
2022-02-18 07:40:25 +03:00
if ( len ) {
clone - > bi_iter . bi_size = to_bytes ( * len ) ;
if ( bio_integrity ( clone ) )
bio_integrity_trim ( clone ) ;
}
2017-12-12 07:17:47 +03:00
2022-02-18 07:40:25 +03:00
return clone ;
2005-04-17 02:20:36 +04:00
}
2022-02-02 19:01:03 +03:00
static void free_tio ( struct bio * clone )
2005-04-17 02:20:36 +04:00
{
2022-03-20 01:04:20 +03:00
if ( dm_tio_flagged ( clone_to_tio ( clone ) , DM_TIO_INSIDE_DM_IO ) )
2017-12-12 07:17:47 +03:00
return ;
2022-02-02 19:01:03 +03:00
bio_put ( clone ) ;
2005-04-17 02:20:36 +04:00
}
/*
* Add the bio to the list of deferred io .
*/
2009-04-09 03:27:15 +04:00
static void queue_io ( struct mapped_device * md , struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2010-09-08 20:07:01 +04:00
unsigned long flags ;
2005-04-17 02:20:36 +04:00
2010-09-08 20:07:01 +04:00
spin_lock_irqsave ( & md - > deferred_lock , flags ) ;
2005-04-17 02:20:36 +04:00
bio_list_add ( & md - > deferred , bio ) ;
2010-09-08 20:07:01 +04:00
spin_unlock_irqrestore ( & md - > deferred_lock , flags ) ;
2010-09-08 20:07:00 +04:00
queue_work ( md - > wq , & md - > work ) ;
2005-04-17 02:20:36 +04:00
}
/*
* Everyone ( including functions in this file ) , should use this
* function to access the md - > map field , and make sure they call
2013-07-11 02:41:18 +04:00
* dm_put_live_table ( ) when finished .
2005-04-17 02:20:36 +04:00
*/
2022-03-27 04:08:36 +03:00
struct dm_table * dm_get_live_table ( struct mapped_device * md ,
int * srcu_idx ) __acquires ( md - > io_barrier )
2005-04-17 02:20:36 +04:00
{
2013-07-11 02:41:18 +04:00
* srcu_idx = srcu_read_lock ( & md - > io_barrier ) ;
return srcu_dereference ( md - > map , & md - > io_barrier ) ;
}
2005-04-17 02:20:36 +04:00
2022-03-27 04:08:36 +03:00
void dm_put_live_table ( struct mapped_device * md ,
int srcu_idx ) __releases ( md - > io_barrier )
2013-07-11 02:41:18 +04:00
{
srcu_read_unlock ( & md - > io_barrier , srcu_idx ) ;
}
void dm_sync_table ( struct mapped_device * md )
{
synchronize_srcu ( & md - > io_barrier ) ;
synchronize_rcu_expedited ( ) ;
}
/*
* A fast alternative to dm_get_live_table / dm_put_live_table .
* The caller must not block between these two functions .
*/
static struct dm_table * dm_get_live_table_fast ( struct mapped_device * md ) __acquires ( RCU )
{
rcu_read_lock ( ) ;
return rcu_dereference ( md - > map ) ;
}
2005-04-17 02:20:36 +04:00
2013-07-11 02:41:18 +04:00
static void dm_put_live_table_fast ( struct mapped_device * md ) __releases ( RCU )
{
rcu_read_unlock ( ) ;
2005-04-17 02:20:36 +04:00
}
2022-03-27 04:08:36 +03:00
static inline struct dm_table * dm_get_live_table_bio ( struct mapped_device * md ,
int * srcu_idx , struct bio * bio )
{
if ( bio - > bi_opf & REQ_NOWAIT )
return dm_get_live_table_fast ( md ) ;
else
return dm_get_live_table ( md , srcu_idx ) ;
}
static inline void dm_put_live_table_bio ( struct mapped_device * md , int srcu_idx ,
struct bio * bio )
{
if ( bio - > bi_opf & REQ_NOWAIT )
dm_put_live_table_fast ( md ) ;
else
dm_put_live_table ( md , srcu_idx ) ;
}
2018-04-03 22:05:12 +03:00
static char * _dm_claim_ptr = " I belong to device-mapper " ;
2014-08-13 22:53:43 +04:00
/*
* Open a table device so we can use it as a map destination .
*/
static int open_table_device ( struct table_device * td , dev_t dev ,
struct mapped_device * md )
{
struct block_device * bdev ;
2021-11-29 13:21:59 +03:00
u64 part_off ;
2014-08-13 22:53:43 +04:00
int r ;
BUG_ON ( td - > dm_dev . bdev ) ;
2018-02-22 21:31:20 +03:00
bdev = blkdev_get_by_dev ( dev , td - > dm_dev . mode | FMODE_EXCL , _dm_claim_ptr ) ;
2014-08-13 22:53:43 +04:00
if ( IS_ERR ( bdev ) )
return PTR_ERR ( bdev ) ;
r = bd_link_disk_holder ( bdev , dm_disk ( md ) ) ;
if ( r ) {
blkdev_put ( bdev , td - > dm_dev . mode | FMODE_EXCL ) ;
return r ;
}
td - > dm_dev . bdev = bdev ;
2021-11-29 13:21:59 +03:00
td - > dm_dev . dax_dev = fs_dax_get_by_bdev ( bdev , & part_off ) ;
2014-08-13 22:53:43 +04:00
return 0 ;
}
/*
* Close a table device that we ' ve been using .
*/
static void close_table_device ( struct table_device * td , struct mapped_device * md )
{
if ( ! td - > dm_dev . bdev )
return ;
bd_unlink_disk_holder ( td - > dm_dev . bdev , dm_disk ( md ) ) ;
blkdev_put ( td - > dm_dev . bdev , td - > dm_dev . mode | FMODE_EXCL ) ;
2017-04-12 23:37:44 +03:00
put_dax ( td - > dm_dev . dax_dev ) ;
2014-08-13 22:53:43 +04:00
td - > dm_dev . bdev = NULL ;
2017-04-12 23:37:44 +03:00
td - > dm_dev . dax_dev = NULL ;
2014-08-13 22:53:43 +04:00
}
static struct table_device * find_table_device ( struct list_head * l , dev_t dev ,
2019-05-10 20:48:37 +03:00
fmode_t mode )
{
2014-08-13 22:53:43 +04:00
struct table_device * td ;
list_for_each_entry ( td , l , list )
if ( td - > dm_dev . bdev - > bd_dev = = dev & & td - > dm_dev . mode = = mode )
return td ;
return NULL ;
}
int dm_get_table_device ( struct mapped_device * md , dev_t dev , fmode_t mode ,
2019-05-10 20:48:37 +03:00
struct dm_dev * * result )
{
2014-08-13 22:53:43 +04:00
int r ;
struct table_device * td ;
mutex_lock ( & md - > table_devices_lock ) ;
td = find_table_device ( & md - > table_devices , dev , mode ) ;
if ( ! td ) {
2016-02-22 20:16:21 +03:00
td = kmalloc_node ( sizeof ( * td ) , GFP_KERNEL , md - > numa_node_id ) ;
2014-08-13 22:53:43 +04:00
if ( ! td ) {
mutex_unlock ( & md - > table_devices_lock ) ;
return - ENOMEM ;
}
td - > dm_dev . mode = mode ;
td - > dm_dev . bdev = NULL ;
if ( ( r = open_table_device ( td , dev , md ) ) ) {
mutex_unlock ( & md - > table_devices_lock ) ;
kfree ( td ) ;
return r ;
}
format_dev_t ( td - > dm_dev . name , dev ) ;
2017-10-20 10:37:39 +03:00
refcount_set ( & td - > count , 1 ) ;
2014-08-13 22:53:43 +04:00
list_add ( & td - > list , & md - > table_devices ) ;
2017-10-20 10:37:39 +03:00
} else {
refcount_inc ( & td - > count ) ;
2014-08-13 22:53:43 +04:00
}
mutex_unlock ( & md - > table_devices_lock ) ;
* result = & td - > dm_dev ;
return 0 ;
}
void dm_put_table_device ( struct mapped_device * md , struct dm_dev * d )
{
struct table_device * td = container_of ( d , struct table_device , dm_dev ) ;
mutex_lock ( & md - > table_devices_lock ) ;
2017-10-20 10:37:39 +03:00
if ( refcount_dec_and_test ( & td - > count ) ) {
2014-08-13 22:53:43 +04:00
close_table_device ( td , md ) ;
list_del ( & td - > list ) ;
kfree ( td ) ;
}
mutex_unlock ( & md - > table_devices_lock ) ;
}
static void free_table_devices ( struct list_head * devices )
{
struct list_head * tmp , * next ;
list_for_each_safe ( tmp , next , devices ) {
struct table_device * td = list_entry ( tmp , struct table_device , list ) ;
DMWARN ( " dm_destroy: %s still exists with %d references " ,
2017-10-20 10:37:39 +03:00
td - > dm_dev . name , refcount_read ( & td - > count ) ) ;
2014-08-13 22:53:43 +04:00
kfree ( td ) ;
}
}
2006-03-27 13:17:54 +04:00
/*
* Get the geometry associated with a dm device
*/
int dm_get_geometry ( struct mapped_device * md , struct hd_geometry * geo )
{
* geo = md - > geometry ;
return 0 ;
}
/*
* Set the geometry of a device .
*/
int dm_set_geometry ( struct mapped_device * md , struct hd_geometry * geo )
{
sector_t sz = ( sector_t ) geo - > cylinders * geo - > heads * geo - > sectors ;
if ( geo - > start > sz ) {
DMWARN ( " Start sector is beyond the geometry limits. " ) ;
return - EINVAL ;
}
md - > geometry = * geo ;
return 0 ;
}
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it with -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request that
the device-mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
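
The dec_pending() rules above can be sketched in plain userspace C. This is
only an illustration under stated assumptions: the sketch_* names and the
status enum are hypothetical and are not the kernel's types. It shows a
requeue request overriding any error recorded by other split clones, and a
requeue turning into an I/O error when the noflush suspend is no longer in
progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's status values. */
enum sketch_status { SKETCH_OK = 0, SKETCH_EIO = 1, SKETCH_REQUEUE = 2 };

struct sketch_io {
	enum sketch_status status;	/* merged result of all clones so far */
	int pending;			/* split clones still in flight */
};

/* Record one clone's result; a requeue request overrides any error. */
static void sketch_record(struct sketch_io *io, enum sketch_status s)
{
	if (s == SKETCH_REQUEUE)
		io->status = SKETCH_REQUEUE;
	else if (s != SKETCH_OK && io->status != SKETCH_REQUEUE)
		io->status = s;
	io->pending--;
}

/*
 * Decide the final disposition once every clone has ended.
 * Returns true when the original bio should be pushed back for re-issue;
 * otherwise *result holds the status to report to the submitter.
 */
static bool sketch_finish(const struct sketch_io *io, bool noflush_suspending,
			  enum sketch_status *result)
{
	assert(io->pending == 0);

	if (io->status == SKETCH_REQUEUE) {
		if (noflush_suspending)
			return true;
		/* The suspend was aborted: fail the I/O instead. */
		*result = SKETCH_EIO;
		return false;
	}

	*result = io->status;
	return false;
}
```

A caller would invoke sketch_record() once per completed clone and
sketch_finish() only after the pending count has dropped to zero.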
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but is returned to applications
with -EIO.
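
The locking rule described above (check the flag and queue the bio as one
atomic step) can be sketched in userspace C. This is a rough model, not the
kernel code: the sketch_* names are hypothetical, a C11 atomic_flag spinlock
stands in for md->pushback_lock, and clearing the flag under the lock on the
abort path is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bio {
	struct sketch_bio *next;
};

struct sketch_md {
	atomic_flag pushback_lock;	/* stand-in for md->pushback_lock */
	bool noflush_suspending;	/* stand-in for DMF_NOFLUSH_SUSPENDING */
	struct sketch_bio *pushback;	/* head of the pushback list */
};

static void sketch_md_init(struct sketch_md *md, bool noflush)
{
	atomic_flag_clear(&md->pushback_lock);
	md->noflush_suspending = noflush;
	md->pushback = NULL;
}

static void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the lock is free */
}

static void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Queue the bio only if the noflush suspend is still in progress.
 * The flag test and the list insertion happen under the same lock, so an
 * aborted suspend cannot drain the list between the two steps.
 * Returns false when the caller must fail the bio instead.
 */
static bool sketch_push_back(struct sketch_md *md, struct sketch_bio *bio)
{
	bool queued = false;

	assert(bio != NULL);
	sketch_lock(&md->pushback_lock);
	if (md->noflush_suspending) {
		bio->next = md->pushback;
		md->pushback = bio;
		queued = true;
	}
	sketch_unlock(&md->pushback_lock);
	return queued;
}

/* Abort path: clear the flag and drain the list under the same lock. */
static struct sketch_bio *sketch_abort_suspend(struct sketch_md *md)
{
	struct sketch_bio *drained;

	sketch_lock(&md->pushback_lock);
	md->noflush_suspending = false;
	drained = md->pushback;
	md->pushback = NULL;
	sketch_unlock(&md->pushback_lock);
	return drained;
}
```

Once sketch_abort_suspend() has run, any later sketch_push_back() call
returns false, so no bio can be stranded on the list.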
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of the new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 13:41:09 +03:00
static int __noflush_suspending ( struct mapped_device * md )
{
return test_bit ( DMF_NOFLUSH_SUSPENDING , & md - > flags ) ;
}
2022-03-10 00:07:06 +03:00
static void dm_io_complete ( struct dm_io * io )
2005-04-17 02:20:36 +04:00
{
2017-06-03 10:38:06 +03:00
blk_status_t io_error ;
2009-03-16 20:44:36 +03:00
struct mapped_device * md = io - > md ;
2022-03-10 00:07:06 +03:00
struct bio * bio = io - > orig_bio ;
2005-04-17 02:20:36 +04:00
2022-03-10 00:07:06 +03:00
if ( io - > status = = BLK_STS_DM_REQUEUE ) {
unsigned long flags ;
/*
* Target requested pushing back the I / O .
*/
spin_lock_irqsave ( & md - > deferred_lock , flags ) ;
if ( __noflush_suspending ( md ) & &
! WARN_ON_ONCE ( dm_is_zone_write ( md , bio ) ) ) {
/* NOTE early return due to BLK_STS_DM_REQUEUE below */
bio_list_add_head ( & md - > deferred , bio ) ;
} else {
2006-12-08 13:41:09 +03:00
/*
2022-03-10 00:07:06 +03:00
* noflush suspend was interrupted or this is
* a write to a zoned target .
2006-12-08 13:41:09 +03:00
*/
2022-03-10 00:07:06 +03:00
io - > status = BLK_STS_IOERR ;
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it with -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request that the
device-mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
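The rule described in this section can be modelled in plain userspace C.
What follows is only a sketch under stated assumptions, not the kernel
code: every sketch_* name is hypothetical, the status values stand in for
the real error codes, and the noflush state is passed as a plain boolean
instead of being read from md->flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the status values a clone can report. */
enum sketch_status { SKETCH_OK = 0, SKETCH_IOERR = 1, SKETCH_REQUEUE = 2 };

/* What happens to the original bio once every clone has ended. */
enum sketch_action { SKETCH_COMPLETE, SKETCH_PUSHBACK, SKETCH_FAIL };

struct sketch_io {
	enum sketch_status status;	/* result saved for the original bio */
	int pending;			/* clones that have not ended yet */
};

/*
 * Save a clone's result. A requeue that was already recorded is kept
 * while a noflush suspend is in progress, so it wins over later errors.
 */
static void sketch_set_error(struct sketch_io *io, enum sketch_status error,
			     bool noflush_suspending)
{
	if (!(io->status == SKETCH_REQUEUE && noflush_suspending))
		io->status = error;
}

/* Account for one ended clone; returns true when it was the last one. */
static bool sketch_dec_pending(struct sketch_io *io, enum sketch_status error,
			       bool noflush_suspending)
{
	assert(io->pending > 0);
	if (error != SKETCH_OK)
		sketch_set_error(io, error, noflush_suspending);
	return --io->pending == 0;
}

/*
 * Decide the fate of the original bio. A saved requeue is pushed back
 * only if the noflush suspend is still in progress; if the suspend was
 * aborted in the meantime the bio fails (-EIO in the patch).
 */
static enum sketch_action sketch_finish(const struct sketch_io *io,
					bool noflush_suspending)
{
	if (io->status == SKETCH_REQUEUE)
		return noflush_suspending ? SKETCH_PUSHBACK : SKETCH_FAIL;
	return io->status == SKETCH_OK ? SKETCH_COMPLETE : SKETCH_FAIL;
}
```

In this model a bio split into two clones, one ending with a requeue and
the other with an I/O error, is pushed back as long as the suspend is
still running, and fails if the suspend has been aborted by then.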
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but returned to applications
with -EIO.
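The locking rule above, re-checking the flag under the same lock that
protects the list, can be sketched in userspace C. This is an
illustration under assumptions rather than the patch itself: a pthread
mutex stands in for the irq-disabling spinlock, all sketch_* names are
hypothetical, and the caller-supplied "deferred" pointer stands in for
md->deferred.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bio {
	struct sketch_bio *next;
	int result;			/* 0, or a negative errno */
};

struct sketch_md {
	pthread_mutex_t pushback_lock;	/* protects flag check + pushback */
	bool noflush_suspending;
	struct sketch_bio *pushback;
};

/*
 * Completion side: queue the bio only if the flag is still set, with
 * the check and the queueing done under one lock hold. Returns true if
 * the bio was queued; otherwise the bio is completed with -EIO.
 */
static bool sketch_push_back(struct sketch_md *md, struct sketch_bio *bio)
{
	bool queued = false;

	pthread_mutex_lock(&md->pushback_lock);
	if (md->noflush_suspending) {
		bio->next = md->pushback;
		md->pushback = bio;
		queued = true;
	}
	pthread_mutex_unlock(&md->pushback_lock);

	if (!queued)
		bio->result = -EIO;
	return queued;
}

/*
 * Suspend side, on abort: clear the flag and take the whole list under
 * the same lock, so no bio can be queued after the list was drained.
 * The drained bios are moved to *deferred; returns how many were moved.
 */
static int sketch_abort_suspend(struct sketch_md *md,
				struct sketch_bio **deferred)
{
	struct sketch_bio *list, *next;
	int moved = 0;

	pthread_mutex_lock(&md->pushback_lock);
	md->noflush_suspending = false;
	list = md->pushback;
	md->pushback = NULL;
	pthread_mutex_unlock(&md->pushback_lock);

	for (; list; list = next) {
		next = list->next;
		list->next = *deferred;
		*deferred = list;
		moved++;
	}
	return moved;
}
```

Because both sides take the lock, a completion either queues its bio
before the drain and has it moved to the deferred list, or sees the
cleared flag afterwards and returns -EIO. No bio is stranded on the
pushback list.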
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of the new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 13:41:09 +03:00
}
2022-03-10 00:07:06 +03:00
spin_unlock_irqrestore ( & md - > deferred_lock , flags ) ;
}
2006-12-08 13:41:09 +03:00
2022-03-10 00:07:06 +03:00
io_error = io - > status ;
2022-03-18 07:15:28 +03:00
if ( dm_io_flagged ( io , DM_IO_ACCOUNTED ) )
2022-04-12 11:56:10 +03:00
dm_end_io_acct ( io ) ;
2022-03-10 00:07:06 +03:00
else if ( ! io_error ) {
/*
* Must handle target that DM_MAPIO_SUBMITTED only to
* then bio_endio ( ) rather than dm_submit_bio_remap ( )
*/
2022-04-12 11:56:10 +03:00
__dm_start_io_acct ( io ) ;
dm_end_io_acct ( io ) ;
2022-03-10 00:07:06 +03:00
}
free_io ( io ) ;
smp_wmb ( ) ;
this_cpu_dec ( * md - > pending_io ) ;
2010-09-08 20:07:00 +04:00
2022-03-10 00:07:06 +03:00
/* nudge anyone waiting on suspend queue */
if ( unlikely ( wq_has_sleeper ( & md - > wait ) ) )
wake_up ( & md - > wait ) ;
2006-12-08 13:41:09 +03:00
2022-04-01 16:47:32 +03:00
if ( io_error = = BLK_STS_DM_REQUEUE | | io_error = = BLK_STS_AGAIN ) {
if ( bio - > bi_opf & REQ_POLLED ) {
/*
* Upper layer won't help us poll split bio ( io - > orig_bio
* may only reflect a subset of the pre - split original )
* so clear REQ_POLLED in case of requeue .
*/
2022-03-24 21:36:47 +03:00
bio_clear_polled ( bio ) ;
2022-04-01 16:47:32 +03:00
if ( io_error = = BLK_STS_AGAIN ) {
/* io_uring doesn't handle BLK_STS_AGAIN (yet) */
queue_io ( md , bio ) ;
}
}
2022-03-10 00:07:06 +03:00
return ;
}
if ( bio_is_flush_with_data ( bio ) ) {
/*
* Preflush done for flush with data , reissue
* without REQ_PREFLUSH .
*/
bio - > bi_opf & = ~ REQ_PREFLUSH ;
queue_io ( md , bio ) ;
} else {
/* done with normal IO or empty flush */
if ( io_error )
bio - > bi_status = io_error ;
bio_endio ( bio ) ;
}
}
2005-04-17 02:20:36 +04:00
/*
* Decrements the number of outstanding ios that a bio has been
* cloned into , completing the original io if necessary .
*/
2022-03-17 20:52:06 +03:00
static inline void __dm_io_dec_pending ( struct dm_io * io )
{
if ( atomic_dec_and_test ( & io - > io_count ) )
dm_io_complete ( io ) ;
}
static void dm_io_set_error ( struct dm_io * io , blk_status_t error )
2005-04-17 02:20:36 +04:00
{
2022-03-17 20:52:06 +03:00
unsigned long flags ;
2006-12-08 13:41:09 +03:00
/* Push-back supersedes any I/O errors */
2022-03-17 20:52:06 +03:00
spin_lock_irqsave ( & io - > lock , flags ) ;
if ( ! ( io - > status = = BLK_STS_DM_REQUEUE & &
__noflush_suspending ( io - > md ) ) ) {
io - > status = error ;
2005-04-17 02:20:36 +04:00
}
2022-03-17 20:52:06 +03:00
spin_unlock_irqrestore ( & io - > lock , flags ) ;
}
2005-04-17 02:20:36 +04:00
2022-04-12 11:56:14 +03:00
static void dm_io_dec_pending ( struct dm_io * io , blk_status_t error )
2022-03-17 20:52:06 +03:00
{
if ( unlikely ( error ) )
dm_io_set_error ( io , error ) ;
__dm_io_dec_pending ( io ) ;
2005-04-17 02:20:36 +04:00
}
dm: disable DISCARD if the underlying storage no longer supports it
Some storage devices report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device. This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.
The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path. But subsequent discards can
cause path failures. Any discard sent to a path which has already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path. As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path. This
cycle continues as discards are sent until all paths fail.
Fix this by training DM core to disable DISCARD if the underlying
storage already did so.
Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclusive nature of the IO operations in question.
Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-03 19:23:11 +03:00
void disable_discard ( struct mapped_device * md )
{
struct queue_limits * limits = dm_get_queue_limits ( md ) ;
/* device doesn't really support DISCARD, disable it */
limits - > max_discard_sectors = 0 ;
}
2017-04-05 20:21:05 +03:00
void disable_write_zeroes ( struct mapped_device * md )
{
struct queue_limits * limits = dm_get_queue_limits ( md ) ;
/* device doesn't really support WRITE ZEROES, disable it */
limits - > max_write_zeroes_sectors = 0 ;
}
2021-02-10 23:26:23 +03:00
static bool swap_bios_limit ( struct dm_target * ti , struct bio * bio )
{
return unlikely ( ( bio - > bi_opf & REQ_SWAP ) ! = 0 ) & & unlikely ( ti - > limit_swap_bios ) ;
}
2015-07-20 16:29:37 +03:00
static void clone_endio ( struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2017-06-03 10:38:06 +03:00
blk_status_t error = bio - > bi_status ;
2022-02-02 19:00:58 +03:00
struct dm_target_io * tio = clone_to_tio ( bio ) ;
2022-03-26 20:46:06 +03:00
struct dm_target * ti = tio - > ti ;
dm_endio_fn endio = ti - > type - > end_io ;
2009-03-16 20:44:36 +03:00
struct dm_io * io = tio - > io ;
2022-03-26 20:46:06 +03:00
struct mapped_device * md = io - > md ;
2005-04-17 02:20:36 +04:00
2022-05-11 16:38:38 +03:00
if ( likely ( bio - > bi_bdev ! = md - > disk - > part0 ) ) {
struct request_queue * q = bdev_get_queue ( bio - > bi_bdev ) ;
2014-06-02 23:50:06 +04:00
2022-05-11 16:38:38 +03:00
if ( unlikely ( error = = BLK_STS_TARGET ) ) {
if ( bio_op ( bio ) = = REQ_OP_DISCARD & &
! bdev_max_discard_sectors ( bio - > bi_bdev ) )
disable_discard ( md ) ;
else if ( bio_op ( bio ) = = REQ_OP_WRITE_ZEROES & &
! q - > limits . max_write_zeroes_sectors )
disable_write_zeroes ( md ) ;
}
if ( static_branch_unlikely ( & zoned_enabled ) & &
unlikely ( blk_queue_is_zoned ( q ) ) )
dm_zone_endio ( io , bio ) ;
}
2020-06-19 09:59:04 +03:00
2017-06-03 10:38:03 +03:00
if ( endio ) {
2022-03-26 20:46:06 +03:00
int r = endio ( ti , bio , & error ) ;
2017-06-03 10:38:03 +03:00
switch ( r ) {
case DM_ENDIO_REQUEUE :
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & zoned_enabled ) ) {
/*
* Requeuing writes to a sequential zone of a zoned
* target will break the sequential write pattern :
* fail such IO .
*/
if ( WARN_ON_ONCE ( dm_is_zone_write ( md , bio ) ) )
error = BLK_STS_IOERR ;
else
error = BLK_STS_DM_REQUEUE ;
} else
2021-05-26 00:24:58 +03:00
error = BLK_STS_DM_REQUEUE ;
2020-08-24 01:36:59 +03:00
fallthrough ;
2017-06-03 10:38:03 +03:00
case DM_ENDIO_DONE :
break ;
case DM_ENDIO_INCOMPLETE :
/* The target will handle the io */
return ;
default :
DMWARN ( " unimplemented target endio return value: %d " , r ) ;
BUG ( ) ;
}
}
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & swap_bios_enabled ) & &
unlikely ( swap_bios_limit ( ti , bio ) ) )
2021-02-10 23:26:23 +03:00
up ( & md - > swap_bios_semaphore ) ;
2022-02-02 19:01:03 +03:00
free_tio ( bio ) ;
2021-05-26 00:24:59 +03:00
dm_io_dec_pending ( io , error ) ;
2005-04-17 02:20:36 +04:00
}
2010-08-12 07:14:10 +04:00
/*
* Return maximum size of I / O possible at the supplied sector up to the current
* target boundary .
*/
2020-09-19 20:12:48 +03:00
static inline sector_t max_io_len_target_boundary ( struct dm_target * ti ,
sector_t target_offset )
2010-08-12 07:14:10 +04:00
{
return ti - > len - target_offset ;
}
2020-09-19 20:12:48 +03:00
static sector_t max_io_len ( struct dm_target * ti , sector_t sector )
2005-04-17 02:20:36 +04:00
{
2020-09-19 20:12:48 +03:00
sector_t target_offset = dm_target_offset ( ti , sector ) ;
sector_t len = max_io_len_target_boundary ( ti , target_offset ) ;
2020-09-19 03:22:30 +03:00
sector_t max_len ;
2005-04-17 02:20:36 +04:00
/*
2020-11-30 18:57:43 +03:00
* Does the target need to split IO even further ?
* - varied ( per target ) IO splitting is a tenet of DM ; this
* explains why stacked chunk_sectors based splitting via
* blk_max_size_offset ( ) isn ' t possible here . So pass in
* ti - > max_io_len to override stacked chunk_sectors .
2005-04-17 02:20:36 +04:00
*/
2020-11-30 18:57:43 +03:00
if ( ti - > max_io_len ) {
max_len = blk_max_size_offset ( ti - > table - > md - > queue ,
target_offset , ti - > max_io_len ) ;
if ( len > max_len )
len = max_len ;
}
2005-04-17 02:20:36 +04:00
return len ;
}
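The clamping done by max_io_len() can be sketched in user space as below. This is a simplified model, not the kernel code: `sketch_max_io_len` and its parameters are hypothetical names, and the chunk arithmetic (distance from the offset to the next multiple of the chunk size) is an assumption about what a chunk-boundary limit amounts to, not a copy of blk_max_size_offset().

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Hypothetical user-space model of max_io_len():
 *  - target_len:    length of the target in sectors (models ti->len)
 *  - target_offset: offset of the I/O within the target
 *  - chunk:         per-target split size in sectors (models ti->max_io_len),
 *                   0 means "no extra splitting"
 */
static sector_t sketch_max_io_len(sector_t target_len, sector_t target_offset,
				  uint32_t chunk)
{
	sector_t len;

	assert(target_offset < target_len);

	/* Never cross the end of the target. */
	len = target_len - target_offset;

	if (chunk) {
		/* Sectors left before the next chunk boundary (assumed model). */
		sector_t max_len = chunk - (target_offset % chunk);

		if (len > max_len)
			len = max_len;
	}
	return len;
}
```

With a 1000-sector target and a 128-sector chunk, an I/O at offset 100 would be limited to 28 sectors under this model; with no chunk set it could run to the target boundary (900 sectors).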
2012-07-27 18:08:00 +04:00
int dm_set_target_max_io_len ( struct dm_target * ti , sector_t len )
{
if ( len > UINT_MAX ) {
DMERR ( " Specified maximum size of target IO (%llu) exceeds limit (%u) " ,
( unsigned long long ) len , UINT_MAX ) ;
ti - > error = " Maximum size of target IO is too large " ;
return - EINVAL ;
}
2019-03-21 23:46:12 +03:00
ti - > max_io_len = ( uint32_t ) len ;
2012-07-27 18:08:00 +04:00
return 0 ;
}
EXPORT_SYMBOL_GPL ( dm_set_target_max_io_len ) ;
2017-04-12 22:35:44 +03:00
static struct dm_target * dm_dax_get_live_target ( struct mapped_device * md ,
2018-04-30 23:06:28 +03:00
sector_t sector , int * srcu_idx )
__acquires ( md - > io_barrier )
2016-06-23 02:54:53 +03:00
{
struct dm_table * map ;
struct dm_target * ti ;
2017-04-12 22:35:44 +03:00
map = dm_get_live_table ( md , srcu_idx ) ;
2016-06-23 02:54:53 +03:00
if ( ! map )
2017-04-12 22:35:44 +03:00
return NULL ;
2016-06-23 02:54:53 +03:00
ti = dm_table_find_target ( map , sector ) ;
2019-08-23 16:55:26 +03:00
if ( ! ti )
2017-04-12 22:35:44 +03:00
return NULL ;
2016-06-23 02:54:53 +03:00
2017-04-12 22:35:44 +03:00
return ti ;
}
2016-06-23 02:54:53 +03:00
2017-04-12 22:35:44 +03:00
static long dm_dax_direct_access ( struct dax_device * dax_dev , pgoff_t pgoff ,
2022-05-14 01:10:58 +03:00
long nr_pages , enum dax_access_mode mode , void * * kaddr ,
pfn_t * pfn )
2017-04-12 22:35:44 +03:00
{
struct mapped_device * md = dax_get_private ( dax_dev ) ;
sector_t sector = pgoff * PAGE_SECTORS ;
struct dm_target * ti ;
long len , ret = - EIO ;
int srcu_idx ;
2016-06-23 02:54:53 +03:00
2017-04-12 22:35:44 +03:00
ti = dm_dax_get_live_target ( md , sector , & srcu_idx ) ;
2016-06-23 02:54:53 +03:00
2017-04-12 22:35:44 +03:00
if ( ! ti )
goto out ;
if ( ! ti - > type - > direct_access )
goto out ;
2020-09-19 20:12:48 +03:00
len = max_io_len ( ti , sector ) / PAGE_SECTORS ;
2017-04-12 22:35:44 +03:00
if ( len < 1 )
goto out ;
nr_pages = min ( len , nr_pages ) ;
2022-05-14 01:10:58 +03:00
ret = ti - > type - > direct_access ( ti , pgoff , nr_pages , mode , kaddr , pfn ) ;
2017-04-12 23:37:44 +03:00
2017-04-12 22:35:44 +03:00
out :
2016-06-23 02:54:53 +03:00
dm_put_live_table ( md , srcu_idx ) ;
2017-04-12 22:35:44 +03:00
return ret ;
2016-06-23 02:54:53 +03:00
}
2020-02-28 19:34:54 +03:00
static int dm_dax_zero_page_range ( struct dax_device * dax_dev , pgoff_t pgoff ,
size_t nr_pages )
{
struct mapped_device * md = dax_get_private ( dax_dev ) ;
sector_t sector = pgoff * PAGE_SECTORS ;
struct dm_target * ti ;
int ret = - EIO ;
int srcu_idx ;
ti = dm_dax_get_live_target ( md , sector , & srcu_idx ) ;
if ( ! ti )
goto out ;
if ( WARN_ON ( ! ti - > type - > dax_zero_page_range ) ) {
/*
* - > zero_page_range ( ) is a mandatory dax operation . If we are
* here , something is wrong .
*/
goto out ;
}
ret = ti - > type - > dax_zero_page_range ( ti , pgoff , nr_pages ) ;
out :
dm_put_live_table ( md , srcu_idx ) ;
return ret ;
}
2022-04-23 01:45:06 +03:00
static size_t dm_dax_recovery_write ( struct dax_device * dax_dev , pgoff_t pgoff ,
void * addr , size_t bytes , struct iov_iter * i )
{
struct mapped_device * md = dax_get_private ( dax_dev ) ;
sector_t sector = pgoff * PAGE_SECTORS ;
struct dm_target * ti ;
int srcu_idx ;
long ret = 0 ;
ti = dm_dax_get_live_target ( md , sector , & srcu_idx ) ;
if ( ! ti | | ! ti - > type - > dax_recovery_write )
goto out ;
ret = ti - > type - > dax_recovery_write ( ti , pgoff , addr , bytes , i ) ;
out :
dm_put_live_table ( md , srcu_idx ) ;
return ret ;
}
2014-03-15 02:41:24 +04:00
/*
* A target may call dm_accept_partial_bio only from the map routine . It is
2021-05-26 00:24:54 +03:00
* allowed for all bio types except REQ_PREFLUSH , REQ_OP_ZONE_ * zone management
2022-02-18 07:40:30 +03:00
* operations , REQ_OP_ZONE_APPEND ( zone append writes ) and any bio serviced by
* __send_duplicate_bios ( ) .
2014-03-15 02:41:24 +04:00
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
* sent in a next bio .
*
* A diagram that explains the arithmetic :
* + - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - - - - - - - +
* | 1 | 2 | 3 |
* + - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - - - - - - - +
*
* < - - - - - - - - - - - - - - * tio - > len_ptr - - - - - - - - - - - - - - - >
2022-04-15 10:04:59 +03:00
* < - - - - - bio_sectors - - - - - >
2014-03-15 02:41:24 +04:00
* < - - n_sectors - - >
*
* Region 1 was already iterated over with bio_advance or similar function .
* ( it may be empty if the target doesn ' t use bio_advance )
* Region 2 is the remaining bio size that the target wants to process .
* ( it may be empty if region 1 is non - empty , although there is no reason
* to make it empty )
* The target requires that region 3 is to be sent in the next bio .
*
* If the target wants to receive multiple copies of the bio ( via num_ * bios , etc ) ,
* the partially processed part ( the sum of regions 1 + 2 ) must be the same for all
* copies of the bio .
*/
void dm_accept_partial_bio ( struct bio * bio , unsigned n_sectors )
{
2022-02-02 19:00:58 +03:00
struct dm_target_io * tio = clone_to_tio ( bio ) ;
2022-04-15 10:04:59 +03:00
unsigned bio_sectors = bio_sectors ( bio ) ;
2021-05-26 00:24:54 +03:00
2022-03-20 01:04:20 +03:00
BUG_ON ( dm_tio_flagged ( tio , DM_TIO_IS_DUPLICATE_BIO ) ) ;
2021-05-26 00:24:54 +03:00
BUG_ON ( op_is_zone_mgmt ( bio_op ( bio ) ) ) ;
BUG_ON ( bio_op ( bio ) = = REQ_OP_ZONE_APPEND ) ;
2022-04-15 10:04:59 +03:00
BUG_ON ( bio_sectors > * tio - > len_ptr ) ;
BUG_ON ( n_sectors > bio_sectors ) ;
2021-05-26 00:24:54 +03:00
2022-04-15 10:04:59 +03:00
* tio - > len_ptr - = bio_sectors - n_sectors ;
2014-03-15 02:41:24 +04:00
bio - > bi_iter . bi_size = n_sectors < < SECTOR_SHIFT ;
2022-04-12 11:56:13 +03:00
/*
* __split_and_process_bio ( ) may have already saved mapped part
* for accounting but it is being reduced so update accordingly .
*/
dm_io_set_flag ( tio - > io , DM_IO_WAS_SPLIT ) ;
tio - > io - > sectors = n_sectors ;
2014-03-15 02:41:24 +04:00
}
EXPORT_SYMBOL_GPL ( dm_accept_partial_bio ) ;
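The arithmetic in the diagram above can be worked through with a small user-space model. The struct and function names here are hypothetical stand-ins; only the subtraction mirrors what dm_accept_partial_bio() does to *tio->len_ptr and bi_size.

```c
#include <assert.h>

/* Hypothetical stand-in for the fields dm_accept_partial_bio() touches. */
struct sketch_tio {
	unsigned len;         /* models *tio->len_ptr: regions 1 + 2 + 3 */
	unsigned bio_sectors; /* models bio_sectors(bio): regions 2 + 3 */
};

static void sketch_accept_partial(struct sketch_tio *t, unsigned n_sectors)
{
	/* Same sanity checks as the BUG_ON()s in the original. */
	assert(t->bio_sectors <= t->len);
	assert(n_sectors <= t->bio_sectors);

	/* Drop region 3 from the accounted length. */
	t->len -= t->bio_sectors - n_sectors;
	/* The bio now covers only region 2. */
	t->bio_sectors = n_sectors;
}
```

For example, with region 1 = 40, region 2 = 20 and region 3 = 40 sectors, the model starts at len = 100 and bio_sectors = 60; accepting n_sectors = 20 leaves len = 60 (regions 1 + 2) and bio_sectors = 20.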
2022-02-18 07:40:32 +03:00
/*
* @ clone : clone bio that DM core passed to target ' s . map function
* @ tgt_clone : clone of @ clone bio that target needs submitted
*
* Targets should use this interface to submit bios they take
* ownership of when returning DM_MAPIO_SUBMITTED .
*
* Target should also enable ti - > accounts_remapped_io
*/
2022-03-10 19:45:58 +03:00
void dm_submit_bio_remap ( struct bio * clone , struct bio * tgt_clone )
2022-02-18 07:40:32 +03:00
{
struct dm_target_io * tio = clone_to_tio ( clone ) ;
struct dm_io * io = tio - > io ;
/* establish bio that will get submitted */
if ( ! tgt_clone )
tgt_clone = clone ;
/*
* Account io - > origin_bio to DM dev on behalf of target
* that took ownership of IO with DM_MAPIO_SUBMITTED .
*/
2022-04-16 03:08:23 +03:00
dm_start_io_acct ( io , clone ) ;
2022-02-18 07:40:32 +03:00
2022-04-16 03:08:23 +03:00
trace_block_bio_remap ( tgt_clone , disk_devt ( io - > md - > disk ) ,
2022-02-18 07:40:32 +03:00
tio - > old_sector ) ;
2022-04-16 03:08:23 +03:00
submit_bio_noacct ( tgt_clone ) ;
2022-02-18 07:40:32 +03:00
}
EXPORT_SYMBOL_GPL ( dm_submit_bio_remap ) ;
2021-02-10 23:26:23 +03:00
static noinline void __set_swap_bios_limit ( struct mapped_device * md , int latch )
{
mutex_lock ( & md - > swap_bios_lock ) ;
while ( latch < md - > swap_bios ) {
cond_resched ( ) ;
down ( & md - > swap_bios_semaphore ) ;
md - > swap_bios - - ;
}
while ( latch > md - > swap_bios ) {
cond_resched ( ) ;
up ( & md - > swap_bios_semaphore ) ;
md - > swap_bios + + ;
}
mutex_unlock ( & md - > swap_bios_lock ) ;
}
2022-02-02 19:01:02 +03:00
static void __map_bio ( struct bio * clone )
2005-04-17 02:20:36 +04:00
{
2022-02-02 19:01:02 +03:00
struct dm_target_io * tio = clone_to_tio ( clone ) ;
2013-03-02 02:45:46 +04:00
struct dm_target * ti = tio - > ti ;
2022-03-26 20:46:06 +03:00
struct dm_io * io = tio - > io ;
struct mapped_device * md = io - > md ;
int r ;
2005-04-17 02:20:36 +04:00
clone - > bi_end_io = clone_endio ;
/*
2022-02-18 07:40:32 +03:00
* Map the clone .
2005-04-17 02:20:36 +04:00
*/
2022-02-18 07:40:23 +03:00
tio - > old_sector = clone - > bi_iter . bi_sector ;
2017-02-15 19:26:10 +03:00
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & swap_bios_enabled ) & &
unlikely ( swap_bios_limit ( ti , clone ) ) ) {
2021-02-10 23:26:23 +03:00
int latch = get_swap_bios ( ) ;
if ( unlikely ( latch ! = md - > swap_bios ) )
__set_swap_bios_limit ( md , latch ) ;
down ( & md - > swap_bios_semaphore ) ;
}
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & zoned_enabled ) ) {
/*
* Check if the IO needs a special mapping due to zone append
* emulation on zoned target . In this case , dm_zone_map_bio ( )
* calls the target map operation .
*/
if ( unlikely ( dm_emulate_zone_append ( md ) ) )
r = dm_zone_map_bio ( tio ) ;
else
r = ti - > type - > map ( ti , clone ) ;
} else
dm: introduce zone append emulation
For zoned targets that cannot support zone append operations, implement
an emulation using regular write operations. If the original BIO
submitted by the user is a zone append operation, change its clone into
a regular write operation directed at the target zone write pointer
position.
To do so, an array of write pointer offsets (write pointer position
relative to the start of a zone) is added to struct mapped_device. All
operations that modify a sequential zone write pointer (writes, zone
reset, zone finish and zone append) are intercepted in __map_bio() and
processed using the new functions dm_zone_map_bio().
Detection of the target ability to natively support zone append
operations is done from dm_table_set_restrictions() by calling the
function dm_set_zones_restrictions(). A target that does not support
zone append operation, either by explicitly declaring it using the new
struct dm_target field zone_append_not_supported, or because the device
table contains a non-zoned device, has its mapped device marked with the
new flag DMF_ZONE_APPEND_EMULATED. The helper function
dm_emulate_zone_append() is introduced to test a mapped device for this
new flag.
Atomicity of the zones write pointer tracking and updates is done using
a zone write locking mechanism based on a bitmap. This is similar to
the block layer method but based on BIOs rather than struct request.
A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
an operation type that changes the BIO target zone write pointer
position. The zone write lock is released if the clone BIO is failed
before submission or when dm_zone_endio() is called when the clone BIO
completes.
The zone write lock bitmap of the mapped device, together with a bitmap
indicating zone types (conv_zones_bitmap) and the write pointer offset
array (zwp_offset) are allocated and initialized with a full device zone
report in dm_set_zones_restrictions() using the function
dm_revalidate_zones().
For failed operations that may have modified a zone write pointer, the
zone write pointer offset is marked as invalid in dm_zone_endio().
Zones with an invalid write pointer offset are checked and the write
pointer updated using an internal report zone operation when the
faulty zone is accessed again by the user.
All functions added for this emulation have a minimal overhead for
zoned targets natively supporting zone append operations. Regular
device targets are also not affected. The added code also does not
impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
dm zone related functions.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-26 00:25:00 +03:00
r = ti - > type - > map ( ti , clone ) ;
2017-06-03 10:38:02 +03:00
switch ( r ) {
case DM_MAPIO_SUBMITTED :
2022-02-18 07:40:32 +03:00
/* target has assumed ownership of this io */
if ( ! ti - > accounts_remapped_io )
2022-04-16 03:08:23 +03:00
dm_start_io_acct ( io , clone ) ;
2017-06-03 10:38:02 +03:00
break ;
case DM_MAPIO_REMAPPED :
2022-04-16 03:08:23 +03:00
dm_submit_bio_remap ( clone , NULL ) ;
2017-06-03 10:38:02 +03:00
break ;
case DM_MAPIO_KILL :
case DM_MAPIO_REQUEUE :
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & swap_bios_enabled ) & &
unlikely ( swap_bios_limit ( ti , clone ) ) )
2022-03-26 20:46:06 +03:00
up ( & md - > swap_bios_semaphore ) ;
2022-02-02 19:01:03 +03:00
free_tio ( clone ) ;
2022-02-18 07:40:14 +03:00
if ( r = = DM_MAPIO_KILL )
dm_io_dec_pending ( io , BLK_STS_IOERR ) ;
else
dm_io_dec_pending ( io , BLK_STS_DM_REQUEUE ) ;
2017-06-03 10:38:02 +03:00
break ;
default :
2006-12-08 13:41:05 +03:00
DMWARN ( " unimplemented target map return value: %d " , r ) ;
BUG ( ) ;
2005-04-17 02:20:36 +04:00
}
}
2022-04-12 11:56:13 +03:00
static void setup_split_accounting ( struct clone_info * ci , unsigned len )
{
struct dm_io * io = ci - > io ;
if ( ci - > sector_count > len ) {
/*
* Split needed , save the mapped part for accounting .
* NOTE : dm_accept_partial_bio ( ) will update accordingly .
*/
dm_io_set_flag ( io , DM_IO_WAS_SPLIT ) ;
io - > sectors = len ;
}
if ( static_branch_unlikely ( & stats_enabled ) & &
unlikely ( dm_stats_used ( & io - > md - > stats ) ) ) {
/*
* Save bi_sector in terms of its offset from end of
* original bio , only needed for DM - stats ' benefit .
* - saved regardless of whether split needed so that
* dm_accept_partial_bio ( ) doesn ' t need to .
*/
io - > sector_offset = bio_end_sector ( ci - > bio ) - ci - > sector ;
}
}
2017-11-22 22:56:12 +03:00
static void alloc_multiple_bios ( struct bio_list * blist , struct clone_info * ci ,
2022-04-14 18:52:54 +03:00
struct dm_target * ti , unsigned num_bios )
2009-06-22 13:12:20 +04:00
{
2022-02-02 19:01:03 +03:00
struct bio * bio ;
2017-11-22 22:56:12 +03:00
int try ;
2012-10-13 00:02:15 +04:00
2017-11-22 22:56:12 +03:00
for ( try = 0 ; try < 2 ; try + + ) {
int bio_nr ;
if ( try )
2017-12-15 00:30:42 +03:00
mutex_lock ( & ci - > io - > md - > table_devices_lock ) ;
2017-11-22 22:56:12 +03:00
for ( bio_nr = 0 ; bio_nr < num_bios ; bio_nr + + ) {
2022-04-14 18:52:54 +03:00
bio = alloc_tio ( ci , ti , bio_nr , NULL ,
2022-02-02 19:01:01 +03:00
try ? GFP_NOIO : GFP_NOWAIT ) ;
2022-02-02 19:01:03 +03:00
if ( ! bio )
2017-11-22 22:56:12 +03:00
break ;
2022-02-02 19:01:03 +03:00
bio_list_add ( blist , bio ) ;
2017-11-22 22:56:12 +03:00
}
if ( try )
2017-12-15 00:30:42 +03:00
mutex_unlock ( & ci - > io - > md - > table_devices_lock ) ;
2017-11-22 22:56:12 +03:00
if ( bio_nr = = num_bios )
return ;
2022-02-02 19:00:58 +03:00
while ( ( bio = bio_list_pop ( blist ) ) )
2022-02-02 19:01:03 +03:00
free_tio ( bio ) ;
2017-11-22 22:56:12 +03:00
}
2009-06-22 13:12:21 +04:00
}
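The two-pass strategy in alloc_multiple_bios() (an opportunistic non-blocking pass, then an all-or-nothing retry that may block) can be sketched in user space as follows. Everything here is hypothetical: `sketch_pool` is a mock allocator whose non-blocking mode fails once a budget is exhausted, and the serialization the kernel code gets from table_devices_lock on the second pass is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocator: the non-blocking mode fails once 'budget' runs out. */
struct sketch_pool {
	unsigned budget;
};

static void *sketch_alloc(struct sketch_pool *p, bool may_block)
{
	if (!may_block) {
		if (!p->budget)
			return NULL;
		p->budget--;
	}
	return malloc(1);
}

/*
 * Fill out[0..n-1] all-or-nothing: pass 0 is opportunistic, pass 1 may block.
 * Returns the pass that succeeded (0 or 1), or -1 if both failed.
 */
static int sketch_alloc_multiple(struct sketch_pool *p, void **out, unsigned n)
{
	assert(p && out);

	for (int try = 0; try < 2; try++) {
		unsigned i;

		for (i = 0; i < n; i++) {
			out[i] = sketch_alloc(p, try != 0);
			if (!out[i])
				break;
		}
		if (i == n)
			return try;
		/* Partial result: release everything before retrying. */
		while (i--)
			free(out[i]);
	}
	return -1;
}
```

The point of freeing the partial result before the second pass is that a caller holding some of the objects while waiting for the rest could starve another caller doing the same thing.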
2022-04-12 11:56:15 +03:00
static int __send_duplicate_bios ( struct clone_info * ci , struct dm_target * ti ,
2014-03-15 02:41:24 +04:00
unsigned num_bios , unsigned * len )
2010-08-12 07:14:09 +04:00
{
2017-11-22 22:56:12 +03:00
struct bio_list blist = BIO_EMPTY_LIST ;
2022-02-02 19:01:00 +03:00
struct bio * clone ;
2022-04-12 11:56:15 +03:00
int ret = 0 ;
2010-08-12 07:14:09 +04:00
2022-02-02 19:01:04 +03:00
switch ( num_bios ) {
case 0 :
break ;
case 1 :
2022-04-12 11:56:13 +03:00
if ( len )
setup_split_accounting ( ci , * len ) ;
2022-02-02 19:01:04 +03:00
clone = alloc_tio ( ci , ti , 0 , len , GFP_NOIO ) ;
2022-02-02 19:01:02 +03:00
__map_bio ( clone ) ;
2022-04-12 11:56:15 +03:00
ret = 1 ;
2022-02-02 19:01:04 +03:00
break ;
default :
2022-04-14 18:52:54 +03:00
/* dm_accept_partial_bio() is not supported with shared tio->len_ptr */
alloc_multiple_bios ( & blist , ci , ti , num_bios ) ;
2022-02-02 19:01:04 +03:00
while ( ( clone = bio_list_pop ( & blist ) ) ) {
2022-03-20 01:04:20 +03:00
dm_tio_set_flag ( clone_to_tio ( clone ) , DM_TIO_IS_DUPLICATE_BIO ) ;
2022-02-02 19:01:04 +03:00
__map_bio ( clone ) ;
2022-04-12 11:56:15 +03:00
ret + = 1 ;
2022-02-02 19:01:04 +03:00
}
break ;
2017-11-22 22:56:12 +03:00
}
2022-04-12 11:56:15 +03:00
return ret ;
2010-08-12 07:14:09 +04:00
}
2022-03-11 00:25:28 +03:00
static void __send_empty_flush ( struct clone_info * ci )
2009-06-22 13:12:20 +04:00
{
2010-08-12 07:14:09 +04:00
unsigned target_nr = 0 ;
2009-06-22 13:12:20 +04:00
struct dm_target * ti ;
2020-09-14 20:59:53 +03:00
struct bio flush_bio ;
/*
* Use an on - stack bio for this , it ' s safe since we don ' t
* need to reference it after submit . It ' s just used as
* the basis for the clone ( s ) .
*/
2022-01-24 12:11:06 +03:00
bio_init ( & flush_bio , ci - > io - > md - > disk - > part0 , NULL , 0 ,
REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC ) ;
2020-11-07 11:30:05 +03:00
2020-09-14 20:59:53 +03:00
ci - > bio = & flush_bio ;
ci - > sector_count = 0 ;
2022-04-15 11:45:13 +03:00
ci - > io - > tio . clone . bi_iter . bi_size = 0 ;
2009-06-22 13:12:20 +04:00
2022-04-12 11:56:15 +03:00
while ( ( ti = dm_table_get_target ( ci - > map , target_nr + + ) ) ) {
int bios ;
atomic_add ( ti - > num_flush_bios , & ci - > io - > io_count ) ;
bios = __send_duplicate_bios ( ci , ti , ti - > num_flush_bios , NULL ) ;
atomic_sub ( ti - > num_flush_bios - bios , & ci - > io - > io_count ) ;
}
/*
* alloc_io ( ) takes one extra reference for submission , so the
* reference won ' t reach 0 without the following subtraction
*/
atomic_sub ( 1 , & ci - > io - > io_count ) ;
2020-09-14 20:59:53 +03:00
bio_uninit ( ci - > bio ) ;
2009-06-22 13:12:20 +04:00
}
2022-02-18 07:40:30 +03:00
static void __send_changing_extent_only ( struct clone_info * ci , struct dm_target * ti ,
unsigned num_bios )
2012-09-27 02:45:42 +04:00
{
2019-05-21 22:58:07 +03:00
unsigned len ;
2022-04-12 11:56:15 +03:00
int bios ;
2012-09-27 02:45:42 +04:00
2020-09-19 20:12:48 +03:00
len = min_t ( sector_t , ci - > sector_count ,
max_io_len_target_boundary ( ti , dm_target_offset ( ti , ci - > sector ) ) ) ;
2019-05-21 22:58:07 +03:00
2022-04-12 11:56:15 +03:00
atomic_add ( num_bios , & ci - > io - > io_count ) ;
bios = __send_duplicate_bios ( ci , ti , num_bios , & len ) ;
/*
* alloc_io ( ) takes one extra reference for submission , so the
* reference won ' t reach 0 without the following ( + 1 ) subtraction
*/
atomic_sub ( num_bios - bios + 1 , & ci - > io - > io_count ) ;
2022-04-14 18:52:54 +03:00
2017-12-08 23:02:11 +03:00
ci - > sector + = len ;
ci - > sector_count - = len ;
2012-09-27 02:45:42 +04:00
}
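The reference bookkeeping in __send_changing_extent_only() is easier to follow with numbers. The helper below is a hypothetical model that ignores completions; it assumes the count starts at 1 because, per the comment above, alloc_io() takes one extra reference for submission.

```c
#include <assert.h>

/*
 * Hypothetical model of the io_count arithmetic:
 *  - io_count: value before submission (1 right after alloc_io())
 *  - num_bios: clones the target asked for
 *  - sent:     clones actually mapped by __send_duplicate_bios()
 */
static int sketch_io_count_after_submit(int io_count, int num_bios, int sent)
{
	assert(sent >= 0 && sent <= num_bios);

	io_count += num_bios;            /* one reference per planned clone */
	io_count -= num_bios - sent + 1; /* unsent clones + submission reference */
	return io_count;
}
```

Starting from 1 with four planned clones that are all sent, the model ends at 4: one reference per in-flight clone. If no clone is sent, it ends at 0.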
2019-01-18 22:10:37 +03:00
static bool is_abnormal_io ( struct bio * bio )
{
2022-04-17 20:00:15 +03:00
unsigned int op = bio_op ( bio ) ;
2019-01-18 22:10:37 +03:00
2022-04-17 20:00:15 +03:00
if ( op ! = REQ_OP_READ & & op ! = REQ_OP_WRITE & & op ! = REQ_OP_FLUSH ) {
switch ( op ) {
case REQ_OP_DISCARD :
case REQ_OP_SECURE_ERASE :
case REQ_OP_WRITE_ZEROES :
return true ;
default :
break ;
}
2019-01-18 22:10:37 +03:00
}
2022-04-17 20:00:15 +03:00
return false ;
2019-01-18 22:10:37 +03:00
}
2022-04-17 20:00:15 +03:00
static blk_status_t __process_abnormal_io ( struct clone_info * ci ,
struct dm_target * ti )
2018-03-26 18:49:16 +03:00
{
2020-09-16 04:56:29 +03:00
unsigned num_bios = 0 ;
2018-03-26 18:49:16 +03:00
2022-02-18 07:40:30 +03:00
switch ( bio_op ( ci - > bio ) ) {
2020-09-16 04:56:29 +03:00
case REQ_OP_DISCARD :
num_bios = ti - > num_discard_bios ;
break ;
case REQ_OP_SECURE_ERASE :
num_bios = ti - > num_secure_erase_bios ;
break ;
case REQ_OP_WRITE_ZEROES :
num_bios = ti - > num_write_zeroes_bios ;
break ;
}
2018-03-26 18:49:16 +03:00
2022-02-18 07:40:30 +03:00
/*
* Even though the device advertised support for this type of
* request , that does not mean every target supports it , and
* reconfiguration might also have changed that since the
* check was performed .
*/
2022-03-17 20:52:06 +03:00
if ( unlikely ( ! num_bios ) )
2022-04-17 20:00:15 +03:00
return BLK_STS_NOTSUPP ;
__send_changing_extent_only ( ci , ti , num_bios ) ;
return BLK_STS_OK ;
2018-03-26 18:49:16 +03:00
}
2022-03-05 05:08:04 +03:00
/*
2022-04-12 11:56:16 +03:00
* Reuse - > bi_private as dm_io list head for storing all dm_io instances
2022-03-05 05:08:04 +03:00
* associated with this bio , and this bio ' s bi_private needs to be
* stored in dm_io - > data before the reuse .
*
* bio - > bi_private is owned by fs or upper layer , so block layer won ' t
* touch it after splitting . Meantime it won ' t be changed by anyone after
* bio is submitted . So this reuse is safe .
*/
2022-04-12 11:56:16 +03:00
static inline struct dm_io * * dm_poll_list_head ( struct bio * bio )
2022-03-05 05:08:04 +03:00
{
2022-04-12 11:56:16 +03:00
return ( struct dm_io * * ) & bio - > bi_private ;
2022-03-05 05:08:04 +03:00
}
static void dm_queue_poll_io ( struct bio * bio , struct dm_io * io )
{
2022-04-12 11:56:16 +03:00
struct dm_io * * head = dm_poll_list_head ( bio ) ;
2022-03-05 05:08:04 +03:00
if ( ! ( bio - > bi_opf & REQ_DM_POLL_LIST ) ) {
bio - > bi_opf | = REQ_DM_POLL_LIST ;
/*
* Save . bi_private into dm_io , so that we can reuse
2022-04-12 11:56:16 +03:00
* . bi_private as dm_io list head for storing dm_io list
2022-03-05 05:08:04 +03:00
*/
io - > data = bio - > bi_private ;
/* tell block layer to poll for completion */
bio - > bi_cookie = ~ BLK_QC_T_NONE ;
2022-04-12 11:56:16 +03:00
io - > next = NULL ;
2022-03-05 05:08:04 +03:00
} else {
/*
* bio recursed due to split , reuse original poll list ,
* and save bio - > bi_private too .
*/
2022-04-12 11:56:16 +03:00
io - > data = ( * head ) - > data ;
io - > next = * head ;
2022-03-05 05:08:04 +03:00
}
2022-04-12 11:56:16 +03:00
* head = io ;
2022-03-05 05:08:04 +03:00
}
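The reuse of bi_private as a list head can be modelled in user space as below. The types are hypothetical stand-ins (`on_poll_list` models the REQ_DM_POLL_LIST flag), and the sketch avoids the pointer-to-pointer cast used in the kernel so that it stays within strict ISO C aliasing rules.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_io {
	void *data;              /* saved owner-private pointer */
	struct sketch_io *next;  /* next dm_io on the poll list */
};

struct sketch_bio {
	void *private;           /* models bio->bi_private */
	int on_poll_list;        /* models the REQ_DM_POLL_LIST flag */
};

static void sketch_queue_poll_io(struct sketch_bio *bio, struct sketch_io *io)
{
	assert(bio && io);

	if (!bio->on_poll_list) {
		/* First io: preserve the owner's value before reusing the field. */
		bio->on_poll_list = 1;
		io->data = bio->private;
		io->next = NULL;
	} else {
		/* Later io: the field already holds the current list head. */
		struct sketch_io *head = bio->private;

		io->data = head->data;
		io->next = head;
	}
	bio->private = io;
}
```

Each queued io carries a copy of the saved pointer, so whichever entry ends up at the head can restore the original value once the list is drained.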
2013-03-02 02:45:47 +04:00
/*
* Select the correct strategy for processing a non - flush bio .
*/
2022-03-17 20:52:06 +03:00
static blk_status_t __split_and_process_bio ( struct clone_info * ci )
2015-02-26 08:50:28 +03:00
{
2022-02-18 07:40:09 +03:00
struct bio * clone ;
2007-12-13 17:15:25 +03:00
struct dm_target * ti ;
2013-10-30 04:17:49 +04:00
unsigned len ;
2015-02-26 08:50:28 +03:00
2007-12-13 17:15:25 +03:00
ti = dm_table_find_target ( ci - > map , ci - > sector ) ;
2022-04-17 20:00:15 +03:00
if ( unlikely ( ! ti ) )
return BLK_STS_IOERR ;
else if ( unlikely ( ci - > is_abnormal_io ) )
return __process_abnormal_io ( ci , ti ) ;
2017-12-08 23:02:11 +03:00
2022-03-05 05:08:04 +03:00
/*
* Only support bio polling for normal IO , and the target io is
* exactly inside the dm_io instance ( verified in dm_poll_dm_io )
*/
ci - > submit_as_polled = ci - > bio - > bi_opf & REQ_POLLED ;
2015-02-26 08:50:28 +03:00
2020-09-19 20:12:48 +03:00
len = min_t ( sector_t , max_io_len ( ti , ci - > sector ) , ci - > sector_count ) ;
2022-04-12 11:56:13 +03:00
setup_split_accounting ( ci , len ) ;
2022-02-18 07:40:09 +03:00
clone = alloc_tio ( ci , ti , 0 , & len , GFP_NOIO ) ;
__map_bio ( clone ) ;
2015-02-26 08:50:28 +03:00
2013-10-30 04:17:49 +04:00
ci - > sector + = len ;
ci - > sector_count - = len ;
2015-02-26 08:50:28 +03:00
2022-03-17 20:52:06 +03:00
return BLK_STS_OK ;
2015-02-26 08:50:28 +03:00
}
2017-12-09 23:16:42 +03:00
static void init_clone_info ( struct clone_info * ci , struct mapped_device * md ,
2022-04-17 20:00:15 +03:00
struct dm_table * map , struct bio * bio , bool is_abnormal )
2017-12-09 23:16:42 +03:00
{
ci - > map = map ;
ci - > io = alloc_io ( md , bio ) ;
2022-02-18 07:40:11 +03:00
ci - > bio = bio ;
2022-04-17 20:00:15 +03:00
ci - > is_abnormal_io = is_abnormal ;
2022-03-05 05:08:04 +03:00
ci - > submit_as_polled = false ;
2017-12-09 23:16:42 +03:00
ci - > sector = bio - > bi_iter . bi_sector ;
2022-02-18 07:40:11 +03:00
ci - > sector_count = bio_sectors ( bio ) ;
/* Shouldn't happen but sector_count was being set to 0 so... */
2022-03-26 21:14:00 +03:00
if ( static_branch_unlikely ( & zoned_enabled ) & &
WARN_ON_ONCE ( op_is_zone_mgmt ( bio_op ( bio ) ) & & ci - > sector_count ) )
2022-02-18 07:40:11 +03:00
ci - > sector_count = 0 ;
2017-12-09 23:16:42 +03:00
}
2005-04-17 02:20:36 +04:00
/*
2013-03-02 02:45:47 +04:00
* Entry point to split a bio into clones and submit them to the targets .
2005-04-17 02:20:36 +04:00
*/
2022-02-18 07:40:07 +03:00
static void dm_split_and_process_bio ( struct mapped_device * md ,
struct dm_table * map , struct bio * bio )
2015-02-26 08:50:28 +03:00
{
2005-04-17 02:20:36 +04:00
struct clone_info ci ;
2022-03-25 20:53:23 +03:00
struct dm_io * io ;
2022-03-17 20:52:06 +03:00
blk_status_t error = BLK_STS_OK ;
2022-04-17 20:00:15 +03:00
bool is_abnormal ;
is_abnormal = is_abnormal_io ( bio ) ;
if ( unlikely ( is_abnormal ) ) {
/*
* Use blk_queue_split ( ) for abnormal IO ( e . g . discard , etc )
* otherwise associated queue_limits won ' t be imposed .
*/
blk_queue_split ( & bio ) ;
}
2005-04-17 02:20:36 +04:00
2022-04-17 20:00:15 +03:00
init_clone_info ( & ci , md , map , bio , is_abnormal ) ;
2022-03-25 20:53:23 +03:00
io = ci . io ;
2015-02-26 08:50:28 +03:00
2016-08-06 00:35:16 +03:00
if ( bio - > bi_opf & REQ_PREFLUSH ) {
2022-03-11 00:25:28 +03:00
__send_empty_flush ( & ci ) ;
2022-03-10 00:07:06 +03:00
/* dm_io_complete submits any data associated with flush */
2022-02-18 07:40:11 +03:00
goto out ;
2010-09-03 13:56:19 +04:00
}
2015-02-26 08:50:28 +03:00
2022-02-18 07:40:11 +03:00
error = __split_and_process_bio ( & ci ) ;
if ( error | | ! ci . sector_count )
goto out ;
/*
* Remainder must be passed to submit_bio_noacct ( ) so it gets handled
* * after * bios already submitted have been completely processed .
*/
2022-04-12 11:56:13 +03:00
bio_trim ( bio , io - > sectors , ci . sector_count ) ;
trace_block_split ( bio , bio - > bi_iter . bi_sector ) ;
bio_inc_remaining ( bio ) ;
2022-02-18 07:40:11 +03:00
submit_bio_noacct ( bio ) ;
out :
2022-03-05 05:08:04 +03:00
/*
* Drop the extra reference count for non - POLLED bio , and hold one
* reference for POLLED bio , which will be released in dm_poll_bio
*
2022-04-12 11:56:16 +03:00
* Add every dm_io instance into the dm_io list head which is stored
* in bio - > bi_private , so that dm_poll_bio can poll them all .
2022-03-05 05:08:04 +03:00
*/
2022-04-12 11:56:15 +03:00
if ( error | | ! ci . submit_as_polled ) {
/*
* In case of submission failure , the extra reference for
* submitting io isn ' t consumed yet
*/
if ( error )
atomic_dec ( & io - > io_count ) ;
dm_io_dec_pending ( io , error ) ;
} else
2022-03-25 20:53:23 +03:00
dm_queue_poll_io ( bio , io ) ;
2015-02-26 08:50:28 +03:00
}
2021-10-12 14:12:24 +03:00
static void dm_submit_bio ( struct bio * bio )
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
This improves I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in those cases.
suspend
=======
Request-based dm uses stopping md->queue as suspend of the md.
For noflush suspend, just stops md->queue.
For flush suspend, inserts a marker request to the tail of md->queue.
And dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then, stops dispatching requests and waits
for all dispatched requests to complete.
After that, completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:35 +04:00
{
2021-01-24 13:02:34 +03:00
struct mapped_device * md = bio - > bi_bdev - > bd_disk - > private_data ;
2013-07-11 02:41:18 +04:00
int srcu_idx ;
struct dm_table * map ;
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
This helps I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in those cases.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
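The flush-suspend sequence described in the commit message above can be modelled in plain userspace C. This is only a sketch under stated assumptions: the queue is a flat int array, SK_MARKER and sk_flush_suspend() are hypothetical names, and "dispatching" a request just copies it to an output array. It is not the dm implementation.

```c
#include <stddef.h>

#define SK_MARKER (-1)	/* hypothetical marker; real requests must differ from it */

/*
 * Model of flush suspend: place the marker at the tail of queue[0..len),
 * then dispatch from the front until the marker reaches the front.
 * Dispatched requests are copied to out[], which needs room for len entries.
 * Returns the number dispatched, or -1 if there is no room for the marker.
 */
static int sk_flush_suspend(int *queue, size_t len, size_t cap, int *out)
{
	size_t n = 0;

	if (len >= cap)
		return -1;
	queue[len] = SK_MARKER;		/* marker goes to the tail */

	while (queue[n] != SK_MARKER) {	/* dispatch until the marker is in front */
		out[n] = queue[n];
		n++;
	}
	return (int)n;
}
```

Everything queued before the marker is dispatched, and nothing after it, which is what lets the suspend path know when the queue has drained.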
2009-06-22 13:12:35 +04:00
2022-03-27 04:08:36 +03:00
map = dm_get_live_table_bio ( md , & srcu_idx , bio ) ;
2010-09-08 20:07:00 +04:00
2022-02-22 21:28:12 +03:00
/* If suspended, or map not yet available, queue this IO for later */
if ( unlikely ( test_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ) | |
unlikely ( ! map ) ) {
2020-09-23 23:06:52 +03:00
if ( bio - > bi_opf & REQ_NOWAIT )
bio_wouldblock_error ( bio ) ;
2020-09-30 20:45:20 +03:00
else if ( bio - > bi_opf & REQ_RAHEAD )
2009-04-09 03:27:14 +04:00
bio_io_error ( bio ) ;
2020-09-30 20:45:20 +03:00
else
queue_io ( md , bio ) ;
goto out ;
2009-06-22 13:12:35 +04:00
}
2005-04-17 02:20:36 +04:00
2022-02-18 07:40:07 +03:00
dm_split_and_process_bio ( md , map , bio ) ;
2020-09-30 20:45:20 +03:00
out :
2022-03-27 04:08:36 +03:00
dm_put_live_table_bio ( md , srcu_idx , bio ) ;
2017-12-09 23:16:42 +03:00
}
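The branch structure of the submit path above can be summarised as a small decision function. The sketch below is userspace C under stated assumptions: SK_REQ_NOWAIT, SK_REQ_RAHEAD, enum sk_action and sk_submit_decision() are hypothetical stand-ins, and the comments name the kernel call each outcome corresponds to in the code above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the REQ_NOWAIT and REQ_RAHEAD bits of bi_opf. */
#define SK_REQ_NOWAIT (1u << 0)
#define SK_REQ_RAHEAD (1u << 1)

enum sk_action {
	SK_PROCESS,	/* dm_split_and_process_bio() */
	SK_WOULDBLOCK,	/* bio_wouldblock_error() */
	SK_IO_ERROR,	/* bio_io_error() */
	SK_DEFER	/* queue_io() */
};

/*
 * Decide what to do with a bio, given whether I/O is blocked for suspend,
 * whether a live table is available, and the bio's operation flags.
 */
static enum sk_action sk_submit_decision(bool blocked_for_suspend,
					 bool have_map, unsigned int opf)
{
	if (blocked_for_suspend || !have_map) {
		if (opf & SK_REQ_NOWAIT)
			return SK_WOULDBLOCK;	/* caller asked not to wait */
		if (opf & SK_REQ_RAHEAD)
			return SK_IO_ERROR;	/* readahead is best-effort */
		return SK_DEFER;		/* hold the bio until resume */
	}
	return SK_PROCESS;
}
```

The order of the checks matters: a bio carrying both flags is failed with the would-block error rather than the generic I/O error.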
2022-03-05 05:08:04 +03:00
static bool dm_poll_dm_io ( struct dm_io * io , struct io_comp_batch * iob ,
unsigned int flags )
{
2022-03-20 01:04:20 +03:00
WARN_ON_ONCE ( ! dm_tio_is_normal ( & io - > tio ) ) ;
2022-03-05 05:08:04 +03:00
/* don't poll if the mapped io is done */
if ( atomic_read ( & io - > io_count ) > 1 )
bio_poll ( & io - > tio . clone , iob , flags ) ;
/* bio_poll holds the last reference */
return atomic_read ( & io - > io_count ) = = 1 ;
}
static int dm_poll_bio ( struct bio * bio , struct io_comp_batch * iob ,
unsigned int flags )
{
2022-04-12 11:56:16 +03:00
struct dm_io * * head = dm_poll_list_head ( bio ) ;
struct dm_io * list = * head ;
struct dm_io * tmp = NULL ;
struct dm_io * curr , * next ;
2022-03-05 05:08:04 +03:00
/* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
if ( ! ( bio - > bi_opf & REQ_DM_POLL_LIST ) )
return 0 ;
2022-04-12 11:56:16 +03:00
WARN_ON_ONCE ( ! list ) ;
2022-03-05 05:08:04 +03:00
/*
* Restore . bi_private before possibly completing dm_io .
*
* bio_poll ( ) is only possible once @ bio has been completely
* submitted via submit_bio_noacct ( ) ' s depth - first submission .
* So there is no dm_queue_poll_io ( ) race associated with
* clearing REQ_DM_POLL_LIST here .
*/
bio - > bi_opf & = ~ REQ_DM_POLL_LIST ;
2022-04-12 11:56:16 +03:00
bio - > bi_private = list - > data ;
2022-03-05 05:08:04 +03:00
2022-04-12 11:56:16 +03:00
for ( curr = list , next = curr - > next ; curr ; curr = next , next =
curr ? curr - > next : NULL ) {
if ( dm_poll_dm_io ( curr , iob , flags ) ) {
2022-03-05 05:08:04 +03:00
/*
2022-03-17 20:52:06 +03:00
* clone_endio ( ) has already occurred , so no
* error handling is needed here .
2022-03-05 05:08:04 +03:00
*/
2022-04-12 11:56:16 +03:00
__dm_io_dec_pending ( curr ) ;
} else {
curr - > next = tmp ;
tmp = curr ;
2022-03-05 05:08:04 +03:00
}
}
/* Not done? */
2022-04-12 11:56:16 +03:00
if ( tmp ) {
2022-03-05 05:08:04 +03:00
bio - > bi_opf | = REQ_DM_POLL_LIST ;
/* Reset bio->bi_private to dm_io list head */
2022-04-12 11:56:16 +03:00
* head = tmp ;
2022-03-05 05:08:04 +03:00
return 0 ;
}
return 1 ;
}
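The list handling in dm_poll_bio() above walks a singly linked list, drops the entries that have finished, and pushes the rest onto a fresh list that becomes the new head. A minimal userspace sketch of that pattern follows. struct sk_io, its done flag (standing in for dm_poll_dm_io() returning true) and sk_filter_pending() are hypothetical names, and unlike the kernel loop this version tolerates an empty list.

```c
#include <stddef.h>
#include <stdbool.h>

struct sk_io {
	struct sk_io *next;
	bool done;	/* stand-in for dm_poll_dm_io() returning true */
	int id;
};

/*
 * Walk the list once. Finished entries are unlinked and counted in
 * *completed; unfinished entries are pushed onto a new list, which is
 * returned. Pushing at the head reverses the order of survivors.
 */
static struct sk_io *sk_filter_pending(struct sk_io *list, int *completed)
{
	struct sk_io *tmp = NULL, *curr, *next;

	for (curr = list; curr; curr = next) {
		next = curr->next;	/* save before relinking curr */
		if (curr->done) {
			(*completed)++;
			curr->next = NULL;
		} else {
			curr->next = tmp;
			tmp = curr;
		}
	}
	return tmp;
}
```

A NULL return corresponds to the "all done" case in which dm_poll_bio() returns 1; a non-NULL return is what gets stored back as the list head.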
2005-04-17 02:20:36 +04:00
/*-----------------------------------------------------------------
* An IDR is used to keep track of allocated minor numbers .
* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
2006-06-26 11:27:32 +04:00
static void free_minor ( int minor )
2005-04-17 02:20:36 +04:00
{
2006-06-26 11:27:22 +04:00
spin_lock ( & _minor_lock ) ;
2005-04-17 02:20:36 +04:00
idr_remove ( & _minor_idr , minor ) ;
2006-06-26 11:27:22 +04:00
spin_unlock ( & _minor_lock ) ;
2005-04-17 02:20:36 +04:00
}
/*
* See if the device with a specific minor # is free .
*/
2008-04-25 01:10:59 +04:00
static int specific_minor ( int minor )
2005-04-17 02:20:36 +04:00
{
2013-02-28 05:04:26 +04:00
int r ;
2005-04-17 02:20:36 +04:00
if ( minor > = ( 1 < < MINORBITS ) )
return - EINVAL ;
2013-02-28 05:04:26 +04:00
idr_preload ( GFP_KERNEL ) ;
2006-06-26 11:27:22 +04:00
spin_lock ( & _minor_lock ) ;
2005-04-17 02:20:36 +04:00
2013-02-28 05:04:26 +04:00
r = idr_alloc ( & _minor_idr , MINOR_ALLOCED , minor , minor + 1 , GFP_NOWAIT ) ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:27:22 +04:00
spin_unlock ( & _minor_lock ) ;
2013-02-28 05:04:26 +04:00
idr_preload_end ( ) ;
if ( r < 0 )
return r = = - ENOSPC ? - EBUSY : r ;
return 0 ;
2005-04-17 02:20:36 +04:00
}
2008-04-25 01:10:59 +04:00
static int next_free_minor ( int * minor )
2005-04-17 02:20:36 +04:00
{
2013-02-28 05:04:26 +04:00
int r ;
2006-06-26 11:27:21 +04:00
2013-02-28 05:04:26 +04:00
idr_preload ( GFP_KERNEL ) ;
2006-06-26 11:27:22 +04:00
spin_lock ( & _minor_lock ) ;
2005-04-17 02:20:36 +04:00
2013-02-28 05:04:26 +04:00
r = idr_alloc ( & _minor_idr , MINOR_ALLOCED , 0 , 1 < < MINORBITS , GFP_NOWAIT ) ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:27:22 +04:00
spin_unlock ( & _minor_lock ) ;
2013-02-28 05:04:26 +04:00
idr_preload_end ( ) ;
if ( r < 0 )
return r ;
* minor = r ;
return 0 ;
2005-04-17 02:20:36 +04:00
}
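The two allocation helpers above differ only in the range they ask the IDR for: a single slot [minor, minor + 1) versus the whole minor space. The userspace sketch below models that with a small boolean table. All sk_ names and the tiny SK_MINORBITS value are assumptions made for illustration, locking and preloading are omitted, and the extra negative-minor check exists only to keep the array access in bounds.

```c
#include <errno.h>
#include <stdbool.h>

#define SK_MINORBITS 4			/* hypothetical, far smaller than the kernel's */
#define SK_NR_MINORS (1 << SK_MINORBITS)

static bool sk_used[SK_NR_MINORS];

/* Claim the lowest free slot in [start, end), or return -ENOSPC. */
static int sk_alloc_range(int start, int end)
{
	for (int i = start; i < end; i++) {
		if (!sk_used[i]) {
			sk_used[i] = true;
			return i;
		}
	}
	return -ENOSPC;
}

/* Claim exactly this minor; a taken slot is reported as -EBUSY. */
static int sk_specific_minor(int minor)
{
	int r;

	if (minor < 0 || minor >= SK_NR_MINORS)
		return -EINVAL;
	r = sk_alloc_range(minor, minor + 1);
	if (r < 0)
		return r == -ENOSPC ? -EBUSY : r;
	return 0;
}

/* Claim any free minor and report it through *minor. */
static int sk_next_free_minor(int *minor)
{
	int r = sk_alloc_range(0, SK_NR_MINORS);

	if (r < 0)
		return r;
	*minor = r;
	return 0;
}

static void sk_free_minor(int minor)
{
	sk_used[minor] = false;
}
```

Translating -ENOSPC to -EBUSY in the single-slot case gives the caller a more accurate error: the range is not exhausted, the requested minor is simply in use.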
2009-09-22 04:01:13 +04:00
static const struct block_device_operations dm_blk_dops ;
2020-10-07 23:41:01 +03:00
static const struct block_device_operations dm_rq_blk_dops ;
2017-04-12 22:35:44 +03:00
static const struct dax_operations dm_dax_ops ;
2005-04-17 02:20:36 +04:00
2009-04-02 22:55:37 +04:00
static void dm_wq_work ( struct work_struct * work ) ;
2021-02-01 08:10:17 +03:00
# ifdef CONFIG_BLK_INLINE_ENCRYPTION
blk-crypto: rename blk_keyslot_manager to blk_crypto_profile
blk_keyslot_manager is misnamed because it doesn't necessarily manage
keyslots. It actually does several different things:
- Contains the crypto capabilities of the device.
- Provides functions to control the inline encryption hardware.
Originally these were just for programming/evicting keyslots;
however, new functionality (hardware-wrapped keys) will require new
functions here which are unrelated to keyslots. Moreover,
device-mapper devices already (ab)use "keyslot_evict" to pass key
eviction requests to their underlying devices even though
device-mapper devices don't have any keyslots themselves (so it
really should be "evict_key", not "keyslot_evict").
- Sometimes (but not always!) it manages keyslots. Originally it
always did, but device-mapper devices don't have keyslots
themselves, so they use a "passthrough keyslot manager" which
doesn't actually manage keyslots. This hack works, but the
terminology is unnatural. Also, some hardware doesn't have keyslots
and thus also uses a "passthrough keyslot manager" (support for such
hardware is yet to be upstreamed, but it will happen eventually).
Let's stop having keyslot managers which don't actually manage keyslots.
Instead, rename blk_keyslot_manager to blk_crypto_profile.
This is a fairly big change, since for consistency it also has to update
keyslot manager-related function names, variable names, and comments --
not just the actual struct name. However it's still a fairly
straightforward change, as it doesn't change any actual functionality.
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 21:04:52 +03:00
static void dm_queue_destroy_crypto_profile ( struct request_queue * q )
2021-02-01 08:10:17 +03:00
{
2021-10-18 21:04:52 +03:00
dm_destroy_crypto_profile ( q - > crypto_profile ) ;
2021-02-01 08:10:17 +03:00
}
# else /* CONFIG_BLK_INLINE_ENCRYPTION */
2021-10-18 21:04:52 +03:00
static inline void dm_queue_destroy_crypto_profile ( struct request_queue * q )
2021-02-01 08:10:17 +03:00
{
}
# endif /* !CONFIG_BLK_INLINE_ENCRYPTION */
2015-04-28 18:50:29 +03:00
static void cleanup_mapped_device ( struct mapped_device * md )
{
if ( md - > wq )
destroy_workqueue ( md - > wq ) ;
2022-06-08 09:34:06 +03:00
dm_free_md_mempools ( md - > mempools ) ;
2015-04-28 18:50:29 +03:00
2017-04-12 22:35:44 +03:00
if ( md - > dax_dev ) {
2021-11-29 13:21:38 +03:00
dax_remove_host ( md - > disk ) ;
2017-04-12 22:35:44 +03:00
kill_dax ( md - > dax_dev ) ;
put_dax ( md - > dax_dev ) ;
md - > dax_dev = NULL ;
}
2022-02-01 11:39:52 +03:00
dm_cleanup_zoned_dev ( md ) ;
2015-04-28 18:50:29 +03:00
if ( md - > disk ) {
spin_lock ( & _minor_lock ) ;
md - > disk - > private_data = NULL ;
spin_unlock ( & _minor_lock ) ;
2021-08-04 12:41:46 +03:00
if ( dm_get_md_type ( md ) ! = DM_TYPE_NONE ) {
dm_sysfs_exit ( md ) ;
del_gendisk ( md - > disk ) ;
}
2021-10-18 21:04:52 +03:00
dm_queue_destroy_crypto_profile ( md - > queue ) ;
2021-05-21 08:51:03 +03:00
blk_cleanup_disk ( md - > disk ) ;
2021-08-04 12:41:44 +03:00
}
2015-04-28 18:50:29 +03:00
2022-02-18 07:40:02 +03:00
if ( md - > pending_io ) {
free_percpu ( md - > pending_io ) ;
md - > pending_io = NULL ;
}
2016-10-10 15:35:19 +03:00
cleanup_srcu_struct ( & md - > io_barrier ) ;
2018-01-06 05:17:20 +03:00
mutex_destroy ( & md - > suspend_lock ) ;
mutex_destroy ( & md - > type_lock ) ;
mutex_destroy ( & md - > table_devices_lock ) ;
2021-02-10 23:26:23 +03:00
mutex_destroy ( & md - > swap_bios_lock ) ;
2018-01-06 05:17:20 +03:00
2016-05-12 23:28:10 +03:00
dm_mq_cleanup_mapped_device ( md ) ;
2015-04-28 18:50:29 +03:00
}
2005-04-17 02:20:36 +04:00
/*
* Allocate and initialise a blank device with a given minor .
*/
2006-06-26 11:27:32 +04:00
static struct mapped_device * alloc_dev ( int minor )
2005-04-17 02:20:36 +04:00
{
2016-02-22 20:16:21 +03:00
int r , numa_node_id = dm_get_numa_node ( ) ;
struct mapped_device * md ;
2006-06-26 11:27:21 +04:00
void * old_md ;
2005-04-17 02:20:36 +04:00
2017-11-01 02:33:02 +03:00
md = kvzalloc_node ( sizeof ( * md ) , GFP_KERNEL , numa_node_id ) ;
2005-04-17 02:20:36 +04:00
if ( ! md ) {
DMWARN ( " unable to allocate device, out of memory. " ) ;
return NULL ;
}
2006-06-26 11:27:25 +04:00
if ( ! try_module_get ( THIS_MODULE ) )
2008-02-08 05:10:19 +03:00
goto bad_module_get ;
2006-06-26 11:27:25 +04:00
2005-04-17 02:20:36 +04:00
/* get a minor number for the dev */
2006-06-26 11:27:32 +04:00
if ( minor = = DM_ANY_MINOR )
2008-04-25 01:10:59 +04:00
r = next_free_minor ( & minor ) ;
2006-06-26 11:27:32 +04:00
else
2008-04-25 01:10:59 +04:00
r = specific_minor ( minor ) ;
2005-04-17 02:20:36 +04:00
if ( r < 0 )
2008-02-08 05:10:19 +03:00
goto bad_minor ;
2005-04-17 02:20:36 +04:00
2013-07-11 02:41:18 +04:00
r = init_srcu_struct ( & md - > io_barrier ) ;
if ( r < 0 )
goto bad_io_barrier ;
2016-02-22 20:16:21 +03:00
md - > numa_node_id = numa_node_id ;
2016-01-31 20:05:42 +03:00
md - > init_tio_pdu = false ;
2010-08-12 07:14:01 +04:00
md - > type = DM_TYPE_NONE ;
2008-02-08 05:10:08 +03:00
mutex_init ( & md - > suspend_lock ) ;
2010-08-12 07:14:01 +04:00
mutex_init ( & md - > type_lock ) ;
2014-08-13 22:53:43 +04:00
mutex_init ( & md - > table_devices_lock ) ;
2009-04-02 22:55:39 +04:00
spin_lock_init ( & md - > deferred_lock ) ;
2005-04-17 02:20:36 +04:00
atomic_set ( & md - > holders , 1 ) ;
2006-06-26 11:27:34 +04:00
atomic_set ( & md - > open_count , 0 ) ;
2005-04-17 02:20:36 +04:00
atomic_set ( & md - > event_nr , 0 ) ;
2007-10-20 01:48:01 +04:00
atomic_set ( & md - > uevent_seq , 0 ) ;
INIT_LIST_HEAD ( & md - > uevent_list ) ;
2014-08-13 22:53:43 +04:00
INIT_LIST_HEAD ( & md - > table_devices ) ;
2007-10-20 01:48:01 +04:00
spin_lock_init ( & md - > uevent_lock ) ;
2005-04-17 02:20:36 +04:00
2020-01-27 22:07:23 +03:00
/*
2020-07-01 11:59:43 +03:00
* default to bio - based until DM table is loaded and md - > type
* established . If request - based table is loaded : blk - mq will
* override accordingly .
2020-01-27 22:07:23 +03:00
*/
2021-05-21 08:51:03 +03:00
md - > disk = blk_alloc_disk ( md - > numa_node_id ) ;
2005-04-17 02:20:36 +04:00
if ( ! md - > disk )
2015-04-28 18:50:29 +03:00
goto bad ;
2021-05-21 08:51:03 +03:00
md - > queue = md - > disk - > queue ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:27:25 +04:00
init_waitqueue_head ( & md - > wait ) ;
2009-04-02 22:55:37 +04:00
INIT_WORK ( & md - > work , dm_wq_work ) ;
2006-06-26 11:27:25 +04:00
init_waitqueue_head ( & md - > eventq ) ;
2014-01-14 04:37:54 +04:00
init_completion ( & md - > kobj_holder . completion ) ;
2006-06-26 11:27:25 +04:00
2021-02-10 23:26:23 +03:00
md - > swap_bios = get_swap_bios ( ) ;
sema_init ( & md - > swap_bios_semaphore , md - > swap_bios ) ;
mutex_init ( & md - > swap_bios_lock ) ;
2005-04-17 02:20:36 +04:00
md - > disk - > major = _major ;
md - > disk - > first_minor = minor ;
2021-05-21 08:51:03 +03:00
md - > disk - > minors = 1 ;
2021-11-22 16:06:22 +03:00
md - > disk - > flags | = GENHD_FL_NO_PART ;
2005-04-17 02:20:36 +04:00
md - > disk - > fops = & dm_blk_dops ;
md - > disk - > queue = md - > queue ;
md - > disk - > private_data = md ;
sprintf ( md - > disk - > disk_name , " dm-%d " , minor ) ;
2017-04-12 22:35:44 +03:00
2021-11-29 13:21:36 +03:00
if ( IS_ENABLED ( CONFIG_FS_DAX ) ) {
2021-12-15 11:45:07 +03:00
md - > dax_dev = alloc_dax ( md , & dm_dax_ops ) ;
2021-11-29 13:21:35 +03:00
if ( IS_ERR ( md - > dax_dev ) ) {
md - > dax_dev = NULL ;
2018-03-30 03:22:13 +03:00
goto bad ;
2021-11-29 13:21:35 +03:00
}
2021-12-15 11:45:08 +03:00
set_dax_nocache ( md - > dax_dev ) ;
set_dax_nomc ( md - > dax_dev ) ;
2021-11-29 13:21:38 +03:00
if ( dax_add_host ( md - > dax_dev , md - > disk ) )
2018-03-30 03:22:13 +03:00
goto bad ;
}
2017-04-12 22:35:44 +03:00
2006-03-27 13:17:52 +04:00
format_dev_t ( md - > name , MKDEV ( _major , minor ) ) ;
2005-04-17 02:20:36 +04:00
2021-10-22 00:13:25 +03:00
md - > wq = alloc_workqueue ( " kdmflush/%s " , WQ_MEM_RECLAIM , 0 , md - > name ) ;
2008-02-08 05:11:17 +03:00
if ( ! md - > wq )
2015-04-28 18:50:29 +03:00
goto bad ;
2008-02-08 05:11:17 +03:00
2022-02-18 07:40:02 +03:00
md - > pending_io = alloc_percpu ( unsigned long ) ;
if ( ! md - > pending_io )
goto bad ;
2013-08-16 18:54:23 +04:00
dm_stats_init ( & md - > stats ) ;
2006-06-26 11:27:21 +04:00
/* Populate the mapping, nobody knows we exist yet */
2006-06-26 11:27:22 +04:00
spin_lock ( & _minor_lock ) ;
2006-06-26 11:27:21 +04:00
old_md = idr_replace ( & _minor_idr , md , minor ) ;
2006-06-26 11:27:22 +04:00
spin_unlock ( & _minor_lock ) ;
2006-06-26 11:27:21 +04:00
BUG_ON ( old_md ! = MINOR_ALLOCED ) ;
2005-04-17 02:20:36 +04:00
return md ;
2015-04-28 18:50:29 +03:00
bad :
cleanup_mapped_device ( md ) ;
2013-07-11 02:41:18 +04:00
bad_io_barrier :
2005-04-17 02:20:36 +04:00
free_minor ( minor ) ;
2008-02-08 05:10:19 +03:00
bad_minor :
2006-06-26 11:27:25 +04:00
module_put ( THIS_MODULE ) ;
2008-02-08 05:10:19 +03:00
bad_module_get :
2017-11-01 02:33:02 +03:00
kvfree ( md ) ;
2005-04-17 02:20:36 +04:00
return NULL ;
}
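alloc_dev() above unwinds failures through a ladder of labels (bad, bad_io_barrier, bad_minor, bad_module_get), each undoing one more step than the label below it. The userspace sketch below shows the same pattern in miniature. The three resources, the sk_fail_at fault-injection knob and the sk_live counter are hypothetical devices for demonstrating that no step leaks; they do not correspond to real kernel state.

```c
#include <stdlib.h>

struct sk_dev {
	void *minor;	/* stands in for the reserved minor number */
	void *barrier;	/* stands in for the SRCU io_barrier */
	void *disk;	/* stands in for the gendisk and later resources */
};

static int sk_fail_at;	/* hypothetical fault injection: step to fail, 0 = none */
static int sk_live;	/* live resources, used to check for leaks */

static void *sk_acquire(int step)
{
	void *p = (step == sk_fail_at) ? NULL : malloc(1);

	if (p)
		sk_live++;
	return p;
}

static void sk_release(void *p)
{
	if (p) {
		free(p);
		sk_live--;
	}
}

static struct sk_dev *sk_alloc_dev(void)
{
	struct sk_dev *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->minor = sk_acquire(1);
	if (!d->minor)
		goto bad_minor;
	d->barrier = sk_acquire(2);
	if (!d->barrier)
		goto bad_barrier;
	d->disk = sk_acquire(3);
	if (!d->disk)
		goto bad;
	return d;

bad:			/* labels run in reverse order of acquisition */
	sk_release(d->barrier);
bad_barrier:
	sk_release(d->minor);
bad_minor:
	free(d);
	return NULL;
}

static void sk_free_dev(struct sk_dev *d)
{
	sk_release(d->disk);
	sk_release(d->barrier);
	sk_release(d->minor);
	free(d);
}
```

Because each label falls through to the next, a failure at any step releases exactly the resources acquired before it.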
2007-10-20 01:38:43 +04:00
static void unlock_fs ( struct mapped_device * md ) ;
2005-04-17 02:20:36 +04:00
static void free_dev ( struct mapped_device * md )
{
2008-09-03 11:01:48 +04:00
int minor = MINOR ( disk_devt ( md - > disk ) ) ;
2006-02-25 00:04:25 +03:00
2009-06-22 13:12:17 +04:00
unlock_fs ( md ) ;
2014-10-18 03:46:36 +04:00
2015-04-28 18:50:29 +03:00
cleanup_mapped_device ( md ) ;
2015-03-24 00:01:43 +03:00
2014-08-13 22:53:43 +04:00
free_table_devices ( & md - > table_devices ) ;
2015-03-24 00:01:43 +03:00
dm_stats_cleanup ( & md - > stats ) ;
free_minor ( minor ) ;
2006-06-26 11:27:25 +04:00
module_put ( THIS_MODULE ) ;
2017-11-01 02:33:02 +03:00
kvfree ( md ) ;
2005-04-17 02:20:36 +04:00
}
/*
* Bind a table to the device .
*/
static void event_callback ( void * context )
{
2007-10-20 01:48:01 +04:00
unsigned long flags ;
LIST_HEAD ( uevents ) ;
2005-04-17 02:20:36 +04:00
struct mapped_device * md = ( struct mapped_device * ) context ;
2007-10-20 01:48:01 +04:00
spin_lock_irqsave ( & md - > uevent_lock , flags ) ;
list_splice_init ( & md - > uevent_list , & uevents ) ;
spin_unlock_irqrestore ( & md - > uevent_lock , flags ) ;
2008-08-25 14:56:05 +04:00
dm_send_uevents ( & uevents , & disk_to_dev ( md - > disk ) - > kobj ) ;
2007-10-20 01:48:01 +04:00
2005-04-17 02:20:36 +04:00
atomic_inc ( & md - > event_nr ) ;
wake_up ( & md - > eventq ) ;
2017-09-20 14:29:49 +03:00
dm_issue_global_event ( ) ;
2005-04-17 02:20:36 +04:00
}
2009-12-11 02:52:24 +03:00
/*
 * Returns old map, which caller must destroy.
 */
static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
			       struct queue_limits *limits)
2005-04-17 02:20:36 +04:00
{
2009-12-11 02:52:24 +03:00
	struct dm_table *old_map;
2005-04-17 02:20:36 +04:00
	sector_t size;
2018-06-07 23:42:06 +03:00
	int ret;
2005-04-17 02:20:36 +04:00
2016-09-01 01:17:04 +03:00
	lockdep_assert_held(&md->suspend_lock);
2005-04-17 02:20:36 +04:00
	size = dm_table_get_size(t);
2006-03-27 13:17:54 +04:00
	/*
	 * Wipe any geometry if the size of the table changed.
	 */
2013-08-16 18:54:23 +04:00
	if (size != dm_get_size(md))
2006-03-27 13:17:54 +04:00
		memset(&md->geometry, 0, sizeof(md->geometry));
2021-03-22 17:13:54 +03:00
	if (!get_capacity(md->disk))
		set_capacity(md->disk, size);
	else
		set_capacity_and_notify(md->disk, size);
dm table: rework reference counting
Rework table reference counting.
The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.
This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
same process.
* stale reference from some other kernel code keeps the table
constructed, which keeps some devices open, even after successful
return from "dmsetup remove". This can confuse lvm and prevent closing
of underlying devices or reusing device minor numbers.
The patch changes reference counting so that the table destructor can be
called only at predetermined places.
The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders. A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.
Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.
When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor. We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't a performance-critical task: the user doesn't
care if it takes one tick more or not.
This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.
Finally remove the temporary protection added to dm_any_congested().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 06:05:10 +03:00
2005-07-29 08:16:00 +04:00
	dm_table_event_callback(t, event_callback, md);
2022-02-22 21:38:02 +03:00
	if (dm_table_request_based(t)) {
2016-02-01 01:22:27 +03:00
		/*
2020-10-07 22:15:08 +03:00
		 * Leverage the fact that request-based DM targets are
		 * immutable singletons - used to optimize dm_mq_queue_rq.
2016-02-01 01:22:27 +03:00
		 */
		md->immutable_target = dm_table_get_immutable_target(t);
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.
                      type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, the type can't be changed.
o Mempool allocations are deferred until table loading time, since
mempools for request-based dm are different from those for bio-based
dm and the needed mempool type is fixed by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
needs that all target drivers support merge() function.
Both will take time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:36 +04:00
2022-06-08 09:34:06 +03:00
		/*
		 * There is no need to reload with request-based dm because the
		 * size of front_pad doesn't change.
		 *
		 * Note for future: If you are to reload bioset, prep-ed
		 * requests in the queue may refer to bio from the old bioset,
		 * so you must walk through the queue to unprep.
		 */
		if (!md->mempools) {
			md->mempools = t->mempools;
			t->mempools = NULL;
		}
	} else {
		/*
		 * The md may already have mempools that need changing.
		 * If so, reload bioset because front_pad may have changed
		 * because a different table was loaded.
		 */
		dm_free_md_mempools(md->mempools);
		md->mempools = t->mempools;
		t->mempools = NULL;
2018-06-07 23:42:06 +03:00
	}
2009-06-22 13:12:36 +04:00
2022-02-22 21:38:02 +03:00
	ret = dm_table_set_restrictions(t, md->queue, limits);
dm: introduce zone append emulation
For zoned targets that cannot support zone append operations, implement
an emulation using regular write operations. If the original BIO
submitted by the user is a zone append operation, change its clone into
a regular write operation directed at the target zone write pointer
position.
To do so, an array of write pointer offsets (write pointer position
relative to the start of a zone) is added to struct mapped_device. All
operations that modify a sequential zone write pointer (writes, zone
reset, zone finish and zone append) are intercepted in __map_bio() and
processed using the new functions dm_zone_map_bio().
Detection of the target ability to natively support zone append
operations is done from dm_table_set_restrictions() by calling the
function dm_set_zones_restrictions(). A target that does not support
zone append operation, either by explicitly declaring it using the new
struct dm_target field zone_append_not_supported, or because the device
table contains a non-zoned device, has its mapped device marked with the
new flag DMF_ZONE_APPEND_EMULATED. The helper function
dm_emulate_zone_append() is introduced to test a mapped device for this
new flag.
Atomicity of the zones write pointer tracking and updates is done using
a zone write locking mechanism based on a bitmap. This is similar to
the block layer method but based on BIOs rather than struct request.
A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
an operation type that changes the BIO target zone write pointer
position. The zone write lock is released if the clone BIO is failed
before submission or when dm_zone_endio() is called when the clone BIO
completes.
The zone write lock bitmap of the mapped device, together with a bitmap
indicating zone types (conv_zones_bitmap) and the write pointer offset
array (zwp_offset) are allocated and initialized with a full device zone
report in dm_set_zones_restrictions() using the function
dm_revalidate_zones().
For failed operations that may have modified a zone write pointer, the
zone write pointer offset is marked as invalid in dm_zone_endio().
Zones with an invalid write pointer offset are checked and the write
pointer updated using an internal report zone operation when the
faulty zone is accessed again by the user.
All functions added for this emulation have a minimal overhead for
zoned targets natively supporting zone append operations. Regular
device targets are also not affected. The added code also does not
impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
dm zone related functions.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-26 00:25:00 +03:00
	if (ret) {
		old_map = ERR_PTR(ret);
		goto out;
	}
2014-11-23 20:34:29 +03:00
	old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
2016-02-22 22:14:24 +03:00
	rcu_assign_pointer(md->map, (void *)t);
2011-11-01 00:19:04 +04:00
	md->immutable_target_type = dm_table_get_immutable_target_type(t);
2014-11-05 16:35:50 +03:00
	if (old_map)
		dm_sync_table(md);
2018-06-07 23:42:06 +03:00
out:
2009-12-11 02:52:24 +03:00
	return old_map;
2005-04-17 02:20:36 +04:00
}
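The "dm: introduce zone append emulation" message quoted inside `__bind()` describes redirecting a zone append to the zone's current write pointer using a per-zone offset array. The following is a minimal user-space sketch of that idea only, not the kernel implementation: `struct zoned_dev`, `emulate_zone_append()`, `NR_ZONES`, `ZONE_SECTORS` and `WP_INVALID` are hypothetical names, and the zone write locking the message mentions is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ZONES     4      /* hypothetical: number of zones on the device */
#define ZONE_SECTORS 256u   /* hypothetical: sectors per zone */
#define WP_INVALID   UINT32_MAX /* hypothetical marker: write pointer unknown */

/* Hypothetical stand-in for the per-device write pointer offset array. */
struct zoned_dev {
	uint32_t zwp_offset[NR_ZONES]; /* write pointer offset inside each zone */
};

/*
 * Turn a zone append aimed at the zone containing `sector` into a regular
 * write at that zone's write pointer. Returns the sector to write to, or
 * UINT64_MAX when the zone is out of range, full, or has an unknown
 * write pointer.
 */
uint64_t emulate_zone_append(struct zoned_dev *dev, uint64_t sector,
			     uint32_t nr_sectors)
{
	uint64_t zno = sector / ZONE_SECTORS;
	uint32_t wp;

	if (zno >= NR_ZONES)
		return UINT64_MAX;

	wp = dev->zwp_offset[zno];
	if (wp == WP_INVALID || nr_sectors > ZONE_SECTORS - wp)
		return UINT64_MAX;

	/* Advance the tracked write pointer past the data being written. */
	dev->zwp_offset[zno] = wp + nr_sectors;
	assert(dev->zwp_offset[zno] <= ZONE_SECTORS);

	return zno * ZONE_SECTORS + wp;
}
```

Two appends of 8 sectors to the zone starting at sector 256 land at sectors 256 and 264 in this sketch; a zone marked `WP_INVALID` is refused until its write pointer has been refreshed, which loosely mirrors the "invalid write pointer offset" handling the message describes.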
2009-12-11 02:52:23 +03:00
/*
 * Returns unbound table for the caller to free.
 */
static struct dm_table *__unbind(struct mapped_device *md)
2005-04-17 02:20:36 +04:00
{
2014-11-23 20:34:29 +03:00
	struct dm_table *map = rcu_dereference_protected(md->map, 1);
2005-04-17 02:20:36 +04:00
	if (!map)
2009-12-11 02:52:23 +03:00
		return NULL;
2005-04-17 02:20:36 +04:00
	dm_table_event_callback(map, NULL, NULL);
2014-03-23 22:28:27 +04:00
	RCU_INIT_POINTER(md->map, NULL);
2013-07-11 02:41:18 +04:00
	dm_sync_table(md);
2009-12-11 02:52:23 +03:00
	return map;
2005-04-17 02:20:36 +04:00
}
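Both `__bind()` and `__unbind()` hand a table back to the caller for destruction. The "dm table: rework reference counting" message explains why: the owning reference is not counted, only temporary holders are, and destruction waits for those to drain. A rough user-space sketch of that scheme follows. The names `struct table`, `table_create()`, `table_get()`, `table_put()` and `table_destroy()` are hypothetical, and the busy loop stands in for the `msleep(1)` polling the message describes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical table object: only temporary references are counted. */
struct table {
	atomic_int holders;  /* temporary references from other code */
	bool *destroyed;     /* hypothetical hook so callers can observe teardown */
};

struct table *table_create(bool *destroyed_flag)
{
	struct table *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	atomic_init(&t->holders, 0);
	t->destroyed = destroyed_flag;
	return t;
}

/* Take and drop a temporary reference; neither ever frees the table. */
void table_get(struct table *t) { atomic_fetch_add(&t->holders, 1); }
void table_put(struct table *t) { atomic_fetch_sub(&t->holders, 1); }

/*
 * Called only by the single owner, at a predetermined place. Waits for
 * all temporary holders to go away, then frees the table.
 */
void table_destroy(struct table *t)
{
	while (atomic_load(&t->holders) > 0)
		; /* the kernel code sleeps between checks instead of spinning */

	if (t->destroyed)
		*t->destroyed = true;
	free(t);
}
```

Because `table_put()` never frees anything, it stays safe to call from contexts that must not sleep, which is the property the commit message is after.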
/*
 * Constructor for a new device.
 */
2006-06-26 11:27:32 +04:00
int dm_create(int minor, struct mapped_device **result)
2005-04-17 02:20:36 +04:00
{
	struct mapped_device *md;
2006-06-26 11:27:32 +04:00
	md = alloc_dev(minor);
2005-04-17 02:20:36 +04:00
	if (!md)
		return -ENXIO;
dm ima: measure data on table load
DM configures a block device with various target specific attributes
passed to it as a table. DM loads the table, and calls each target’s
respective constructors with the attributes as input parameters.
Some of these attributes are critical to ensure the device meets
certain security bar. Thus, IMA should measure these attributes, to
ensure they are not tampered with, during the lifetime of the device.
So that the external services can have high confidence in the
configuration of the block-devices on a given system.
Some devices may have large tables. And a given device may change its
state (table-load, suspend, resume, rename, remove, table-clear etc.)
many times. Measuring these attributes each time when the device
changes its state will significantly increase the size of the IMA logs.
Further, once configured, these attributes are not expected to change
unless a new table is loaded, or a device is removed and recreated.
Therefore the clear-text of the attributes should only be measured
during table load, and the hash of the active/inactive table should be
measured for the remaining device state changes.
Export IMA function ima_measure_critical_data() to allow measurement
of DM device parameters, as well as target specific attributes, during
table load. Compute the hash of the inactive table and store it for
measurements during future state change. If a load is called multiple
times, update the inactive table hash with the hash of the latest
populated table. So that the correct inactive table hash is measured
when the device transitions to different states like resume, remove,
rename, etc.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-07-13 03:48:58 +03:00
	dm_ima_reset_data(md);
2005-04-17 02:20:36 +04:00
	*result = md;
	return 0;
}
2010-08-12 07:14:01 +04:00
/*
 * Functions to manage md->type.
 * All are required to hold md->type_lock.
 */
void dm_lock_md_type(struct mapped_device *md)
{
	mutex_lock(&md->type_lock);
}
void dm_unlock_md_type(struct mapped_device *md)
{
	mutex_unlock(&md->type_lock);
}
2017-04-27 20:11:23 +03:00
void dm_set_md_type(struct mapped_device *md, enum dm_queue_mode type)
2010-08-12 07:14:01 +04:00
{
2013-08-28 02:57:03 +04:00
	BUG_ON(!mutex_is_locked(&md->type_lock));
2010-08-12 07:14:01 +04:00
	md->type = type;
}
2017-04-27 20:11:23 +03:00
enum dm_queue_mode dm_get_md_type(struct mapped_device *md)
2010-08-12 07:14:01 +04:00
{
	return md->type;
}
2011-11-01 00:19:04 +04:00
struct target_type *dm_get_immutable_target_type(struct mapped_device *md)
{
	return md->immutable_target_type;
}
dm mpath: disable WRITE SAME if it fails
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.
The WRITE SAME heuristics, with both the original commit 5db44863b6eb
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).
When a device is stacked on top of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled. This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.
This fix doesn't help configurations that have additional devices
stacked on top of the mpath device (e.g. LVM created linear DM devices
on top). A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.
Before this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920
After this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0
It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # 3.10+
2013-09-19 20:13:58 +04:00
/*
 * The queue_limits are only valid as long as you have a reference
 * count on 'md'.
 */
struct queue_limits *dm_get_queue_limits(struct mapped_device *md)
{
	BUG_ON(!atomic_read(&md->holders));
	return &md->queue->limits;
}
EXPORT_SYMBOL_GPL(dm_get_queue_limits);
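The WRITE SAME message above notes that a change to an underlying device's limit does not propagate up the stack, and mentions restacking limits from the bottom up as a possible future fix. The sketch below only illustrates what such a recomputation could look like for a single limit. It is not what the patch does and it is not the block layer's own stacking helper; `stack_write_same_limit()` is a hypothetical name, and taking the plain minimum is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Recompute a stacked device's WRITE SAME limit (in bytes) from the limits
 * of the devices underneath it. A value of 0 means "not supported", so one
 * underlying device with 0 disables the feature for the whole stack.
 */
uint32_t stack_write_same_limit(const uint32_t *lower_limits, size_t n)
{
	uint32_t limit = UINT32_MAX;
	size_t i;

	if (n == 0)
		return 0; /* nothing underneath: treat as unsupported */

	for (i = 0; i < n; i++) {
		if (lower_limits[i] < limit)
			limit = lower_limits[i];
	}
	return limit;
}
```

With the values from the log excerpt, two paths reporting 33553920 give 33553920, and once one path drops to 0 the recomputed limit for the stacked device is 0 as well.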
2010-08-12 07:14:02 +04:00
/*
 * Setup the DM device's queue based on md's type
 */
2016-01-31 20:05:42 +03:00
int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
2010-08-12 07:14:02 +04:00
{
2021-08-04 12:41:45 +03:00
	enum dm_queue_mode type = dm_table_get_type(t);
2018-01-09 04:03:04 +03:00
	struct queue_limits limits;
2021-08-04 12:41:45 +03:00
	int r;
2015-03-08 08:51:47 +03:00
2016-06-23 02:54:53 +03:00
	switch (type) {
2015-03-08 08:51:47 +03:00
	case DM_TYPE_REQUEST_BASED:
2020-10-07 23:41:01 +03:00
		md->disk->fops = &dm_rq_blk_dops;
2016-05-25 04:16:51 +03:00
		r = dm_mq_init_request_queue(md, t);
2015-03-08 08:51:47 +03:00
		if (r) {
2020-10-07 23:41:01 +03:00
			DMERR("Cannot initialize queue for request-based dm mapped device");
2015-03-08 08:51:47 +03:00
			return r;
		}
		break;
	case DM_TYPE_BIO_BASED:
2016-06-23 02:54:53 +03:00
	case DM_TYPE_DAX_BIO_BASED:
2015-03-08 08:51:47 +03:00
		break;
2017-04-27 20:11:23 +03:00
	case DM_TYPE_NONE:
		WARN_ON_ONCE(true);
		break;
2010-08-12 07:14:02 +04:00
	}
2018-01-09 04:03:04 +03:00
	r = dm_calculate_queue_limits(t, &limits);
	if (r) {
		DMERR("Cannot calculate initial queue limits");
		return r;
	}
2021-05-26 00:25:00 +03:00
	r = dm_table_set_restrictions(t, md->queue, &limits);
	if (r)
		return r;
2021-10-16 02:30:22 +03:00
	r = add_disk(md->disk);
	if (r)
		return r;
2018-01-09 04:03:04 +03:00
2021-08-04 12:41:46 +03:00
	r = dm_sysfs_init(md);
	if (r) {
		del_gendisk(md->disk);
		return r;
	}
	md->type = type;
2010-08-12 07:14:02 +04:00
	return 0;
}
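`dm_setup_md_queue()` branches on the device type, and the "dm: enable request based option" message quoted earlier gives the stacking rule between the two families: a request-based device cannot sit on top of a bio-based one, because the lower device cannot receive a `struct request`. A small sketch of that rule as a predicate is shown here; `enum dev_kind` and `can_stack()` are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical classification of a device by how it accepts I/O. */
enum dev_kind {
	DEV_BIO_BASED,
	DEV_REQUEST_BASED,
};

/*
 * Returns true if a device of kind `upper` may be stacked on a device of
 * kind `lower`, following the table in the commit message:
 *
 *   upper \ lower     bio-based   request-based
 *   bio-based            OK            OK
 *   request-based        --            OK
 */
bool can_stack(enum dev_kind upper, enum dev_kind lower)
{
	/* A bio-based lower device cannot accept struct request from above. */
	return !(upper == DEV_REQUEST_BASED && lower == DEV_BIO_BASED);
}
```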
dm: fix a race condition in dm_get_md
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-17 22:30:53 +03:00
struct mapped_device *dm_get_md(dev_t dev)
2005-04-17 02:20:36 +04:00
{
	struct mapped_device *md;
	unsigned minor = MINOR(dev);
	if (MAJOR(dev) != _major || minor >= (1 << MINORBITS))
		return NULL;
2006-06-26 11:27:22 +04:00
	spin_lock(&_minor_lock);
2005-04-17 02:20:36 +04:00
	md = idr_find(&_minor_idr, minor);
2017-11-07 00:40:10 +03:00
	if (!md || md == MINOR_ALLOCED || (MINOR(disk_devt(dm_disk(md))) != minor) ||
	    test_bit(DMF_FREEING, &md->flags) || dm_deleting_md(md)) {
		md = NULL;
		goto out;
2006-06-26 11:27:23 +04:00
	}
2017-11-07 00:40:10 +03:00
	dm_get(md);
2006-06-26 11:27:23 +04:00
out:
2006-06-26 11:27:22 +04:00
	spin_unlock(&_minor_lock);
2005-04-17 02:20:36 +04:00
2006-01-06 11:20:00 +03:00
	return md;
}
2011-11-01 00:19:06 +04:00
EXPORT_SYMBOL_GPL(dm_get_md);
2006-01-06 11:20:01 +03:00
2006-03-27 13:17:53 +04:00
void *dm_get_mdptr(struct mapped_device *md)
2006-01-06 11:20:00 +03:00
{
2006-03-27 13:17:53 +04:00
	return md->interface_ptr;
2005-04-17 02:20:36 +04:00
}
void dm_set_mdptr(struct mapped_device *md, void *ptr)
{
	md->interface_ptr = ptr;
}
void dm_get(struct mapped_device *md)
{
	atomic_inc(&md->holders);
dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed there is a case to crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
    CPU0                            CPU1
    =================================================================
                                    <<INTERRUPT>>
                                    blk_end_request_all(clone_rq)
                                    blk_update_request(clone_rq)
                                    bio_endio(clone_bio) == end_clone_bio
                                    blk_update_request(orig_rq)
                                    bio_endio(orig_bio)
    <<I/O completed>>
    dm_blk_close()
    dev_remove()
    dm_put(md)
    <<Free md>>
                                    blk_finish_request(clone_rq)
                                    ....
                                    dm_end_request(clone_rq)
                                    free_rq_clone(clone_rq)
                                    blk_end_request_all(orig_rq)
                                    rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that is actually deleting
the mapped_device in ioctl (process) context, and changes dm_put()
just to decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.
For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 07:13:56 +04:00
	BUG_ON(test_bit(DMF_FREEING, &md->flags));
2005-04-17 02:20:36 +04:00
}
dm snapshot: suspend merging snapshot when doing exception handover
The "dm snapshot: suspend origin when doing exception handover" commit
fixed an exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.
However, a similar problem exists in snapshot merging. When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin". Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.
To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.
Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.
In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear). Then we release _origins_lock, suspend the
device and grab _origins_lock again.
NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-26 19:41:28 +03:00
int dm_hold(struct mapped_device *md)
{
	spin_lock(&_minor_lock);
	if (test_bit(DMF_FREEING, &md->flags)) {
		spin_unlock(&_minor_lock);
		return -EBUSY;
	}
	dm_get(md);
	spin_unlock(&_minor_lock);
	return 0;
}
EXPORT_SYMBOL_GPL(dm_hold);
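`dm_get_md()` and `dm_hold()` share one idea from their commit messages: the "is this device being freed?" check and the reference increment must happen under the same lock, otherwise the device can start going away in between. Here is a simplified user-space sketch of both, assuming a small fixed table in place of the IDR and a C11 `atomic_flag` spinlock in place of `_minor_lock`. All names (`struct dev`, `minor_table`, `dev_get_by_minor()`, `dev_hold()`) are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MINORS 8 /* hypothetical table size */

/* Hypothetical device: a holder count plus a "being freed" flag. */
struct dev {
	atomic_int holders;
	bool freeing;
};

struct dev *minor_table[MAX_MINORS];              /* stand-in for the minor IDR */
static atomic_flag minor_lock = ATOMIC_FLAG_INIT; /* stand-in for _minor_lock */

static void minor_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&minor_lock))
		; /* spin until the lock is free */
}

static void minor_lock_release(void)
{
	atomic_flag_clear(&minor_lock);
}

/* Look up a device and take a reference without dropping the lock between. */
struct dev *dev_get_by_minor(unsigned int minor)
{
	struct dev *d;

	if (minor >= MAX_MINORS)
		return NULL;

	minor_lock_acquire();
	d = minor_table[minor];
	if (d && d->freeing)
		d = NULL;
	if (d)
		atomic_fetch_add(&d->holders, 1);
	minor_lock_release();
	return d;
}

/* Take a reference on a known device, failing if it is being freed. */
int dev_hold(struct dev *d)
{
	int ret = 0;

	minor_lock_acquire();
	if (d->freeing)
		ret = -EBUSY;
	else
		atomic_fetch_add(&d->holders, 1);
	minor_lock_release();
	return ret;
}
```

The sketch assumes the teardown path sets `freeing` while holding the same lock, so a caller either gets a counted reference or a clean failure, never a reference to a device that is already on its way out.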
2006-06-26 11:27:35 +04:00
const char *dm_device_name(struct mapped_device *md)
{
	return md->name;
}
EXPORT_SYMBOL_GPL(dm_device_name);
dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed there is a case to crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
CPU0 CPU1
=================================================================
<<INTERRUPT>>
blk_end_request_all(clone_rq)
blk_update_request(clone_rq)
bio_endio(clone_bio) == end_clone_bio
blk_update_request(orig_rq)
bio_endio(orig_bio)
<<I/O completed>>
dm_blk_close()
dev_remove()
dm_put(md)
<<Free md>>
blk_finish_request(clone_rq)
....
dm_end_request(clone_rq)
free_rq_clone(clone_rq)
blk_end_request_all(orig_rq)
rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that is actually deleting
the mapped_device in ioctl (process) context, and changes dm_put()
just to decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't a performance-critical task.
Since at this point nobody has the mapped_device open and no new
reference will be taken, the pending counts only reflect racing
completion activity and will eventually decrease to zero.
For the unlikely case of a forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may get stuck and never return.
Because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer dereferences.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
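The create/get/put/destroy model described above can be pictured with a small userspace sketch. This is not the kernel implementation: struct toy_md and the toy_*() functions are hypothetical names, C11 atomics stand in for the kernel's atomic_t, the count starts at zero for simplicity, and the idle callback stands in for msleep(1).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical stand-in for struct mapped_device; only what the model needs.
struct toy_md {
    atomic_int holders;   // reference count, playing the role of md->holders
    atomic_bool freeing;  // plays the role of the DMF_FREEING flag
};

struct toy_md *toy_create(void)
{
    struct toy_md *md = malloc(sizeof(*md));
    if (!md)
        return NULL;
    atomic_init(&md->holders, 0);
    atomic_init(&md->freeing, false);
    return md;
}

// Taking and dropping a reference never frees anything, so both are
// safe to call from any context in this model.
void toy_get(struct toy_md *md) { atomic_fetch_add(&md->holders, 1); }
void toy_put(struct toy_md *md) { atomic_fetch_sub(&md->holders, 1); }

// Simulates one racing completion that drops its reference.
void toy_complete_one(struct toy_md *md) { toy_put(md); }

// Marks the device as freeing, optionally polls until all references are
// gone, then frees it. Returns how many times it had to poll.
int toy_destroy(struct toy_md *md, bool wait, void (*idle)(struct toy_md *))
{
    int polls = 0;

    atomic_store(&md->freeing, true);
    if (wait) {
        while (atomic_load(&md->holders) > 0) {
            idle(md);  // stands in for msleep(1)
            polls++;
        }
    }
    free(md);
    return polls;
}
```

With two references outstanding, toy_destroy(md, true, toy_complete_one) polls twice before freeing; passing wait = false mirrors the forced path and frees immediately.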
2010-08-12 07:13:56 +04:00
static void __dm_destroy ( struct mapped_device * md , bool wait )
2005-04-17 02:20:36 +04:00
{
2006-03-27 13:17:54 +04:00
struct dm_table * map ;
2013-07-11 02:41:18 +04:00
int srcu_idx ;
2005-04-17 02:20:36 +04:00
2010-08-12 07:13:56 +04:00
might_sleep ( ) ;
2006-06-26 11:27:23 +04:00
2015-03-24 00:01:43 +03:00
spin_lock ( & _minor_lock ) ;
2010-08-12 07:13:56 +04:00
idr_replace ( & _minor_idr , MINOR_ALLOCED , MINOR ( disk_devt ( dm_disk ( md ) ) ) ) ;
set_bit ( DMF_FREEING , & md - > flags ) ;
spin_unlock ( & _minor_lock ) ;
2016-09-01 01:17:49 +03:00
2022-02-17 10:52:31 +03:00
blk_mark_disk_dead ( md - > disk ) ;
2010-08-12 07:13:56 +04:00
2015-02-27 22:04:27 +03:00
/*
* Take suspend_lock so that presuspend and postsuspend methods
* do not race with internal suspend .
*/
mutex_lock ( & md - > suspend_lock ) ;
2015-10-01 11:31:51 +03:00
map = dm_get_live_table ( md , & srcu_idx ) ;
2010-08-12 07:13:56 +04:00
if ( ! dm_suspended_md ( md ) ) {
dm_table_presuspend_targets ( map ) ;
2020-02-24 12:20:28 +03:00
set_bit ( DMF_SUSPENDED , & md - > flags ) ;
2020-07-23 17:42:09 +03:00
set_bit ( DMF_POST_SUSPENDING , & md - > flags ) ;
2010-08-12 07:13:56 +04:00
dm_table_postsuspend_targets ( map ) ;
2005-04-17 02:20:36 +04:00
}
2013-07-11 02:41:18 +04:00
/* dm_put_live_table must be before msleep, otherwise deadlock is possible */
dm_put_live_table ( md , srcu_idx ) ;
2015-10-01 11:31:51 +03:00
mutex_unlock ( & md - > suspend_lock ) ;
2013-07-11 02:41:18 +04:00
2010-08-12 07:13:56 +04:00
/*
* Rare , but there may be I / O requests still going to complete ,
* for example . Wait for all references to disappear .
* No one should increment the reference count of the mapped_device ,
* after the mapped_device state becomes DMF_FREEING .
*/
if ( wait )
while ( atomic_read ( & md - > holders ) )
msleep ( 1 ) ;
else if ( atomic_read ( & md - > holders ) )
DMWARN ( " %s: Forcibly removing mapped_device still in use! (%d users) " ,
dm_device_name ( md ) , atomic_read ( & md - > holders ) ) ;
dm_table_destroy ( __unbind ( md ) ) ;
free_dev ( md ) ;
}
void dm_destroy ( struct mapped_device * md )
{
__dm_destroy ( md , true ) ;
}
void dm_destroy_immediate ( struct mapped_device * md )
{
__dm_destroy ( md , false ) ;
}
void dm_put ( struct mapped_device * md )
{
atomic_dec ( & md - > holders ) ;
2005-04-17 02:20:36 +04:00
}
2007-05-09 13:32:56 +04:00
EXPORT_SYMBOL_GPL ( dm_put ) ;
2005-04-17 02:20:36 +04:00
2022-02-18 07:40:02 +03:00
static bool dm_in_flight_bios ( struct mapped_device * md )
2020-06-24 23:00:58 +03:00
{
int cpu ;
2022-02-18 07:40:02 +03:00
unsigned long sum = 0 ;
2020-06-24 23:00:58 +03:00
2022-02-18 07:40:02 +03:00
for_each_possible_cpu ( cpu )
sum + = * per_cpu_ptr ( md - > pending_io , cpu ) ;
2020-06-24 23:00:58 +03:00
return sum ! = 0 ;
}
2021-06-11 11:28:17 +03:00
static int dm_wait_for_bios_completion ( struct mapped_device * md , unsigned int task_state )
2008-02-08 05:10:30 +03:00
{
int r = 0 ;
2016-09-01 01:16:43 +03:00
DEFINE_WAIT ( wait ) ;
2008-02-08 05:10:30 +03:00
2020-06-24 23:00:58 +03:00
while ( true ) {
2016-09-01 01:16:43 +03:00
prepare_to_wait ( & md - > wait , & wait , task_state ) ;
2008-02-08 05:10:30 +03:00
2022-02-18 07:40:02 +03:00
if ( ! dm_in_flight_bios ( md ) )
2008-02-08 05:10:30 +03:00
break ;
2016-09-01 01:16:22 +03:00
if ( signal_pending_state ( task_state , current ) ) {
2008-02-08 05:10:30 +03:00
r = - EINTR ;
break ;
}
io_schedule ( ) ;
}
2016-09-01 01:16:43 +03:00
finish_wait ( & md - > wait , & wait ) ;
2009-04-02 22:55:39 +04:00
2022-02-18 07:40:02 +03:00
smp_rmb ( ) ;
2008-02-08 05:10:30 +03:00
return r ;
}
2021-06-11 11:28:17 +03:00
static int dm_wait_for_completion ( struct mapped_device * md , unsigned int task_state )
2020-06-24 23:00:58 +03:00
{
int r = 0 ;
if ( ! queue_is_mq ( md - > queue ) )
return dm_wait_for_bios_completion ( md , task_state ) ;
while ( true ) {
if ( ! blk_mq_queue_inflight ( md - > queue ) )
break ;
if ( signal_pending_state ( task_state , current ) ) {
r = - EINTR ;
break ;
}
msleep ( 5 ) ;
}
return r ;
}
2005-04-17 02:20:36 +04:00
/*
* Process the deferred bios
*/
2009-04-02 22:55:38 +04:00
static void dm_wq_work ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
2020-09-28 20:41:36 +03:00
struct mapped_device * md = container_of ( work , struct mapped_device , work ) ;
struct bio * bio ;
2009-04-02 22:55:38 +04:00
2009-04-09 03:27:15 +04:00
while ( ! test_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ) {
2009-04-09 03:27:13 +04:00
spin_lock_irq ( & md - > deferred_lock ) ;
2020-09-28 20:41:36 +03:00
bio = bio_list_pop ( & md - > deferred ) ;
2009-04-09 03:27:13 +04:00
spin_unlock_irq ( & md - > deferred_lock ) ;
2020-09-28 20:41:36 +03:00
if ( ! bio )
2009-04-09 03:27:13 +04:00
break ;
2009-04-02 22:55:39 +04:00
2020-09-28 20:41:36 +03:00
submit_bio_noacct ( bio ) ;
2009-04-02 22:55:39 +04:00
}
2005-04-17 02:20:36 +04:00
}
2009-04-02 22:55:36 +04:00
static void dm_queue_flush ( struct mapped_device * md )
2008-02-08 05:11:17 +03:00
{
2009-04-09 03:27:15 +04:00
clear_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ;
2014-03-17 21:06:10 +04:00
smp_mb__after_atomic ( ) ;
2009-04-02 22:55:37 +04:00
queue_work ( md - > wq , & md - > work ) ;
2008-02-08 05:11:17 +03:00
}
2005-04-17 02:20:36 +04:00
/*
2009-12-11 02:52:24 +03:00
* Swap in a new table , returning the old one for the caller to destroy .
2005-04-17 02:20:36 +04:00
*/
2009-12-11 02:52:24 +03:00
struct dm_table * dm_swap_table ( struct mapped_device * md , struct dm_table * table )
2005-04-17 02:20:36 +04:00
{
2013-03-02 02:45:48 +04:00
struct dm_table * live_map = NULL , * map = ERR_PTR ( - EINVAL ) ;
2009-06-22 13:12:34 +04:00
struct queue_limits limits ;
2009-12-11 02:52:24 +03:00
int r ;
2005-04-17 02:20:36 +04:00
2008-02-08 05:10:08 +03:00
mutex_lock ( & md - > suspend_lock ) ;
2005-04-17 02:20:36 +04:00
/* device must be suspended */
2009-12-11 02:52:26 +03:00
if ( ! dm_suspended_md ( md ) )
2005-07-13 02:53:05 +04:00
goto out ;
2005-04-17 02:20:36 +04:00
2012-09-27 02:45:45 +04:00
/*
* If the new table has no data devices , retain the existing limits .
* This helps multipath with queue_if_no_path if all paths disappear ,
* then new I / O is queued based on these limits , and then some paths
* reappear .
*/
if ( dm_table_has_no_data_devices ( table ) ) {
2013-07-11 02:41:18 +04:00
live_map = dm_get_live_table_fast ( md ) ;
2012-09-27 02:45:45 +04:00
if ( live_map )
limits = md - > queue - > limits ;
2013-07-11 02:41:18 +04:00
dm_put_live_table_fast ( md ) ;
2012-09-27 02:45:45 +04:00
}
2013-03-02 02:45:48 +04:00
if ( ! live_map ) {
r = dm_calculate_queue_limits ( table , & limits ) ;
if ( r ) {
map = ERR_PTR ( r ) ;
goto out ;
}
2009-12-11 02:52:24 +03:00
}
2009-06-22 13:12:34 +04:00
2009-12-11 02:52:24 +03:00
map = __bind ( md , table , & limits ) ;
2017-09-20 14:29:49 +03:00
dm_issue_global_event ( ) ;
2005-04-17 02:20:36 +04:00
2005-07-13 02:53:05 +04:00
out :
2008-02-08 05:10:08 +03:00
mutex_unlock ( & md - > suspend_lock ) ;
2009-12-11 02:52:24 +03:00
return map ;
2005-04-17 02:20:36 +04:00
}
/*
* Functions to lock and unlock any filesystem running on the
* device .
*/
2005-07-29 08:16:00 +04:00
static int lock_fs ( struct mapped_device * md )
2005-04-17 02:20:36 +04:00
{
2006-01-06 11:20:05 +03:00
int r ;
2005-04-17 02:20:36 +04:00
2020-11-24 13:54:06 +03:00
WARN_ON ( test_bit ( DMF_FROZEN , & md - > flags ) ) ;
2006-01-06 11:20:06 +03:00
2020-11-26 12:41:07 +03:00
r = freeze_bdev ( md - > disk - > part0 ) ;
2020-11-24 13:54:06 +03:00
if ( ! r )
set_bit ( DMF_FROZEN , & md - > flags ) ;
return r ;
2005-04-17 02:20:36 +04:00
}
2005-07-29 08:16:00 +04:00
static void unlock_fs ( struct mapped_device * md )
2005-04-17 02:20:36 +04:00
{
2006-01-06 11:20:06 +03:00
if ( ! test_bit ( DMF_FROZEN , & md - > flags ) )
return ;
2020-11-26 12:41:07 +03:00
thaw_bdev ( md - > disk - > part0 ) ;
2006-01-06 11:20:06 +03:00
clear_bit ( DMF_FROZEN , & md - > flags ) ;
2005-04-17 02:20:36 +04:00
}
/*
2016-09-01 01:16:02 +03:00
* @ suspend_flags : DM_SUSPEND_LOCKFS_FLAG and / or DM_SUSPEND_NOFLUSH_FLAG
* @ task_state : e . g . TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
* @ dmf_suspended_flag : DMF_SUSPENDED or DMF_SUSPENDED_INTERNALLY
*
2014-10-29 01:34:52 +03:00
* If __dm_suspend returns 0 , the device is completely quiescent
* now . There is no request - processing activity . All new requests
* are being added to md - > deferred list .
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
pref_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but by then all bios
in the request will have been completed, even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
  => blk_update_request()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => blk_finish_request()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all the dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
resume
======
Starts md->queue.
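The difference between noflush and flush suspend described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct toy_queue, TOY_MARKER and the toy_*() functions are hypothetical, requests are plain non-negative integers, and waiting for dispatched requests is modelled synchronously rather than with md->wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16
#define TOY_MARKER (-1)  // hypothetical sentinel standing in for the marker request

struct toy_queue {
    int items[TOY_QUEUE_MAX];
    size_t head, tail;  // pending requests live in items[head..tail)
    int in_flight;      // dispatched but not yet completed
    bool stopped;
};

bool toy_enqueue(struct toy_queue *q, int rq)
{
    if (q->tail == TOY_QUEUE_MAX)
        return false;
    q->items[q->tail++] = rq;
    return true;
}

// noflush suspend: just stop the queue; pending requests stay where they are.
// Returns the number of requests left queued.
size_t toy_suspend_noflush(struct toy_queue *q)
{
    q->stopped = true;
    return q->tail - q->head;
}

// flush suspend: returns the number of requests dispatched, or -1 if the
// marker could not be queued.
int toy_suspend_flush(struct toy_queue *q)
{
    int dispatched = 0;

    // Insert the marker at the tail of the queue.
    if (!toy_enqueue(q, TOY_MARKER))
        return -1;

    // Dispatch everything ahead of the marker.
    while (q->items[q->head] != TOY_MARKER) {
        q->head++;
        q->in_flight++;
        dispatched++;
    }

    // Wait for all dispatched requests to complete (modelled synchronously).
    while (q->in_flight > 0)
        q->in_flight--;

    // Complete the marker itself, then stop the queue.
    q->head++;
    q->stopped = true;
    return dispatched;
}
```

In this model a flush suspend leaves the queue empty and stopped, while a noflush suspend leaves the queued requests in place for a later resume.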
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:35 +04:00
*/
2014-10-29 01:34:52 +03:00
static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
2021-06-11 11:28:17 +03:00
unsigned suspend_flags, unsigned int task_state,
2016-08-02 20:07:20 +03:00
int dmf_suspended_flag)
2005-04-17 02:20:36 +04:00
{
2014-10-29 01:34:52 +03:00
bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG;
bool noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG;
int r;
2005-04-17 02:20:36 +04:00
2016-09-01 01:17:04 +03:00
lockdep_assert_held(&md->suspend_lock);
[PATCH] dm: suspend: add noflush pushback
In device-mapper, I/O is sometimes queued within targets for later processing.
For example, the multipath target can be configured to store I/O when no paths
are available instead of failing it with -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to ask the
device-mapper core to queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. in the first part of
dec_pending() or in target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but is returned to applications
with -EIO.
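The locking rule above can be illustrated with a small user-space sketch in C. It is only an illustration under stated assumptions, not the kernel code: every sketch_* name is hypothetical, a C11 atomic_flag spin lock stands in for md->pushback_lock, and an atomic_bool stands in for the DMF_NOFLUSH_SUSPENDING bit. The point it shows is that the flag is re-checked under the lock at the moment the bio is queued, so an aborted suspend never leaves a bio stranded on the list.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bio and struct mapped_device. */
struct sketch_bio {
	struct sketch_bio *next;
	int error;
};

struct sketch_md {
	atomic_bool noflush_suspending;	/* models DMF_NOFLUSH_SUSPENDING */
	atomic_flag pushback_lock;	/* models md->pushback_lock */
	struct sketch_bio *pushback;	/* models md->pushback */
};

static void sketch_md_init(struct sketch_md *md)
{
	atomic_init(&md->noflush_suspending, false);
	atomic_flag_clear(&md->pushback_lock);
	md->pushback = NULL;
}

static void sketch_lock(struct sketch_md *md)
{
	while (atomic_flag_test_and_set(&md->pushback_lock))
		;	/* spin until the lock is free */
}

static void sketch_unlock(struct sketch_md *md)
{
	atomic_flag_clear(&md->pushback_lock);
}

/*
 * Queue a bio for pushback. The flag check and the list insertion happen
 * under the same lock. Returns 0 if queued, -EIO if the noflush suspend
 * is no longer in progress.
 */
static int sketch_push_back(struct sketch_md *md, struct sketch_bio *bio)
{
	int r = -EIO;

	sketch_lock(md);
	if (atomic_load(&md->noflush_suspending)) {
		bio->next = md->pushback;
		md->pushback = bio;
		r = 0;
	}
	sketch_unlock(md);

	if (r)
		bio->error = r;
	return r;
}

/*
 * Abort path: clear the flag and take the whole list under the lock, so
 * no bio can be queued after the list has been drained.
 */
static struct sketch_bio *sketch_abort_suspend(struct sketch_md *md)
{
	struct sketch_bio *list;

	sketch_lock(md);
	atomic_store(&md->noflush_suspending, false);
	list = md->pushback;
	md->pushback = NULL;
	sketch_unlock(md);
	return list;
}
```

A caller that loses the race with the abort path gets -EIO back instead of leaving its bio on a list that nobody will drain again.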
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock, which protects md->deferred, is
an rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using the multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of the new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 13:41:09 +03:00
/*
* DMF_NOFLUSH_SUSPENDING must be set before presuspend.
* This flag is cleared before dm_suspend returns.
*/
if (noflush)
set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
2017-04-27 20:11:26 +03:00
else
2020-05-14 19:55:39 +03:00
DMDEBUG("%s: suspending with flush", dm_device_name(md));
2006-12-08 13:41:09 +03:00
2014-10-29 03:13:31 +03:00
/*
* This gets reverted if there's an error later and the targets
* provide the .presuspend_undo hook.
*/
2005-07-29 08:15:57 +04:00
dm_table_presuspend_targets(map);
2009-06-22 13:12:17 +04:00
/*
2009-12-11 02:52:16 +03:00
* Flush I/O to the device.
* Any I/O submitted after lock_fs() may not be flushed.
* noflush takes precedence over do_lockfs.
* (lock_fs() flushes I/Os and waits for them to complete.)
2009-06-22 13:12:17 +04:00
*/
if (!noflush && do_lockfs) {
r = lock_fs(md);
2014-10-29 03:13:31 +03:00
if (r) {
dm_table_presuspend_undo_targets(map);
2014-10-29 01:34:52 +03:00
return r;
2014-10-29 03:13:31 +03:00
}
2006-01-06 11:20:06 +03:00
}
2005-04-17 02:20:36 +04:00
/*
2009-04-09 03:27:15 +04:00
* Here we must make sure that no processes are submitting requests
* to target drivers i.e. no one may be executing
2022-02-18 07:40:07 +03:00
* dm_split_and_process_bio from dm_submit_bio.
2009-04-09 03:27:15 +04:00
*
2022-02-18 07:40:07 +03:00
* To get all processes out of dm_split_and_process_bio in dm_submit_bio,
2009-04-09 03:27:15 +04:00
* we take the write lock. To prevent any process from reentering
2022-02-18 07:40:07 +03:00
* dm_split_and_process_bio from dm_submit_bio and quiesce the thread
2020-09-30 22:12:04 +03:00
* (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
2010-09-08 20:07:00 +04:00
* flush_workqueue(md->wq).
2005-04-17 02:20:36 +04:00
*/
2009-04-09 03:27:14 +04:00
set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
2014-11-05 16:35:50 +03:00
if (map)
synchronize_srcu(&md->io_barrier);
2005-04-17 02:20:36 +04:00
dm: add request based barrier support
This patch adds barrier support for request-based dm.
CORE DESIGN
The design is basically the same as bio-based dm, which emulates a barrier
by mapping empty barrier bios before/after a barrier I/O.
But request-based dm already uses struct request_queue for I/O
queueing, so the block layer's barrier mechanism can be used.
o Summary of the block layer's behavior (on which dm-core depends)
Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for
I/O barrier. It means that when an I/O requiring barrier is found
in the request_queue, the block-layer makes pre-flush request and
post-flush request just before and just after the I/O respectively.
After the ordered sequence starts, the block-layer waits for all
in-flight I/Os to complete, then gives drivers the pre-flush request,
the barrier I/O and the post-flush request one by one.
It means that the request_queue is stopped automatically by
the block-layer until drivers complete each sequence.
o dm-core
For the barrier I/O, dm-core treats it as a normal I/O, so no additional
code is needed.
For the pre/post-flush request, dm-core flushes caches as follows:
1. Make the number of empty barrier requests required by the target's
num_flush_requests, and map them (dm_rq_barrier()).
2. Wait for the mapped barriers to complete (dm_rq_barrier()).
If an error has occurred, save the error value to md->barrier_error
(dm_end_request()).
(*) Basically, the first reported error is taken.
But -EOPNOTSUPP supersedes any error, and DM_ENDIO_REQUEUE
supersedes any error except -EOPNOTSUPP.
3. Requeue the pre/post-flush request if the error value is
DM_ENDIO_REQUEUE. Otherwise, complete it with the error value
(dm_rq_barrier_work()).
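The error precedence in step 2 can be written as a small pure function. The following is a sketch, not the dm-core code: sketch_merge_barrier_error() and SKETCH_ENDIO_REQUEUE are hypothetical names, and the value 2 for the requeue code is an assumption made only so that it is positive and distinct from negative errno values. It assumes the reading that -EOPNOTSUPP wins over everything, a requeue wins over any other error, and otherwise the first error is kept.

```c
#include <assert.h>
#include <errno.h>

/* Assumed stand-in for DM_ENDIO_REQUEUE; the value is illustrative. */
#define SKETCH_ENDIO_REQUEUE 2

/*
 * Combine the error saved so far with a newly reported one.
 *   - no new error: keep what is saved
 *   - nothing saved yet: take the new error (first error wins)
 *   - -EOPNOTSUPP supersedes anything
 *   - a requeue supersedes anything except -EOPNOTSUPP
 */
static int sketch_merge_barrier_error(int saved, int error)
{
	if (!error)
		return saved;
	if (!saved || error == -EOPNOTSUPP ||
	    (saved != -EOPNOTSUPP && error == SKETCH_ENDIO_REQUEUE))
		return error;
	return saved;
}
```

Keeping the rule in one function means each completing barrier clone can report its result in any order and the final saved value is the same.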
The pre/post-flush work above is done in the kernel thread (kdmflush)
context, since memory allocation which might sleep is needed in
dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is
an irq-disabled context.
Also, clones of the pre/post-flush request share an original, so
such clones can't be completed using the softirq context.
Instead, complete them in the context of underlying device drivers.
It should be safe since there is no I/O dispatching during
the completion of such clones.
For suspend, the workqueue of kdmflush needs to be flushed after
the request_queue has been stopped. Otherwise, the next flush work
can be kicked even after the suspend completes.
TARGET INTERFACE
No new interface is added.
Just use the existing num_flush_requests in struct target_type,
in the same way as bio-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11 02:52:18 +03:00
/*
2010-09-08 20:07:00 +04:00
* Stop md->queue before flushing md->wq in case request-based
* dm defers requests to md->wq from md->queue.
2009-12-11 02:52:18 +03:00
*/
2018-10-11 05:49:26 +03:00
if (dm_request_based(md))
2016-02-20 21:45:38 +03:00
dm_stop_queue(md->queue);
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_rq_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks the busy state of underlying devices using the
target's busy() function and, if they are busy, stops dispatching requests
so that they stay on the dm device's queue.
This improves I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but by then all bios
in the request will have been completed, even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
In successful cases, end_clone_bio() completes the part of the original
request that corresponds to the original bio.
Even if that completes every bio in the original request, the original
request itself must not be completed yet, to preserve the ordering of
request completion across the stack.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and hands error handling over to
end_clone_request().
end_clone_request(), which is called with the queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which does not hold the queue lock. This avoids a
deadlock if another request is submitted during the completion:
- The submitted request may be mapped to the same device
- Request submission requires the queue lock, but the completing
context already holds it and the submitter cannot know that
The clone request has no clone bio when dm_softirq_done() is called,
so target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
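The difference between the two suspend modes can be sketched with a toy queue in C. This is a simplified model under stated assumptions, not the block-layer mechanism: the sim_* names and SIM_MARKER are hypothetical, request ids are assumed to be non-negative, and dispatching a request is treated as completing it immediately, so the separate wait for in-flight requests collapses into the dispatch loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_QUEUE_CAP 16
#define SIM_MARKER    (-1)	/* hypothetical marker id; real ids are >= 0 */

struct sim_queue {
	int reqs[SIM_QUEUE_CAP];	/* pending ids; front is reqs[head] */
	size_t head, tail;
	bool stopped;
};

/* Append a request id at the tail; returns false if the queue is full. */
static bool sim_enqueue(struct sim_queue *q, int id)
{
	if (q->tail == SIM_QUEUE_CAP)
		return false;
	q->reqs[q->tail++] = id;
	return true;
}

/* noflush suspend: stop the queue and leave pending requests in place. */
static void sim_noflush_suspend(struct sim_queue *q)
{
	q->stopped = true;
}

/*
 * flush suspend: add a marker at the tail, dispatch everything ahead of
 * it, then complete the marker and stop the queue. Returns the number of
 * requests dispatched, or -1 if there was no room for the marker.
 */
static int sim_flush_suspend(struct sim_queue *q)
{
	int dispatched = 0;

	if (!sim_enqueue(q, SIM_MARKER))
		return -1;

	/* The marker is guaranteed to be in the queue, so this terminates. */
	while (q->reqs[q->head] != SIM_MARKER) {
		q->head++;	/* dispatch (and, in this model, complete) */
		dispatched++;
	}

	q->head++;		/* complete the marker itself */
	q->stopped = true;
	return dispatched;
}
```

After a flush suspend the queue is empty and stopped, while after a noflush suspend it is stopped with its requests still pending.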
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:35 +04:00
2009-12-11 02:52:18 +03:00
flush_workqueue(md->wq);
2005-04-17 02:20:36 +04:00
/*
2009-04-09 03:27:15 +04:00
* At this point no more requests are entering target request routines.
* We call dm_wait_for_completion to wait for all existing requests
* to finish.
2005-04-17 02:20:36 +04:00
*/
2016-09-01 01:16:02 +03:00
r = dm_wait_for_completion(md, task_state);
2016-08-02 20:07:20 +03:00
if (!r)
set_bit(dmf_suspended_flag, &md->flags);
2005-04-17 02:20:36 +04:00
2008-02-08 05:10:22 +03:00
if (noflush)
2009-04-02 22:55:39 +04:00
clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
2014-11-05 16:35:50 +03:00
if (map)
synchronize_srcu(&md->io_barrier);
2006-12-08 13:41:09 +03:00
2005-04-17 02:20:36 +04:00
/* were we interrupted ? */
2008-02-08 05:10:30 +03:00
if ( r < 0 ) {
2009-04-02 22:55:36 +04:00
dm_queue_flush ( md ) ;
2008-02-08 05:10:25 +03:00
dm: prepare for request based option
This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
prep_rq_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but by then all bios
in the request will have been completed, even in error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()
Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request
end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().
end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires the queue lock, but the completing context
already holds it, and the submitter has no way of knowing that
The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap
the original request in that case.
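The completion rules above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the names model_end_clone_bio(), model_softirq_done(), struct clone_result and enum orig_action are hypothetical and do not exist in the kernel, and the real hooks operate on struct bio and struct request rather than on counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes for this sketch; not kernel constants. */
enum orig_action { ORIG_COMPLETE, ORIG_REQUEUE };

struct clone_result {
	unsigned bios_total;     /* clone bios seen by the bio hook */
	unsigned bios_completed; /* original bios completed by the bio hook */
	bool error;              /* true once any clone bio reported an error */
};

/*
 * Models the bio hook: the clone bio is always freed, but the original
 * bio is only completed when the clone succeeded.
 */
static void model_end_clone_bio(struct clone_result *res, int error)
{
	assert(res != NULL);
	res->bios_total++;
	if (error)
		res->error = true;     /* leave error handling to the request hook */
	else
		res->bios_completed++; /* success: complete the original bio */
}

/*
 * Models the decision taken later in softirq context: complete the
 * original request on success, requeue it on error.
 */
static enum orig_action model_softirq_done(const struct clone_result *res)
{
	assert(res != NULL);
	return res->error ? ORIG_REQUEUE : ORIG_COMPLETE;
}
```

In this model an error on any clone bio leaves that original bio uncompleted, so the submitter never sees the error and the whole original request can be requeued and remapped.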
suspend
=======
Request-based dm suspends the md by stopping md->queue.
For noflush suspend, it just stops md->queue.
For flush suspend, it inserts a marker request at the tail of md->queue
and dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then it stops dispatching requests and waits
for all the dispatched requests to complete.
After that, it completes the marker request, stops md->queue and
wakes up the waiter on the suspend queue, md->wait.
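The difference between the two suspend modes can be illustrated with a toy queue. This is a sketch, not the kernel implementation: struct model_queue, MARKER and the mq_* and model_* helpers are hypothetical names, and the step of waiting for dispatched requests to complete is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_CAP 16
#define MARKER (-1) /* hypothetical sentinel standing in for the marker request */

struct model_queue {
	int items[MODEL_QUEUE_CAP]; /* request ids; MARKER marks the flush point */
	size_t head, tail;
	bool stopped;
};

static void mq_push(struct model_queue *q, int id)
{
	assert(q->tail < MODEL_QUEUE_CAP);
	q->items[q->tail++] = id;
}

static size_t mq_pending(const struct model_queue *q)
{
	return q->tail - q->head;
}

/*
 * Flush suspend: append the marker, dispatch everything queued ahead of
 * it, complete the marker, then stop the queue.
 * Returns the number of requests dispatched.
 */
static size_t model_flush_suspend(struct model_queue *q)
{
	size_t dispatched = 0;

	mq_push(q, MARKER);
	while (q->items[q->head] != MARKER) {
		q->head++; /* dispatch to the underlying device */
		dispatched++;
	}
	q->head++; /* complete the marker request */
	q->stopped = true;
	return dispatched;
}

/* Noflush suspend: just stop the queue; queued requests stay where they are. */
static void model_noflush_suspend(struct model_queue *q)
{
	q->stopped = true;
}
```

After a flush suspend the model queue is empty, while after a noflush suspend the queued requests are still pending and would be dispatched again once the queue is restarted at resume.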
resume
======
Starts md->queue.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:35 +04:00
if ( dm_request_based ( md ) )
2016-02-20 21:45:38 +03:00
dm_start_queue ( md - > queue ) ;
2009-06-22 13:12:35 +04:00
2005-07-29 08:16:00 +04:00
unlock_fs ( md ) ;
2014-10-29 03:13:31 +03:00
dm_table_presuspend_undo_targets ( map ) ;
2014-10-29 01:34:52 +03:00
/* pushback list is already flushed, so skip flush */
2005-07-29 08:16:00 +04:00
}
2005-04-17 02:20:36 +04:00
2014-10-29 01:34:52 +03:00
return r ;
}
/*
 * We need to be able to change a mapping table under a mounted
 * filesystem.  For example we might want to move some data in
 * the background.  Before the table can be swapped with
 * dm_bind_table, dm_suspend must be called to flush any in
 * flight bios and ensure that any further io gets deferred.
 */
/*
 * Suspend mechanism in request-based dm.
 *
 * 1. Flush all I/Os by lock_fs() if needed.
 * 2. Stop dispatching any I/O by stopping the request_queue.
 * 3. Wait for all in-flight I/Os to be completed or requeued.
 *
 * To abort suspend, start the request_queue.
 */
int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
{
	struct dm_table *map = NULL;
	int r = 0;
retry:
	mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
	if (dm_suspended_md(md)) {
		r = -EINVAL;
		goto out_unlock;
	}
	if (dm_suspended_internally_md(md)) {
		/* already internally suspended, wait for internal resume */
		mutex_unlock(&md->suspend_lock);
		r = wait_on_bit(&md->flags, DMF_SUSPENDED_INTERNALLY, TASK_INTERRUPTIBLE);
		if (r)
			return r;
		goto retry;
	}
2014-11-23 20:34:29 +03:00
	map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
2014-10-29 01:34:52 +03:00
2016-08-02 20:07:20 +03:00
	r = __dm_suspend(md, map, suspend_flags, TASK_INTERRUPTIBLE, DMF_SUSPENDED);
2014-10-29 01:34:52 +03:00
	if (r)
		goto out_unlock;
2009-04-09 03:27:15 +04:00
2020-07-23 17:42:09 +03:00
	set_bit(DMF_POST_SUSPENDING, &md->flags);
2009-12-11 02:52:26 +03:00
	dm_table_postsuspend_targets(map);
2020-07-23 17:42:09 +03:00
	clear_bit(DMF_POST_SUSPENDING, &md->flags);
2009-12-11 02:52:26 +03:00
2006-11-09 04:44:43 +03:00
out_unlock:
2008-02-08 05:10:08 +03:00
	mutex_unlock(&md->suspend_lock);
2005-07-29 08:15:57 +04:00
	return r;
2005-04-17 02:20:36 +04:00
}
2014-10-29 01:34:52 +03:00
static int __dm_resume(struct mapped_device *md, struct dm_table *map)
{
	if (map) {
		int r = dm_table_resume_targets(map);
		if (r)
			return r;
	}
	dm_queue_flush(md);
	/*
	 * Flushing deferred I/Os must be done after targets are resumed
	 * so that mapping of targets can work correctly.
	 * Request-based dm is queueing the deferred I/Os in its request_queue.
	 */
	if (dm_request_based(md))
2016-02-20 21:45:38 +03:00
		dm_start_queue(md->queue);
2014-10-29 01:34:52 +03:00
	unlock_fs(md);
	return 0;
}
2005-04-17 02:20:36 +04:00
int dm_resume ( struct mapped_device * md )
{
2016-09-06 11:00:29 +03:00
int r ;
2005-07-29 08:15:57 +04:00
struct dm_table * map = NULL ;
2005-04-17 02:20:36 +04:00
2014-10-29 01:34:52 +03:00
retry :
2016-09-06 11:00:29 +03:00
r = - EINVAL ;
2014-10-29 01:34:52 +03:00
mutex_lock_nested ( & md - > suspend_lock , SINGLE_DEPTH_NESTING ) ;
2009-12-11 02:52:26 +03:00
if ( ! dm_suspended_md ( md ) )
2005-07-29 08:15:57 +04:00
goto out ;
2014-10-29 01:34:52 +03:00
if ( dm_suspended_internally_md ( md ) ) {
/* already internally suspended, wait for internal resume */
mutex_unlock ( & md - > suspend_lock ) ;
r = wait_on_bit ( & md - > flags , DMF_SUSPENDED_INTERNALLY , TASK_INTERRUPTIBLE ) ;
if ( r )
return r ;
goto retry ;
}
2014-11-23 20:34:29 +03:00
map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2005-07-29 08:16:00 +04:00
if ( ! map | | ! dm_table_get_size ( map ) )
2005-07-29 08:15:57 +04:00
goto out ;
2005-04-17 02:20:36 +04:00
2014-10-29 01:34:52 +03:00
r = __dm_resume ( md , map ) ;
2006-10-03 12:15:36 +04:00
if ( r )
goto out ;
2005-07-29 08:16:00 +04:00
clear_bit ( DMF_SUSPENDED , & md - > flags ) ;
2005-07-29 08:15:57 +04:00
out :
2008-02-08 05:10:08 +03:00
mutex_unlock ( & md - > suspend_lock ) ;
2005-07-29 08:16:00 +04:00
2005-07-29 08:15:57 +04:00
return r ;
2005-04-17 02:20:36 +04:00
}
2013-08-16 18:54:23 +04:00
/*
* Internal suspend / resume works like userspace - driven suspend . It waits
* until all bios finish and prevents issuing new bios to the target drivers .
* It may be used only from the kernel .
*/
2014-10-29 01:34:52 +03:00
static void __dm_internal_suspend ( struct mapped_device * md , unsigned suspend_flags )
2013-08-16 18:54:23 +04:00
{
2014-10-29 01:34:52 +03:00
struct dm_table * map = NULL ;
2017-04-27 20:11:21 +03:00
lockdep_assert_held ( & md - > suspend_lock ) ;
2015-01-09 02:52:26 +03:00
if ( md - > internal_suspend_count + + )
2014-10-29 01:34:52 +03:00
return ; /* nested internal suspend */
if ( dm_suspended_md ( md ) ) {
set_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
return ; /* nest suspend */
}
2014-11-23 20:34:29 +03:00
map = rcu_dereference_protected ( md - > map , lockdep_is_held ( & md - > suspend_lock ) ) ;
2014-10-29 01:34:52 +03:00
/*
* Using TASK_UNINTERRUPTIBLE because only NOFLUSH internal suspend is
* supported . Properly supporting a TASK_INTERRUPTIBLE internal suspend
* would require changing . presuspend to return an error - - avoid this
* until there is a need for more elaborate variants of internal suspend .
*/
2016-08-02 20:07:20 +03:00
( void ) __dm_suspend ( md , map , suspend_flags , TASK_UNINTERRUPTIBLE ,
DMF_SUSPENDED_INTERNALLY ) ;
2014-10-29 01:34:52 +03:00
2020-07-23 17:42:09 +03:00
set_bit ( DMF_POST_SUSPENDING , & md - > flags ) ;
2014-10-29 01:34:52 +03:00
dm_table_postsuspend_targets ( map ) ;
2020-07-23 17:42:09 +03:00
clear_bit ( DMF_POST_SUSPENDING , & md - > flags ) ;
2014-10-29 01:34:52 +03:00
}
static void __dm_internal_resume ( struct mapped_device * md )
{
2015-01-09 02:52:26 +03:00
BUG_ON ( ! md - > internal_suspend_count ) ;
if ( - - md - > internal_suspend_count )
2014-10-29 01:34:52 +03:00
return ; /* resume from nested internal suspend */
2013-08-16 18:54:23 +04:00
if ( dm_suspended_md ( md ) )
2014-10-29 01:34:52 +03:00
goto done ; /* resume from nested suspend */
/*
* NOTE : existing callers don ' t need to call dm_table_resume_targets
* ( which may fail - - so best to avoid it for now by passing NULL map )
*/
( void ) __dm_resume ( md , NULL ) ;
done :
clear_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
smp_mb__after_atomic ( ) ;
wake_up_bit ( & md - > flags , DMF_SUSPENDED_INTERNALLY ) ;
}
void dm_internal_suspend_noflush ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
__dm_internal_suspend ( md , DM_SUSPEND_NOFLUSH_FLAG ) ;
mutex_unlock ( & md - > suspend_lock ) ;
}
EXPORT_SYMBOL_GPL ( dm_internal_suspend_noflush ) ;
void dm_internal_resume ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
__dm_internal_resume ( md ) ;
mutex_unlock ( & md - > suspend_lock ) ;
}
EXPORT_SYMBOL_GPL ( dm_internal_resume ) ;
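The nesting rules of internal suspend and resume can be modelled with a counter and two booleans. The sketch below is a user-space illustration only: struct model_md and both model_* functions are hypothetical, the booleans stand in for the DMF_SUSPENDED and DMF_SUSPENDED_INTERNALLY bits, and it assumes that the real suspend path sets the internal flag itself when it runs.

```c
#include <assert.h>
#include <stdbool.h>

struct model_md {
	unsigned internal_suspend_count;
	bool suspended;            /* stands in for DMF_SUSPENDED */
	bool suspended_internally; /* stands in for DMF_SUSPENDED_INTERNALLY */
	unsigned real_suspends;    /* times the underlying suspend actually ran */
	unsigned real_resumes;     /* times the underlying resume actually ran */
};

static void model_internal_suspend(struct model_md *md)
{
	if (md->internal_suspend_count++)
		return; /* nested internal suspend: nothing more to do */
	if (md->suspended) {
		md->suspended_internally = true;
		return; /* nested inside a userspace-driven suspend */
	}
	md->real_suspends++;             /* the real suspend would run here */
	md->suspended_internally = true; /* assumption: set by the suspend path */
}

static void model_internal_resume(struct model_md *md)
{
	assert(md->internal_suspend_count > 0);
	if (--md->internal_suspend_count)
		return; /* still nested: keep the device suspended */
	if (!md->suspended)
		md->real_resumes++; /* the real resume would run here */
	md->suspended_internally = false;
}
```

Only the outermost suspend and resume pair does real work in this model; inner pairs just adjust the counter, which is why unbalanced calls are treated as a bug.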
/*
* Fast variants of internal suspend / resume hold md - > suspend_lock ,
* which prevents interaction with userspace - driven suspend .
*/
void dm_internal_suspend_fast ( struct mapped_device * md )
{
mutex_lock ( & md - > suspend_lock ) ;
if ( dm_suspended_md ( md ) | | dm_suspended_internally_md ( md ) )
2013-08-16 18:54:23 +04:00
return ;
set_bit ( DMF_BLOCK_IO_FOR_SUSPEND , & md - > flags ) ;
synchronize_srcu ( & md - > io_barrier ) ;
flush_workqueue ( md - > wq ) ;
dm_wait_for_completion ( md , TASK_UNINTERRUPTIBLE ) ;
}
2015-02-26 19:40:35 +03:00
EXPORT_SYMBOL_GPL ( dm_internal_suspend_fast ) ;
2013-08-16 18:54:23 +04:00
2014-10-29 01:34:52 +03:00
void dm_internal_resume_fast ( struct mapped_device * md )
2013-08-16 18:54:23 +04:00
{
2014-10-29 01:34:52 +03:00
if ( dm_suspended_md ( md ) | | dm_suspended_internally_md ( md ) )
2013-08-16 18:54:23 +04:00
goto done ;
dm_queue_flush ( md ) ;
done :
mutex_unlock ( & md - > suspend_lock ) ;
}
2015-02-26 19:40:35 +03:00
EXPORT_SYMBOL_GPL ( dm_internal_resume_fast ) ;
2013-08-16 18:54:23 +04:00
2005-04-17 02:20:36 +04:00
/*-----------------------------------------------------------------
 * Event notification.
 *---------------------------------------------------------------*/
2010-03-06 05:32:31 +03:00
int dm_kobject_uevent ( struct mapped_device * md , enum kobject_action action ,
2009-06-22 13:12:30 +04:00
unsigned cookie )
2007-12-13 17:15:57 +03:00
{
2020-07-08 19:25:20 +03:00
int r ;
unsigned noio_flag ;
2009-06-22 13:12:30 +04:00
char udev_cookie [ DM_COOKIE_LENGTH ] ;
char * envp [ ] = { udev_cookie , NULL } ;
2020-07-08 19:25:20 +03:00
noio_flag = memalloc_noio_save ( ) ;
2009-06-22 13:12:30 +04:00
if ( ! cookie )
2020-07-08 19:25:20 +03:00
r = kobject_uevent ( & disk_to_dev ( md - > disk ) - > kobj , action ) ;
2009-06-22 13:12:30 +04:00
else {
		snprintf(udev_cookie, DM_COOKIE_LENGTH, "%s=%u",
			 DM_COOKIE_ENV_VAR_NAME, cookie);
2020-07-08 19:25:20 +03:00
r = kobject_uevent_env ( & disk_to_dev ( md - > disk ) - > kobj ,
action , envp ) ;
2009-06-22 13:12:30 +04:00
}
2020-07-08 19:25:20 +03:00
memalloc_noio_restore ( noio_flag ) ;
return r ;
2007-12-13 17:15:57 +03:00
}
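The cookie handling in dm_kobject_uevent() amounts to formatting a NAME=value string into a fixed-size buffer that is then passed as the uevent environment. A minimal user-space sketch follows; the macro values "DM_COOKIE" and 24 are assumptions made for illustration, and model_format_cookie() is a hypothetical helper that, unlike the kernel code above, also reports truncation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed values for illustration; check the kernel headers for the real ones. */
#define MODEL_COOKIE_ENV_VAR_NAME "DM_COOKIE"
#define MODEL_COOKIE_LENGTH 24

/*
 * Formats "NAME=value" into buf.
 * Returns 0 on success, or -1 if the output did not fit.
 */
static int model_format_cookie(char *buf, size_t len, unsigned cookie)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "%s=%u", MODEL_COOKIE_ENV_VAR_NAME, cookie);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With these assumed values the longest possible string, the name plus a ten-digit unsigned value and the terminator, needs 21 bytes, so a 24-byte buffer is sufficient.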
2007-10-20 01:48:01 +04:00
uint32_t dm_next_uevent_seq ( struct mapped_device * md )
{
return atomic_add_return ( 1 , & md - > uevent_seq ) ;
}
2005-04-17 02:20:36 +04:00
uint32_t dm_get_event_nr ( struct mapped_device * md )
{
return atomic_read ( & md - > event_nr ) ;
}
int dm_wait_event ( struct mapped_device * md , int event_nr )
{
return wait_event_interruptible ( md - > eventq ,
( event_nr ! = atomic_read ( & md - > event_nr ) ) ) ;
}
2007-10-20 01:48:01 +04:00
void dm_uevent_add ( struct mapped_device * md , struct list_head * elist )
{
unsigned long flags ;
spin_lock_irqsave ( & md - > uevent_lock , flags ) ;
list_add ( elist , & md - > uevent_list ) ;
spin_unlock_irqrestore ( & md - > uevent_lock , flags ) ;
}
2005-04-17 02:20:36 +04:00
/*
* The gendisk is only valid as long as you have a reference
* count on ' md ' .
*/
struct gendisk * dm_disk ( struct mapped_device * md )
{
return md - > disk ;
}
2015-03-18 18:52:14 +03:00
EXPORT_SYMBOL_GPL ( dm_disk ) ;
2005-04-17 02:20:36 +04:00
2009-01-06 06:05:12 +03:00
struct kobject * dm_kobject ( struct mapped_device * md )
{
2014-01-14 04:37:54 +04:00
return & md - > kobj_holder . kobj ;
2009-01-06 06:05:12 +03:00
}
struct mapped_device * dm_get_from_kobject ( struct kobject * kobj )
{
struct mapped_device * md ;
2014-01-14 04:37:54 +04:00
md = container_of ( kobj , struct mapped_device , kobj_holder . kobj ) ;
2009-01-06 06:05:12 +03:00
2017-11-01 10:42:36 +03:00
spin_lock ( & _minor_lock ) ;
if ( test_bit ( DMF_FREEING , & md - > flags ) | | dm_deleting_md ( md ) ) {
md = NULL ;
goto out ;
}
2009-01-06 06:05:12 +03:00
dm_get ( md ) ;
2017-11-01 10:42:36 +03:00
out :
spin_unlock ( & _minor_lock ) ;
2009-01-06 06:05:12 +03:00
return md ;
}
2009-12-11 02:52:26 +03:00
int dm_suspended_md ( struct mapped_device * md )
2005-04-17 02:20:36 +04:00
{
return test_bit ( DMF_SUSPENDED , & md - > flags ) ;
}
2020-07-23 17:42:09 +03:00
static int dm_post_suspending_md ( struct mapped_device * md )
{
return test_bit ( DMF_POST_SUSPENDING , & md - > flags ) ;
}
2014-10-29 01:34:52 +03:00
int dm_suspended_internally_md ( struct mapped_device * md )
{
return test_bit ( DMF_SUSPENDED_INTERNALLY , & md - > flags ) ;
}
2013-11-02 02:27:41 +04:00
int dm_test_deferred_remove_flag ( struct mapped_device * md )
{
return test_bit ( DMF_DEFERRED_REMOVE , & md - > flags ) ;
}
2009-12-11 02:52:27 +03:00
int dm_suspended ( struct dm_target * ti )
{
2020-09-19 20:09:11 +03:00
return dm_suspended_md ( ti - > table - > md ) ;
2009-12-11 02:52:27 +03:00
}
EXPORT_SYMBOL_GPL ( dm_suspended ) ;
2020-07-23 17:42:09 +03:00
int dm_post_suspending ( struct dm_target * ti )
{
2020-09-19 20:09:11 +03:00
return dm_post_suspending_md ( ti - > table - > md ) ;
2020-07-23 17:42:09 +03:00
}
EXPORT_SYMBOL_GPL ( dm_post_suspending ) ;
[PATCH] dm: suspend: add noflush pushback
In device-mapper, I/O is sometimes queued within targets for later processing.
For example, the multipath target can be configured to store I/O when no paths
are available instead of returning it with -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
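How a target might choose between these return values can be sketched as follows. This is a hedged illustration loosely based on the multipath behaviour described in the test results of this patch, not the actual multipath code: enum model_mapio, struct model_target and model_map() are hypothetical, and the enum values are not the kernel's constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this sketch; the kernel's values differ. */
enum model_mapio {
	MODEL_MAPIO_REMAPPED,  /* bio was mapped to a valid path */
	MODEL_MAPIO_SUBMITTED, /* target keeps the bio on its own queue */
	MODEL_MAPIO_REQUEUE,   /* ask the core to take the bio back (pushback) */
	MODEL_MAPIO_ERROR      /* fail the bio */
};

struct model_target {
	unsigned valid_paths;
	bool queue_if_no_path;
	bool noflush_suspending; /* what dm_noflush_suspending() would report */
};

static enum model_mapio model_map(const struct model_target *t)
{
	assert(t != NULL);
	if (t->valid_paths > 0)
		return MODEL_MAPIO_REMAPPED;
	if (!t->queue_if_no_path)
		return MODEL_MAPIO_ERROR;   /* no paths and queueing is disabled */
	if (t->noflush_suspending)
		return MODEL_MAPIO_REQUEUE; /* push the bio back to the core */
	return MODEL_MAPIO_SUBMITTED;       /* hold the bio inside the target */
}
```

The requeue result is only chosen while a noflush suspend is in progress, which matches the idea that the core collects such bios and resends them after resume.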
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supersedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
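The bookkeeping described for dec_pending() can be modelled in a few lines. The sketch below is an assumption-laden illustration: MODEL_REQUEUE, struct model_io, enum model_outcome and model_dec_pending() are hypothetical names, and the flag is passed in as a plain parameter instead of being read from md->flags under a lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_REQUEUE 1 /* hypothetical marker stored in the error field */

struct model_io {
	int io_count; /* outstanding split clones */
	int error;    /* 0, a negative errno, or MODEL_REQUEUE */
};

enum model_outcome { IO_PENDING, IO_PUSHED_BACK, IO_ENDED };

/*
 * Called once per finished clone. On IO_ENDED, *result receives the
 * status that would be returned to the submitter.
 */
static enum model_outcome model_dec_pending(struct model_io *io, int error,
					    bool noflush_suspending, int *result)
{
	assert(io != NULL && result != NULL && io->io_count > 0);

	if (error == MODEL_REQUEUE)
		io->error = MODEL_REQUEUE; /* pushback supersedes any I/O error */
	else if (error && io->error != MODEL_REQUEUE)
		io->error = error;

	if (--io->io_count)
		return IO_PENDING; /* other split clones are still in flight */

	if (io->error == MODEL_REQUEUE) {
		if (noflush_suspending)
			return IO_PUSHED_BACK; /* original bio goes to the pushback list */
		io->error = -EIO; /* the suspend was aborted in the meantime */
	}
	*result = io->error;
	return IO_ENDED;
}
```

A pushback request from any one clone wins over errors from the others, and the final decision is taken only when the last clone has ended.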
pushback list and pushback_lock
-------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
check, the bio isn't left in md->pushback but is returned to applications
with -EIO.
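The locking rule above, checking the flag and queueing the bio in one critical section, can be sketched in user space. This is an illustration under assumptions: a C11 atomic_flag spinlock stands in for md->pushback_lock, interrupt disabling is not modelled, and struct pushback_md and the pb_* helpers are hypothetical. The abort path is also assumed to clear the flag and drain the list under the same lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct pushback_md {
	atomic_flag lock;        /* stands in for md->pushback_lock */
	bool noflush_suspending; /* stands in for DMF_NOFLUSH_SUSPENDING */
	unsigned pushback_len;   /* number of bios on the pushback list */
};

static void pb_lock(struct pushback_md *md)
{
	while (atomic_flag_test_and_set_explicit(&md->lock, memory_order_acquire))
		; /* spin until the lock is free */
}

static void pb_unlock(struct pushback_md *md)
{
	atomic_flag_clear_explicit(&md->lock, memory_order_release);
}

/* Check the flag and queue the bio in one critical section. */
static int pb_queue_or_fail(struct pushback_md *md)
{
	int r;

	assert(md != NULL);
	pb_lock(md);
	if (md->noflush_suspending) {
		md->pushback_len++; /* bio is kept for re-issue at resume */
		r = 0;
	} else {
		r = -EIO; /* suspend was aborted: return the bio with an error */
	}
	pb_unlock(md);
	return r;
}

/* Abort path: clear the flag and drain the list under the same lock. */
static unsigned pb_abort_suspend(struct pushback_md *md)
{
	unsigned drained;

	assert(md != NULL);
	pb_lock(md);
	md->noflush_suspending = false;
	drained = md->pushback_len;
	md->pushback_len = 0;
	pb_unlock(md);
	return drained;
}
```

Because both paths take the same lock, a bio can never be added to the list after the abort path has drained it, which is the stranded-bio case the text warns about.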
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 13:41:09 +03:00
int dm_noflush_suspending ( struct dm_target * ti )
{
2020-09-19 20:09:11 +03:00
return __noflush_suspending ( ti - > table - > md ) ;
[PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supercedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushdback list and pushback_lock
--------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checks, the bio isn't left in md->pushback but returned to applications
with -EIO.
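
The locking rule above can be illustrated with a small userspace sketch. It is not the kernel code: every `fake_*` name is hypothetical, a C11 `atomic_flag` spin loop stands in for md->pushback_lock (no irq disabling is modelled), and the abort path is assumed to clear the flag and drain the list under the same lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_EIO 5                    /* stand-in for the kernel's EIO */

struct fake_bio {                     /* hypothetical, not struct bio */
    struct fake_bio *next;
    int error;
};

struct fake_md {                      /* hypothetical, not struct mapped_device */
    atomic_flag pushback_lock;        /* plays the role of md->pushback_lock */
    bool noflush_suspending;          /* plays the role of DMF_NOFLUSH_SUSPENDING */
    struct fake_bio *pushback;        /* plays the role of md->pushback */
};

static void fake_md_init(struct fake_md *md)
{
    atomic_flag_clear(&md->pushback_lock);
    md->noflush_suspending = false;
    md->pushback = NULL;
}

static void fake_lock(struct fake_md *md)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&md->pushback_lock,
                                             memory_order_acquire))
        ;
}

static void fake_unlock(struct fake_md *md)
{
    atomic_flag_clear_explicit(&md->pushback_lock, memory_order_release);
}

/*
 * Completion side: the flag check and the queueing happen under one lock
 * hold, so a bio is never queued after the list has been drained.
 * Returns true if the bio was queued, false if it was failed with -EIO.
 */
static bool fake_requeue_or_fail(struct fake_md *md, struct fake_bio *bio)
{
    bool queued = false;

    fake_lock(md);
    if (md->noflush_suspending) {
        bio->next = md->pushback;
        md->pushback = bio;
        queued = true;
    } else {
        bio->error = -FAKE_EIO;
    }
    fake_unlock(md);
    return queued;
}

/* Abort side (assumed): clear the flag and drain the list atomically. */
static struct fake_bio *fake_abort_suspend(struct fake_md *md)
{
    struct fake_bio *drained;

    fake_lock(md);
    md->noflush_suspending = false;
    drained = md->pushback;
    md->pushback = NULL;
    fake_unlock(md);
    return drained;
}
```

In this sketch a bio that completes after fake_abort_suspend() has run sees the cleared flag under the lock and is failed rather than queued, which is the property the paragraph above argues for.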
Other notes on the current patch
--------------------------------
- md->pushback is added to struct mapped_device instead of using
md->deferred directly because md->io_lock, which protects md->deferred, is
an rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of the new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 13:41:09 +03:00
}
EXPORT_SYMBOL_GPL ( dm_noflush_suspending ) ;
2017-04-27 20:11:23 +03:00
struct dm_md_mempools * dm_alloc_md_mempools ( struct mapped_device * md , enum dm_queue_mode type ,
2022-03-24 21:36:47 +03:00
unsigned per_io_data_size , unsigned min_pool_size ,
bool integrity , bool poll )
dm: enable request based option
This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.
                     type of underlying device
                    bio-based    request-based
----------------------------------------------
bio-based              OK             OK
request-based          --             OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, the type can't be changed.
o Mempool allocations are deferred until table loading time, since
mempools for request-based dm are different from those for bio-based
dm and the needed mempool type is fixed by the type of table.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
requires all target drivers to support the merge() function.
Both will take time.
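
The stacking table in the first item can be restated as a tiny predicate. This is only an illustration of the matrix, with rows read as the type of the upper dm device and columns as the type of the underlying device; the enum and function names are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical device types; not the kernel's dm_queue_mode values. */
enum fake_dev_type {
    FAKE_BIO_BASED,
    FAKE_REQUEST_BASED
};

/*
 * Returns true when a dm device of type `upper` may sit on top of an
 * underlying device of type `lower`. A bio-based device cannot receive
 * a struct request, so request-based on bio-based is the only
 * combination the table rules out.
 */
static bool fake_can_stack(enum fake_dev_type upper, enum fake_dev_type lower)
{
    if (upper == FAKE_REQUEST_BASED)
        return lower == FAKE_REQUEST_BASED;
    return true;
}
```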
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 13:12:36 +04:00
{
2016-02-22 20:16:21 +03:00
struct dm_md_mempools * pools = kzalloc_node ( sizeof ( * pools ) , GFP_KERNEL , md - > numa_node_id ) ;
2015-06-26 17:01:13 +03:00
unsigned int pool_size = 0 ;
2017-12-12 07:17:47 +03:00
unsigned int front_pad , io_front_pad ;
2018-05-21 01:25:53 +03:00
int ret ;
2009-06-22 13:12:36 +04:00
if ( ! pools )
2015-06-26 16:42:57 +03:00
return NULL ;
2009-06-22 13:12:36 +04:00
2015-06-26 17:01:13 +03:00
switch ( type ) {
case DM_TYPE_BIO_BASED :
2016-06-23 02:54:53 +03:00
case DM_TYPE_DAX_BIO_BASED :
2017-12-08 22:40:52 +03:00
pool_size = max ( dm_get_reserved_bio_based_ios ( ) , min_pool_size ) ;
2021-01-12 08:52:00 +03:00
front_pad = roundup ( per_io_data_size , __alignof__ ( struct dm_target_io ) ) + DM_TARGET_IO_BIO_OFFSET ;
io_front_pad = roundup ( per_io_data_size , __alignof__ ( struct dm_io ) ) + DM_IO_BIO_OFFSET ;
2022-03-24 21:36:47 +03:00
ret = bioset_init ( & pools - > io_bs , pool_size , io_front_pad , poll ? BIOSET_PERCPU_CACHE : 0 ) ;
2018-05-21 01:25:53 +03:00
if ( ret )
2017-12-12 07:17:47 +03:00
goto out ;
2018-05-21 01:25:53 +03:00
if ( integrity & & bioset_integrity_create ( & pools - > io_bs , pool_size ) )
2017-01-22 20:32:46 +03:00
goto out ;
2015-06-26 17:01:13 +03:00
break ;
case DM_TYPE_REQUEST_BASED :
2017-12-08 22:40:52 +03:00
pool_size = max ( dm_get_reserved_rq_based_ios ( ) , min_pool_size ) ;
2015-06-26 17:01:13 +03:00
front_pad = offsetof ( struct dm_rq_clone_bio_info , clone ) ;
2016-01-31 20:05:42 +03:00
/* per_io_data_size is used for blk-mq pdu at queue allocation */
2015-06-26 17:01:13 +03:00
break ;
default :
BUG ( ) ;
}
2018-05-21 01:25:53 +03:00
ret = bioset_init ( & pools - > bs , pool_size , front_pad , 0 ) ;
if ( ret )
2013-03-02 02:45:48 +04:00
goto out ;
2009-06-22 13:12:36 +04:00
2018-05-21 01:25:53 +03:00
if ( integrity & & bioset_integrity_create ( & pools - > bs , pool_size ) )
2013-03-02 02:45:48 +04:00
goto out ;
2011-03-17 13:11:05 +03:00
2009-06-22 13:12:36 +04:00
return pools ;
2015-05-22 16:14:04 +03:00
out :
dm_free_md_mempools ( pools ) ;
2015-06-26 17:01:13 +03:00
2015-06-26 16:42:57 +03:00
return NULL ;
2009-06-22 13:12:36 +04:00
}
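
The front_pad arithmetic in the bio-based case of dm_alloc_md_mempools() can be tried out in userspace with the sketch below. The structs are hypothetical stand-ins rather than the real dm_target_io layout, `_Alignof` replaces the GCC-specific `__alignof__`, and DM_TARGET_IO_BIO_OFFSET is assumed to be the offset of the embedded clone bio, by analogy with the offsetof() used in the request-based case.

```c
#include <stddef.h>

/* Round x up to the next multiple of align (align must be non-zero). */
static size_t fake_roundup(size_t x, size_t align)
{
    return ((x + align - 1) / align) * align;
}

/* Hypothetical stand-ins; the real structures are laid out differently. */
struct fake_bio {
    char payload[32];
};

struct fake_target_io {
    long magic;
    void *owner;
    struct fake_bio clone;            /* embedded bio, last member */
};

/*
 * Bytes reserved in front of each bio: the target's per-I/O data,
 * padded to the alignment of the bookkeeping struct, followed by the
 * part of that struct that precedes the embedded bio.
 */
static size_t fake_front_pad(size_t per_io_data_size)
{
    return fake_roundup(per_io_data_size, _Alignof(struct fake_target_io))
           + offsetof(struct fake_target_io, clone);
}
```

With a per_io_data_size of zero the pad reduces to the offset of the embedded bio; any non-zero size first grows to a multiple of the struct's alignment.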
void dm_free_md_mempools ( struct dm_md_mempools * pools )
{
if ( ! pools )
return ;
2018-05-21 01:25:53 +03:00
bioset_exit ( & pools - > bs ) ;
bioset_exit ( & pools - > io_bs ) ;
2009-06-22 13:12:36 +04:00
kfree ( pools ) ;
}
2016-07-08 15:23:51 +03:00
struct dm_pr {
u64 old_key ;
u64 new_key ;
u32 flags ;
bool fail_early ;
} ;
static int dm_call_pr ( struct block_device * bdev , iterate_devices_callout_fn fn ,
void * data )
2015-10-15 15:10:51 +03:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
2016-07-08 15:23:51 +03:00
struct dm_table * table ;
struct dm_target * ti ;
int ret = - ENOTTY , srcu_idx ;
2015-10-15 15:10:51 +03:00
2016-07-08 15:23:51 +03:00
table = dm_get_live_table ( md , & srcu_idx ) ;
if ( ! table | | ! dm_table_get_size ( table ) )
goto out ;
2015-10-15 15:10:51 +03:00
2016-07-08 15:23:51 +03:00
/* We only support devices that have a single target */
if ( dm_table_get_num_targets ( table ) ! = 1 )
goto out ;
ti = dm_table_get_target ( table , 0 ) ;
2015-10-15 15:10:51 +03:00
2016-07-08 15:23:51 +03:00
ret = - EINVAL ;
if ( ! ti - > type - > iterate_devices )
goto out ;
ret = ti - > type - > iterate_devices ( ti , fn , data ) ;
out :
dm_put_live_table ( md , srcu_idx ) ;
return ret ;
}
/*
* For register / unregister we need to manually call out to every path .
*/
static int __dm_pr_register ( struct dm_target * ti , struct dm_dev * dev ,
sector_t start , sector_t len , void * data )
{
struct dm_pr * pr = data ;
const struct pr_ops * ops = dev - > bdev - > bd_disk - > fops - > pr_ops ;
if ( ! ops | | ! ops - > pr_register )
return - EOPNOTSUPP ;
return ops - > pr_register ( dev - > bdev , pr - > old_key , pr - > new_key , pr - > flags ) ;
}
static int dm_pr_register ( struct block_device * bdev , u64 old_key , u64 new_key ,
u32 flags )
{
struct dm_pr pr = {
. old_key = old_key ,
. new_key = new_key ,
. flags = flags ,
. fail_early = true ,
} ;
int ret ;
ret = dm_call_pr ( bdev , __dm_pr_register , & pr ) ;
if ( ret & & new_key ) {
/* unregister all paths if we failed to register any path */
pr . old_key = new_key ;
pr . new_key = 0 ;
pr . flags = 0 ;
pr . fail_early = false ;
dm_call_pr ( bdev , __dm_pr_register , & pr ) ;
}
return ret ;
2015-10-15 15:10:51 +03:00
}
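
The register-then-roll-back shape of dm_pr_register() can be sketched in userspace as follows. All `fake_*` names are hypothetical, the paths are a plain array instead of targets walked by iterate_devices, and the way fail_early is interpreted here (stop at the first failing path) is an assumption; the __dm_pr_register() callback shown above does not read that field.

```c
#include <stddef.h>

struct fake_path {                    /* hypothetical underlying path */
    unsigned long long key;           /* currently registered key, 0 if none */
    int broken;                       /* non-zero: every call on it fails */
};

struct fake_pr {                      /* mirrors the fields of struct dm_pr */
    unsigned long long old_key;
    unsigned long long new_key;
    int fail_early;
};

/* Per-path callout: install new_key, or fail on a broken path. */
static int fake_path_register(struct fake_path *p, const struct fake_pr *pr)
{
    if (p->broken)
        return -1;
    p->key = pr->new_key;
    return 0;
}

/* Visit every path; remember the first error, optionally stopping there. */
static int fake_call_pr(struct fake_path *paths, size_t n,
                        const struct fake_pr *pr)
{
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        int r = fake_path_register(&paths[i], pr);

        if (!r)
            continue;
        if (!ret)
            ret = r;
        if (pr->fail_early)
            break;
    }
    return ret;
}

/* Register on all paths; on failure, unregister wherever possible. */
static int fake_pr_register(struct fake_path *paths, size_t n,
                            unsigned long long old_key,
                            unsigned long long new_key)
{
    struct fake_pr pr = { old_key, new_key, 1 };
    int ret = fake_call_pr(paths, n, &pr);

    if (ret && new_key) {
        /* Roll back: replace new_key with 0 on every reachable path. */
        pr.old_key = new_key;
        pr.new_key = 0;
        pr.fail_early = 0;
        fake_call_pr(paths, n, &pr);
    }
    return ret;
}
```

The caller still sees the original error; the second pass only exists so that no path keeps a key that the other paths never received.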
static int dm_pr_reserve ( struct block_device * bdev , u64 key , enum pr_type type ,
2016-02-19 00:13:51 +03:00
u32 flags )
2015-10-15 15:10:51 +03:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
2018-04-03 22:05:12 +03:00
int r , srcu_idx ;
2015-10-15 15:10:51 +03:00
2018-04-03 23:54:10 +03:00
r = dm_prepare_ioctl ( md , & srcu_idx , & bdev ) ;
2015-10-15 15:10:51 +03:00
if ( r < 0 )
2018-04-03 22:05:12 +03:00
goto out ;
2015-10-15 15:10:51 +03:00
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_reserve )
r = ops - > pr_reserve ( bdev , key , type , flags ) ;
else
r = - EOPNOTSUPP ;
2018-04-03 22:05:12 +03:00
out :
dm_unprepare_ioctl ( md , srcu_idx ) ;
2015-10-15 15:10:51 +03:00
return r ;
}
static int dm_pr_release ( struct block_device * bdev , u64 key , enum pr_type type )
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
2018-04-03 22:05:12 +03:00
int r , srcu_idx ;
2015-10-15 15:10:51 +03:00
2018-04-03 23:54:10 +03:00
r = dm_prepare_ioctl ( md , & srcu_idx , & bdev ) ;
2015-10-15 15:10:51 +03:00
if ( r < 0 )
2018-04-03 22:05:12 +03:00
goto out ;
2015-10-15 15:10:51 +03:00
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_release )
r = ops - > pr_release ( bdev , key , type ) ;
else
r = - EOPNOTSUPP ;
2018-04-03 22:05:12 +03:00
out :
dm_unprepare_ioctl ( md , srcu_idx ) ;
2015-10-15 15:10:51 +03:00
return r ;
}
static int dm_pr_preempt ( struct block_device * bdev , u64 old_key , u64 new_key ,
2016-02-19 00:13:51 +03:00
enum pr_type type , bool abort )
2015-10-15 15:10:51 +03:00
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
2018-04-03 22:05:12 +03:00
int r , srcu_idx ;
2015-10-15 15:10:51 +03:00
2018-04-03 23:54:10 +03:00
r = dm_prepare_ioctl ( md , & srcu_idx , & bdev ) ;
2015-10-15 15:10:51 +03:00
if ( r < 0 )
2018-04-03 22:05:12 +03:00
goto out ;
2015-10-15 15:10:51 +03:00
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_preempt )
r = ops - > pr_preempt ( bdev , old_key , new_key , type , abort ) ;
else
r = - EOPNOTSUPP ;
2018-04-03 22:05:12 +03:00
out :
dm_unprepare_ioctl ( md , srcu_idx ) ;
2015-10-15 15:10:51 +03:00
return r ;
}
static int dm_pr_clear ( struct block_device * bdev , u64 key )
{
struct mapped_device * md = bdev - > bd_disk - > private_data ;
const struct pr_ops * ops ;
2018-04-03 22:05:12 +03:00
int r , srcu_idx ;
2015-10-15 15:10:51 +03:00
2018-04-03 23:54:10 +03:00
r = dm_prepare_ioctl ( md , & srcu_idx , & bdev ) ;
2015-10-15 15:10:51 +03:00
if ( r < 0 )
2018-04-03 22:05:12 +03:00
goto out ;
2015-10-15 15:10:51 +03:00
ops = bdev - > bd_disk - > fops - > pr_ops ;
if ( ops & & ops - > pr_clear )
r = ops - > pr_clear ( bdev , key ) ;
else
r = - EOPNOTSUPP ;
2018-04-03 22:05:12 +03:00
out :
dm_unprepare_ioctl ( md , srcu_idx ) ;
2015-10-15 15:10:51 +03:00
return r ;
}
static const struct pr_ops dm_pr_ops = {
. pr_register = dm_pr_register ,
. pr_reserve = dm_pr_reserve ,
. pr_release = dm_pr_release ,
. pr_preempt = dm_pr_preempt ,
. pr_clear = dm_pr_clear ,
} ;
2009-09-22 04:01:13 +04:00
static const struct block_device_operations dm_blk_dops = {
2020-07-01 11:59:43 +03:00
. submit_bio = dm_submit_bio ,
2022-03-05 05:08:04 +03:00
. poll_bio = dm_poll_bio ,
2005-04-17 02:20:36 +04:00
. open = dm_blk_open ,
. release = dm_blk_close ,
2006-10-03 12:15:15 +04:00
. ioctl = dm_blk_ioctl ,
2006-03-27 13:17:54 +04:00
. getgeo = dm_blk_getgeo ,
2018-10-12 13:08:49 +03:00
. report_zones = dm_blk_report_zones ,
2015-10-15 15:10:51 +03:00
. pr_ops = & dm_pr_ops ,
2005-04-17 02:20:36 +04:00
. owner = THIS_MODULE
} ;
2020-10-07 23:41:01 +03:00
static const struct block_device_operations dm_rq_blk_dops = {
. open = dm_blk_open ,
. release = dm_blk_close ,
. ioctl = dm_blk_ioctl ,
. getgeo = dm_blk_getgeo ,
. pr_ops = & dm_pr_ops ,
. owner = THIS_MODULE
} ;
2017-04-12 22:35:44 +03:00
static const struct dax_operations dm_dax_ops = {
. direct_access = dm_dax_direct_access ,
2020-02-28 19:34:54 +03:00
. zero_page_range = dm_dax_zero_page_range ,
2022-04-23 01:45:06 +03:00
. recovery_write = dm_dax_recovery_write ,
2017-04-12 22:35:44 +03:00
} ;
2005-04-17 02:20:36 +04:00
/*
* module hooks
*/
module_init ( dm_init ) ;
module_exit ( dm_exit ) ;
module_param ( major , uint , 0 ) ;
MODULE_PARM_DESC ( major , " The major number of the device mapper " ) ;
2013-09-13 02:06:12 +04:00
2013-09-13 02:06:12 +04:00
module_param ( reserved_bio_based_ios , uint , S_IRUGO | S_IWUSR ) ;
MODULE_PARM_DESC ( reserved_bio_based_ios , " Reserved IOs in bio-based mempools " ) ;
2016-02-22 20:16:21 +03:00
module_param ( dm_numa_node , int , S_IRUGO | S_IWUSR ) ;
MODULE_PARM_DESC ( dm_numa_node , " NUMA node for DM device memory allocations " ) ;
2021-02-10 23:26:23 +03:00
module_param ( swap_bios , int , S_IRUGO | S_IWUSR ) ;
MODULE_PARM_DESC ( swap_bios , " Maximum allowed inflight swap IOs " ) ;
2005-04-17 02:20:36 +04:00
MODULE_DESCRIPTION ( DM_NAME " driver " ) ;
MODULE_AUTHOR ( " Joe Thornber <dm-devel@redhat.com> " ) ;
MODULE_LICENSE ( " GPL " ) ;