Commit Graph

3185 Commits

Author SHA1 Message Date
Mike Snitzer
fe76cd88e6 dm thin: fix dangling bio in process_deferred_bios error path
If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-03-28 14:37:02 -04:00
Jose Castillo
a356e42620 dm mpath: print more useful warnings in multipath_message()
The warning message "Unrecognised multipath message received" is
displayed in two different situations in multipath_message(): when the
number of arguments passed is invalid and when the string passed in
argv[0] is not recognized.

Make it easier to identify where the problem is by making these warnings
more specific with additional context for each case.

Signed-off-by: Jose Castillo <jcastillo@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
3a01750964 dm-mpath: do not activate failed paths
activate_path() is run without a lock, so the path might be
set to failed before activate_path() had a chance to run.
This patch add a check for ->active in activate_path() to
avoid unnecessary overhead by calling functions which are known
to be failing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:25 -04:00
Mike Snitzer
9bf59a611a dm mpath: remove extra nesting in map function
Return early for case when no path exists, and when the
pathgroup isn't ready. This eliminates the need for
extra nesting for the the common case.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
36fcffcc65 dm mpath: remove map_io()
multipath_map() is now just a wrapper around map_io(), so we
can rename map_io() to multipath_map().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
e3bde04f1e dm mpath: reduce memory pressure when requeuing
When multipath needs to requeue I/O in the block layer the per-request
context shouldn't be allocated, as it will be freed immediately
afterwards anyway.  Avoiding this memory allocation will reduce memory
pressure during requeuing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:25 -04:00
Hannes Reinecke
3e9f1be1b4 dm mpath: remove process_queued_ios()
process_queued_ios() has served 3 functions:
  1) select pg and pgpath if none is selected
  2) start pg_init if requested
  3) dispatch queued IOs when pg is ready

Basically, a call to queue_work(process_queued_ios) can be replaced by
dm_table_run_md_queue_async(), which runs request queue and ends up
calling map_io(), which does 1), 2) and 3).

Exception is when !pg_ready() (which means either pg_init is running or
requested), then multipath_busy() prevents map_io() being called from
request_fn.

If pg_init is running, it should be ok as long as pg_init_done() does
the right thing when pg_init is completed, I.e.: restart pg_init if
!pg_ready() or call dm_table_run_md_queue_async() to kick map_io().

If pg_init is requested, we have to make sure the request is detected
and pg_init will be started.  pg_init is requested in 3 places:
  a) __choose_pgpath() in map_io()
  b) __choose_pgpath() in multipath_ioctl()
  c) pg_init retry in pg_init_done()
a) is ok because map_io() calls __pg_init_all_paths(), which does 2).
b) needs a call to __pg_init_all_paths(), which does 2).
c) needs a call to __pg_init_all_paths(), which does 2).

So this patch removes process_queued_ios() and ensures that
__pg_init_all_paths() is called at the appropriate locations.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Hannes Reinecke
e809917735 dm mpath: push back requests instead of queueing
There is no reason why multipath needs to queue requests internally for
queue_if_no_path or pg_init; we should rather push them back onto the
request queue.

And while we're at it we can simplify the conditional statement in
map_io() to make it easier to read.

Since mpath no longer does internal queuing of I/O the table info no
longer emits the internal queue_size.  Instead it displays 1 if queuing
is being used or 0 if it is not.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Mike Snitzer
9974fa2c6a dm table: add dm_table_run_md_queue_async
Introduce dm_table_run_md_queue_async() to run the request_queue of the
mapped_device associated with a request-based DM table.

Also add dm_md_get_queue() wrapper to extract the request_queue from a
mapped_device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Hannes Reinecke
17f4ff45b5 dm mpath: do not call pg_init when it is already running
This patch moves condition checks as a preparation of following
patches and has no effect on behaviour.
process_queued_ios() is the only caller of __pg_init_all_paths()
and 2 condition checks are moved from outside to inside without
side effects.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27 16:56:24 -04:00
Monam Agarwal
9cdb852004 dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
Replace rcu_assign_pointer(p, NULL) with RCU_INIT_POINTER(p, NULL).

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.  And in the
case of the NULL pointer, there is no structure to initialize.  So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL).

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
bfc6d41cee dm: stop using bi_private
Device mapper uses the bio structure's bi_private field as a pointer
to dm_target_io or dm_rq_clone_bio_info.  But a bio structure is
embedded in the dm_target_io and dm_rq_clone_bio_info structures, so the
pointer to the structure that contains the bio can be found with the
container_of() macro.

Remove the use of bi_private and use container_of() instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
d70ab4fb72 dm: remove dm_get_mapinfo
Remove dm_get_mapinfo() because no target uses it.  Targets can allocate
per-bio data using ti->per_bio_data_size, this is much more flexible
than union map_info.

Leave union map_info only for the request-based multipath target's use.
Also delete the unused "unsigned long long ll" field of union map_info.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:24 -04:00
Mikulas Patocka
473c36dfee dm: make dm_table_alloc_md_mempools static
Make the function dm_table_alloc_md_mempools static because it is not
called from another file.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Joe Thornber
5a32083d03 dm: take care to copy the space map roots before locking the superblock
In theory copying the space map root can fail, but in practice it never
does because we're careful to check what size buffer is needed.

But make certain we're able to copy the space map roots before
locking the superblock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed
2014-03-27 16:56:23 -04:00
Joe Thornber
a9d45396f5 dm transaction manager: fix corruption due to non-atomic transaction commit
The persistent-data library used by dm-thin, dm-cache, etc is
transactional.  If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.

Atomicity when committing a transaction is achieved by:

a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
   disk.

This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.

Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.

As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush().  Now
the unlocking must be done separately.

This issue was discovered by forcing io errors at the crucial time
using dm-flakey.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Heinz Mauelshagen
64ab346a36 dm cache: remove remainder of distinct discard block size
Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.

Completely remove all code that enabled a distinction between the
cache block size and discard block size.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Mike Snitzer
d132cc6d9e dm cache: prevent corruption caused by discard_block_size > cache_block_size
If the discard block size is larger than the cache block size we will
not properly quiesce IO to a region that is about to be discarded.  This
results in a race between a cache migration where no copy is needed, and
a write to an adjacent cache block that's within the same large discard
block.

Workaround this by limiting the discard_block_size to cache_block_size.
Also limit the max_discard_sectors to cache_block_size.

A more comprehensive fix that introduces range locking support in the
bio_prison and proper quiescing of a discard range that spans multiple
cache blocks is already in development.

Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Joe Thornber
428e469864 dm bitset: only flush the current word if it has been dirtied
This change offers a big performance boost for dm-era.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Joe Thornber
eec40579d8 dm: add era target
dm-era is a target that behaves similar to the linear target.  In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'.  Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.

Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.

dm-era is primarily expected to be paired with the dm-cache target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Heinz Mauelshagen
e893fba90c dm cache: fix access beyond end of origin device
In order to avoid wasting cache space a partial block at the end of the
origin device is not cached.  Unfortunately, the check for such a
partial block at the end of the origin device was flawed.

Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:

- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset

Otherwise, users can experience errors like the following:

 attempt to access beyond end of device
 dm-5: rw=0, want=20971520, limit=20971456
 ...
 device-mapper: cache: promotion failed; couldn't copy block

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:52:00 -04:00
Heinz Mauelshagen
8b9d966665 dm cache: fix truncation bug when copying a block to/from >2TB fast device
During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits.  The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().

Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.

Here is an example of how this bug manifests on an ext4 filesystem:

 EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
 JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:49:27 -04:00
Joe Thornber
cebc2de44d dm space map metadata: fix refcount decrement below 0 which caused corruption
This has been a relatively long-standing issue that wasn't nailed down
until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014,
see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html

From that report:
  "When decreasing the reference count of a metadata block with its
  reference count equals 3, we will call dm_btree_remove() to remove
  this enrty from the B+tree which keeps the reference count info in
  metadata device.

  The B+tree will try to rebalance the entry of the child nodes in each
  node it traversed, and the rebalance process contains the following
  steps.

  (1) Finding the corresponding children in current node (shadow_current(s))
  (2) Shadow the children block (issue BOP_INC)
  (3) redistribute keys among children, and free children if necessary (issue BOP_DEC)

  Since the update of a metadata block's reference count could be
  recursive, we will stash these reference count update operations in
  smm->uncommitted and then process them in a FILO fashion.

  The problem is that step(3) could free the children which is created
  in step(2), so the BOP_DEC issued in step(3) will be carried out
  before the BOP_INC issued in step(2) since these BOPs will be
  processed in FILO fashion. Once the BOP_DEC from step(3) tries to
  decrease the reference count of newly shadow block, it will report
  failure for its reference equals 0 before decreasing. It looks like we
  can solve this issue by processing these BOPs in a FIFO fashion
  instead of FILO."

Commit 5b564d80 ("dm space map: disallow decrementing a reference count
below zero") changed the code to report an error for this temporary
refcount decrement below zero.  So what was previously a harmless
invalid refcount became a hard failure due to the new error path:

 device-mapper: space map common: unable to decrement a reference count below 0
 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
 device-mapper: thin: 253:6: switching pool to read-only mode

This bug is in dm persistent-data code that is common to the DM thin and
cache targets.  So any users of those targets should apply this fix.

Fix this by applying recursive space map operations in FIFO order rather
than FILO.

Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801

Reported-by: Apollon Oikonomopoulos <apoikos@debian.org>
Reported-by: edwillam1007@gmail.com
Reported-by: Teng-Feng Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-03-07 12:02:47 -05:00
Joe Thornber
738211f70a dm thin: fix noflush suspend IO queueing
i) by the time DM core calls the postsuspend hook the dm_noflush flag
has been cleared.  So the old thin_postsuspend did nothing.  We need to
use the presuspend hook instead.

ii) There was a race between bios leaving DM core and arriving in the
deferred queue.

thin_presuspend now sets a 'requeue' flag causing all bios destined for
that thin to be requeued back to DM core.  Then it requeues all held IO,
and all IO on the deferred queue (destined for that thin).  Finally
postsuspend clears the 'requeue' flag.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:59 -05:00
Joe Thornber
18adc57779 dm thin: fix deadlock in __requeue_bio_list
The spin lock in requeue_io() was held for too long, allowing deadlock.
Don't worry, due to other issues addressed in the following "dm thin:
fix noflush suspend IO queueing" commit, this code was never called.

Fix this by taking the spin lock for a much shorter period of time.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:58 -05:00
Joe Thornber
3e1a069909 dm thin: fix out of data space handling
Ideally a thin pool would never run out of data space; the low water
mark would trigger userland to extend the pool before we completely run
out of space.  However, many small random IOs to unprovisioned space can
consume data space at an alarming rate.  Adjust your low water mark if
you're frequently seeing "out-of-data-space" mode.

Before this fix, if data space ran out the pool would be put in
PM_READ_ONLY mode which also aborted the pool's current metadata
transaction (data loss for any changes in the transaction).  This had a
side-effect of needlessly compromising data consistency.  And retry of
queued unserviceable bios, once the data pool was resized, could
initiate changes to potentially inconsistent pool metadata.

Now when the pool's data space is exhausted transition to a new pool
mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
may not be allocated.  This allows users to remove thin volumes or
discard data to recover data space.

The pool is no longer put in PM_READ_ONLY mode in response to the pool
running out of data space.  And PM_READ_ONLY mode no longer aborts the
pool's current metadata transaction.  Also, set_pool_mode() will now
notify userspace when the pool mode is changed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:58 -05:00
Mike Snitzer
07f2b6e038 dm thin: ensure user takes action to validate data and metadata consistency
If a thin metadata operation fails the current transaction will abort,
whereby causing potential for IO layers up the stack (e.g. filesystems)
to have data loss.  As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
thin metadata's superblock which:
1) requires the user verify the thin metadata is consistent (e.g. use
   thin_check, etc)
2) suggests the user verify the thin data is consistent (e.g. use fsck)

The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
to run thin_repair.

On metadata operation failure: abort current metadata transaction, set
pool in read-only mode, and now set the needs_check flag.

As part of this change, constraints are introduced or relaxed:
* don't allow a pool to transition to write mode if needs_check is set
* don't allow data or metadata space to be resized if needs_check is set
* if a thin pool's metadata space is exhausted: the kernel will now
  force the user to take the pool offline for repair before the kernel
  will allow the metadata space to be extended.

Also, update Documentation to include information about when the thin
provisioning target commits metadata, how it handles metadata failures
and running out of space.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2014-03-05 15:25:35 -05:00
Mike Snitzer
cdc2b41584 dm thin: synchronize the pool mode during suspend
Commit b5330655 ("dm thin: handle metadata failures more consistently")
increased potential for the pool's mode to be changed in response to
metadata operation failures.

When the pool mode is changed it isn't synchronized with the mode in
pool_features stored in the target's context (ti->private) that is used
as the basis for (re)establishing the pool mode during resume via
bind_control_target.

It is important that we synchronize the pool mode when it is changed
otherwise the pool may experience and unexpected mode transition on the
next resume (especially if there was no new table load).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-03-04 11:17:51 -05:00
Mikulas Patocka
2c945820ca dm snapshot: fix metadata corruption
Commit 55494bf294 ("dm snapshot: use dm-bufio") broke snapshots.
Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
involved reading exception areas one by one into ps->area and inserting
those exceptions into the hash table.  Commit 55494bf294 changed
it so that dm-bufio with prefetch is used to load exceptions in batchs.
Exceptions are loaded correctly, but ps->area is left uninitialized.
When a new exception is allocated, it is stored in this uninitialized
ps->area which will be written to the disk.  This causes metadata
corruption.

Fix this corruption by copying the last area that was read via dm-bufio
into ps->area.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-03 17:58:13 -05:00
Mike Snitzer
c64d240df3 dm: fix Kconfig indentation
Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.

Doing so fixes indentation for other DM config options.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-03 17:31:07 -05:00
Heinz Mauelshagen
14f398ca2f dm cache mq: fix memory allocation failure for large cache devices
The memory allocated for the multiqueue policy's hash table doesn't need
to be physically contiguous.  Use vzalloc() instead of kzalloc().
Fedora has been carrying this fix since 10/10/2013.

Failure seen during creation of a 10TB cached device with a 2048 sector
block size and 411GB cache size:

 dmsetup: page allocation failure: order:9, mode:0x10c0d0
 CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3
 Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a       12/30/2011
  000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928
  ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928
  ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0
 Call Trace:
  [<ffffffff81387ab4>] dump_stack+0x19/0x1b
  [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124
  [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e
  [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e
  [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f
  [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88
  [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b
  [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq]
  [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq]
  [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache]
  [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache]
  [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache]
  [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod]
  [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod]
  [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod]
  [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod]
  [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod]
  [<ffffffff81101181>] vfs_ioctl+0x21/0x34
  [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4
  [<ffffffff810f4d2e>] ? ____fput+0x9/0xb
  [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92
  [<ffffffff81101a68>] SyS_ioctl+0x52/0x82
  [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-28 12:18:29 -05:00
Heinz Mauelshagen
e0d849fad7 dm cache: fix truncation bug when mapping I/O to >2TB fast device
When remapping a block to the cache's fast device that is larger than
2TB we must not truncate the destination sector to 32bits.  The 32bit
temporary result of from_cblock() was being overflowed in
remap_to_cache() due to the logical left shift.

Use an intermediate 64bit type to store the 32bit from_cblock() result
to fix the overflow.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-28 09:23:02 -05:00
Mike Snitzer
7d48935eff dm thin: allow metadata space larger than supported to go unused
It was always intended that a user could provide a thin metadata device
that is larger than the max supported by the on-disk format.  The extra
space would just go unused.

Unfortunately that never worked.  If the user attempted to use a larger
metadata device on creation they would get an error like the following:

 device-mapper: space map common: space map too large
 device-mapper: transaction manager: couldn't create metadata space map
 device-mapper: thin metadata: tm_create_with_sm failed
 device-mapper: table: 252:17: thin-pool: Error creating metadata object
 device-mapper: ioctl: error adding target to table

Fix this by allowing the initial metadata space map creation to cap its
size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).

Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
the sizeof the disk_bitmap_header.  So the supported maximum metadata
size is a bit smaller (reduced from 33423360 to 33292800 sectors).

Lastly, remove the "excess space will not be used" warning message from
get_metadata_dev_size(); it resulted in printing the warning multiple
times.  Factor out warn_if_metadata_device_too_big(), call it from
pool_ctr() and maybe_resize_metadata_dev().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-27 11:49:08 -05:00
Hannes Reinecke
a1989b3300 dm mpath: fix stalls when handling invalid ioctls
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not.  So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately.  This fix resolves numerous instances of:

 udevd[]: worker [] unexpectedly returned with status 0x0100

that have been seen during testing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-26 09:44:44 -05:00
Mike Snitzer
1acacc0784 dm thin: fix the error path for the thin device constructor
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:

 device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
 device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.

Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-02-24 11:41:18 -05:00
Mikulas Patocka
f3a44fe060 dm raid1: fix immutable biovec related BUG when retrying read bio
When restoring bi_end_io, increase bi_remaining before retrying the bio
to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-02-18 10:48:57 -05:00
Mikulas Patocka
d73f990729 dm io: fix I/O to multiple destinations
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") broke dm-mirror due to dm-io breakage.

dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
DM_IO_VMA) that iterate over pages where the I/O should be performed.

The switch to immutable biovecs changed the DM_IO_BVEC iterator to
DM_IO_BIO.  Before this change the iterator stored the pointer to a bio
vector in the dpages structure.  The iterator incremented the pointer in
the dpages structure as it advanced over the pages.  After the immutable
biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
the dpages structure and uses bio_advance to change the bio as it
advances.

The problem is that the function dispatch_io stores the content of the
dpages structure into the variable old_pages and restores it before
issuing I/O to each of the devices.  Before the change, the statement
"*dp = old_pages;" restored the iterator to its starting position.
After the change, struct dpages holds a pointer to the bio, thus the
statement "*dp = old_pages;" doesn't restore the iterator.

Consequently, in the context of dm-mirror: only the first mirror leg is
written correctly, the kernel locks up when trying to write the other
mirror legs because the number of sectors to write in the where->count
variable doesn't match the number of sectors returned by the iterator.

This patch fixes the bug by partially reverting the original patch - it
changes the code so that struct dpages holds a pointer to the bio vector,
so that the statement "*dp = old_pages;" restores the iterator correctly.

The field "context_u" holds the offset from the beginning of the current
bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-02-17 11:00:05 -05:00
Mike Snitzer
4d1662a30d dm thin: avoid metadata commit if a pool's thin devices haven't changed
Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit.  This commit occurs regardless of whether
any thin devices have made changes.

Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().

Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-02-17 11:00:05 -05:00
Mike Snitzer
80ae49aaed dm cache: do not add migration to completed list before unhooking bio
When completing an overwrite bio, in overwrite_endio(), the associated
migration should not be added to the 'completed_migrations' until the
bio's fields are restored with dm_unhook_bio().

Otherwise, do_worker() can race to process 'completed_migrations' before
dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
unlikely to cause any problems given the current code but should be
fixed on the basis of correctness.

Also, the cache's spinlock only needs to be held when manipulating the
'completed_migrations' list -- other changes don't need protection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-17 11:00:05 -05:00
Mike Snitzer
c6eda5e81c dm cache: move hook_info into common portion of per_bio_data structure
Commit c9d28d5d ("dm cache: promotion optimisation for writes")
incorrectly placed the 'hook_info' member in the writethrough-only
portion of the per_bio_data structure.

Given that the overwrite optimization may be used for writeback the
'hook_info' member must be placed above the 'cache' member of the
per_bio_data structure.  Any members above 'cache' are available from
both writeback and writethrough modes' per_bio_data structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-02-17 11:00:05 -05:00
Linus Torvalds
bd3813d52d Two bugfixes for md
both tagged for -stable
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAUvxBNjnsnt1WYoG5AQLymhAAnKznI2YhFVqK21mpo1l2JDSkwxqIBvBZ
 hcW24zF6dNU4cJFmRQqOeL2AkzHWSqX4/J/DGXvI9wFll1CkdNs+UVQJ12Pod3gK
 gTDmqRCe/x+bQxrOR5VfyKv0slia12vn9mqfDd2mX41wcr7ceHsdHbemPhgIcUCC
 WLERQi9Yn/Eb2+rltTzZ3XaHwIlIozqZ0yRZ6wH45iyuk+uiholEJjJp8LOWpzTe
 rKE4s5qd1NAAJsrMHZ11mZWq/4VtgYJ3AcWVXVWqBPxmlI0FnBPU/KVpJkAcrVjB
 N6tqmR1/nHcrGlaOgWSS6UfNGVMe3L2HJpaIdjTM65Tdb+WFpEPevTy9qYsLC3Ic
 zV/KmErUtSFMJKYBr9YyRnSpXtnSDo8BeRsWJm9ZaA5UV9yUVBNwWDFNFP/Bkqze
 v4wLMRj54U5fjRZBq/PaFbk/A2nDCkGHC4uZCgJ+Mwhoo6rxpho/oKBjBBlmpw3q
 4Q0yWgZ8F/ZWFUrGzi1TY3tdYrl3yCOpZ3l5aRTtTqlU3aVShIIiKCKDvs2v8l6h
 C5igUbnW5BtsMMCOwdULc/lHgN3vMbJEA+7YdmeouDEY5QAk0O6nxan3y+cbtC5u
 F+++tkWzSQZJRGhdAxdAXsABYfHiR7Wnft96+iMpnQYbm35CdYYwlOhhl0iI/+Ec
 FcpDXOz9faA=
 =J3I5
 -----END PGP SIGNATURE-----

Merge tag 'md/3.14-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Two bugfixes for md

  both tagged for -stable"

* tag 'md/3.14-fixes' of git://neil.brown.name/md:
  md/raid5: Fix CPU hotplug callback registration
  md/raid1: restore ability for check and repair to fix read errors.
2014-02-14 12:48:16 -08:00
Oleg Nesterov
789b5e0315 md/raid5: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the code in raid5 this way to fix the deadlock with callback
registration.

Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-02-13 13:46:45 +11:00
NeilBrown
1877db7558 md/raid1: restore ability for check and repair to fix read errors.
commit 30bc9b5387
    md/raid1: fix bio handling problems in process_checks()

Move the bio_reset() to a point before where BIO_UPTODATE is checked,
so that check now always report that the bio is uptodate, even if it is not.

This causes process_check() to sometimes treat read-errors as
successful matches so the good data isn't written out.

This patch preserves the flag until it is needed.

Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
an even worse bug).  So suitable for any -stable since 3.10.

Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org (3.10+)
Fixed: 30bc9b5387
Signed-off-by: NeilBrown <neilb@suse.de>
2014-02-05 12:26:04 +11:00
Jens Axboe
96d2e8b5e2 Merge branch 'bcache-for-3.14' of git://evilpiepirate.org/~kent/linux-bcache into for-linus 2014-01-30 12:57:55 -07:00
Linus Torvalds
53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Nicholas Swenson
e3b4825b85 bcache: bugfix - gc thread now gets woken when cache is full
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-29 13:06:42 -08:00
Kent Overstreet
3572324af0 bcache: Minor fixes from kbuild robot
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-29 13:06:41 -08:00
Darrick J. Wong
9471744767 bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:

BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
	     GC_SECTORS_USED(g) + KEY_SIZE(k),
	     (1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));

In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits.  Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field.  In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.

Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED.  Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2014-01-29 13:06:15 -08:00
Linus Torvalds
5c85121bf6 md updates for 3.14
All bug fixes, two tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAUuG8WTnsnt1WYoG5AQKW7A//V93TYUiUAG6zNUFrNZjuoXP0ym3jlpkH
 eIIFdcV7rr0Irgtd8+s9cW8Cjsbq3d/vMbFlwP1Co32mnCnFFojeKtCvM9GqkYrH
 o4Zr1nVAVYzKO4awByK3wBT9WbzEc/XlDgYQIpZExIYeZdzLOm6HyvlRbcE86Ug5
 QoGYOUlLu4LUZmFgB9zQ7JM0GACV5pS1afSObtACj2t2x5GVHNU84u+M+D8urPXO
 wnf+AIAzquh5F+8MX+DxmMEUaSzUHf8fXOM3jYVbzPI71SpaHssL4SwBn+j4I/8/
 SCSqeIh7qMSuqUy63/iHKCy5qAgNuRdL9fYlOTkpxzHm81Ddj8u7fySsApggVOa2
 yeKTkSRlsMFeu+LiGKNi/fINVxboaoYJVZ2DTNtKxSuW2VL2aPNz1Qjq4QnR3nSI
 LpaB3VeVKdMsH8Em1a8cgZWcjo5YFAcNtUnJq2fvj9VZ3SJNw4ZoKDL+l718iGao
 xIwAXMSafAHQVAAaNVFkwrea13TeOyxikY5Ra4vWfm+Fw8TzmYq5DqO0zaILwdAJ
 2FnNj2/2y3hk2K7qBcEvjjEakxPlTwzrzxZMfJDRMuQLqvrjbXiMGOnWzgl1D/9x
 4/uPjeFZLG7byxmIyg4Y83NkPgkWnRPpGK98r26pUH1UgnRF0a5aUFXQk7rsQrU6
 noRkZ9EPD8s=
 =d81E
 -----END PGP SIGNATURE-----

Merge tag 'md/3.14' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "All bug fixes, two tagged for -stable"

* tag 'md/3.14' of git://neil.brown.name/md:
  md/raid5: close recently introduced race in stripe_head management.
  md/raid5: fix long-standing problem with bitmap handling on write failure.
  md: check command validity early in md_ioctl().
  md: ensure metadata is writen after raid level change.
  md/raid10: avoid fullsync when not necessary.
  md: allow a partially recovered device to be hot-added to an array.
  md: Change handling of save_raid_disk and metadata update during recovery.
2014-01-24 17:41:50 -08:00