3365 Commits

Author SHA1 Message Date
Slava Pestov
60ae81eee8 bcache: bcache_write tracepoint was crashing
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
8e09480806 bcache: fix typo in bch_bkey_equal_header
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Kent Overstreet
501d52a90c bcache: Allocate bounce buffers with GFP_NOWAIT
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
2014-08-04 15:23:03 -07:00
Kent Overstreet
bcf090e004 bcache: Make sure to pass GFP_WAIT to mempool_alloc()
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
9e5c353510 bcache: fix uninterruptible sleep in writeback thread
There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
c5aa4a3157 bcache: wait for buckets when allocating new btree root
Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
a664d0f05a bcache: fix crash on shutdown in passthrough mode
We never started the writeback thread in this case, so don't stop it.
2014-08-04 15:23:03 -07:00
Slava Pestov
e5112201c1 bcache: fix lockdep warnings on shutdown 2014-08-04 15:23:03 -07:00
Slava Pestov
8b326d3a2a bcache allocator: send discards with correct size 2014-08-04 15:23:03 -07:00
Surbhi Palande
dbd810ab67 bcache: Fix to remove the rcu_sched stalls.
while loop was executing infinitely.
This fix ends the while loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
9aa61a992a bcache: Fix a journal replay bug
journal replay wansn't validating pointers with bch_extent_invalid() before
derefing, fixed

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
5b1016e62f bcache: Fix a bug when detaching
After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Mikulas Patocka
56b1ebf2d9 dm switch: efficiently support repetitive patterns
Add support for quickly loading a repetitive pattern into the
dm-switch target.

In the "set_regions_mappings" message, the user may now use "Rn,m" as
one of the arguments.  "n" and "m" are hexadecimal numbers.  The "Rn,m"
argument repeats the last "n" arguments in the following "m" slots.

For example:
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2

Requested-by: Jay Wang <jwang@nimblestorage.com>
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:37 -04:00
Mikulas Patocka
99eb1908e6 dm switch: factor out switch_region_table_read
Move code that reads the table to a switch_region_table_read.
It will be needed for the next commit.  No functional change.

Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer
b02465308f dm cache: set minimum_io_size to cache's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size.  This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer
fdfb4c8c1a dm thin: set minimum_io_size to pool's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the thin-pool's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update pool_io_hints() to set both minimum_io_size and optimal_io_size
to the thin-pool's data block size.  This fixes an issue reported where
mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
on a normal linear LV of comparable size, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1003227

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00
Mikulas Patocka
298a9fa08a dm crypt: use per-bio data
Change dm-crypt so that it uses auxiliary data allocated with the bio.

Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it).  It
previously only used mempool allocations.

Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.

This patch changes it so that the first dm_crypt_io and ablkcipher_request
are allocated with the bio (using target per_bio_data_size option).  If
the request needs additional values, they are allocated from the mempool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00
Mikulas Patocka
a7ffb6a533 dm table: make dm_table_supports_discards static
The function dm_table_supports_discards is only called from
dm-table.c:dm_table_set_restrictions().  So move it above
dm_table_set_restrictions and make it static.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:34 -04:00
Mike Snitzer
895b47d798 dm cache metadata: use dm-space-map-metadata.h defined size limits
Commit 7d48935e cleaned up the persistent-data's space-map-metadata
limits by elevating them to dm-space-map-metadata.h.  Update
dm-cache-metadata to use these same limits.

The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
sizeof the disk_bitmap_header.  So the supported maximum metadata size
is a bit smaller (reduced from 33423360 to 33292800 sectors).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber
304affaa88 dm cache: fail migrations in the do_worker error path
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber
8c081b52c6 dm cache: simplify deferred set reference count increments
Factor out inc_and_issue and inc_ds helpers to simplify deferred set
reference count increments.  Also cleanup cache_map to consistently call
cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.

No functional change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:32 -04:00
Joe Thornber
e5aea7b49f dm thin: relax external origin size constraints
Track the size of any external origin.  Previously the external origin's
size had to be a multiple of the thin-pool's block size, that is no
longer a requirement.  In addition, snapshots that are larger than the
external origin are now supported.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:32 -04:00
Joe Thornber
50f3c3efdd dm thin: switch to an atomic_t for tracking pending new block preparations
Previously we used separate boolean values to track quiescing and
copying actions.  By switching to an atomic_t we can support blocks that
need a partial copy and partial zero.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:31 -04:00
Mike Snitzer
6afbc01d75 dm mpath: eliminate pg_ready() wrapper
pg_ready() is not comprehensive in its logic and only serves to
obfuscate code.  Replace pg_ready() with the appropriate logic in
multipath_map().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:31 -04:00
Joe Thornber
97e7cdf12b dm io: simplify dec_count and sync_io
Remove the io struct off the stack in sync_io() and allocate it from
the mempool like is done in async_io().

dec_count() now always calls a callback function and always frees the io
struct back to the mempool (so sync_io and async_io share this pattern).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:30 -04:00
Anssi Hannula
44fa816bb7 dm cache: fix race affecting dirty block count
nr_dirty is updated without locking, causing it to drift so that it is
non-zero (either a small positive integer, or a very large one when an
underflow occurs) even when there are no actual dirty blocks.  This was
due to a race between the workqueue and map function accessing nr_dirty
in parallel without proper protection.

People were seeing under runs due to a race on increment/decrement of
nr_dirty, see: https://lkml.org/lkml/2014/6/3/648

Fix this by using an atomic_t for nr_dirty.

Reported-by: roma1390@gmail.com
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-08-01 12:25:22 -04:00
Greg Thelen
d8c712ea47 dm bufio: fully initialize shrinker
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to
struct shrinker assuming that all shrinkers were zero filled.  The dm
bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
in flags.  So far the only defined flags bit is SHRINKER_NUMA_AWARE.
But there are proposed patches which add other bits to shrinker.flags
(e.g. memcg awareness).

Rather than simply initializing the shrinker, this patch uses kzalloc()
when allocating the dm_bufio_client to ensure that the embedded shrinker
and any other similar structures are zeroed.

This fixes theoretical over aggressive shrinking of dm bufio objects.
If the uninitialized dm_bufio_client.shrinker.flags contains
SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
each numa node rather than just once.  This has been broken since 3.12.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
2014-08-01 12:07:21 -04:00
NeilBrown
af5628f05d md: disable probing for md devices 512 and over.
The way md devices are traditionally created in the kernel
is simply to open the device with the desired major/minor number.

This can be problematic as some support tools, notably udev and
programs run by udev, can open a device just to see what is there, and
find that it has created something.  It is easy for a race to cause
udev to open an md device just after it was destroy, causing it to
suddenly re-appear.

For some time we have had an alternate way to create md devices
  echo md_somename > /sys/modules/md_mod/paramaters/new_array

This will always use a minor number of 512 or higher, which mdadm
normally avoids.
Using this makes the creation-by-opening unnecessary, but does
not disable it, so it is still there to cause problems.

This patch disable probing for devices with a major of 9 (MD_MAJOR)
and a minor of 512 and up.  This devices created by writing to
new_array cannot be re-created by opening the node in /dev.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-07-31 13:54:54 +10:00
NeilBrown
2446dba03f md/raid1,raid10: always abort recover on write error.
Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).

This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices).  In this case
the bitmap bit will be cleared, but it really shouldn't.

The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.

If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.

As the bug can result in data corruption the patch is suitable for
-stable.  For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.

Original-from: jiao hui <jiaohui@bwstor.com.cn>
Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
2014-07-31 10:16:52 +10:00
Ingo Molnar
ca5bc6cd5d Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:03:00 +02:00
Linus Torvalds
55ae1bd0d2 Fix the dm-thinp and dm-cache targets to disallow changing the data
device's block size.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTyIuDAAoJEMUj8QotnQNaKOcH/iC5X88HYBke3myjj8fkQelw
 05b1slxnKDINTE+L2eBzuyqzcNm8jLq02ltf9x8VuTxdzioR353PVWHfEmVkUYhJ
 DSnPPtLrfF+FsoABWEcjYcHyguUusgpZ9su94yctDErJcscgs9+7hJJhNKSCf+cW
 VmthtG4vXOdGP2Fl9IGQIzbGgwVWfT1QZN7yhFX2WGwgpBP4u4a9b4kY+sVQjfuz
 lcqy0/MTrsI63TATaGeiILbWh86BNxaoeCe+gBXMk6uvPBaJkGCo9o4OZjbe0d0f
 8wnedBiew8OlEZJAVEjxm+eNMukjeAcRE4gz/qTyaYlxwOqXQJCbsGyraWxpnXM=
 =ZDsD
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Fix the dm-thinp and dm-cache targets to disallow changing the data
  device's block size"

* tag 'dm-3.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache metadata: do not allow the data block size to change
  dm thin metadata: do not allow the data block size to change
2014-07-18 06:25:05 -10:00
NeilBrown
743162013d sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:39 +02:00
Mike Snitzer
048e5a07f2 dm cache metadata: do not allow the data block size to change
The block size for the dm-cache's data device must remained fixed for
the life of the cache.  Disallow any attempt to change the cache's data
block size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-07-15 14:07:50 -04:00
Mike Snitzer
9aec8629ec dm thin metadata: do not allow the data block size to change
The block size for the thin-pool's data device must remained fixed for
the life of the thin-pool.  Disallow any attempt to change the
thin-pool's data block size.

It should be noted that attempting to change the data block size via
thin-pool table reload will be ignored as a side-effect of the thin-pool
handover that the thin-pool target does during thin-pool table reload.

Here is an example outcome of attempting to load a thin-pool table that
reduced the thin-pool's data block size from 1024K to 512K.

Before:
kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks

After:
kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
kernel: device-mapper: ioctl: error adding target to table

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-07-15 14:05:26 -04:00
Linus Torvalds
67b9d76f9e . Fix DM multipath IO hang regression from 3.15 due to logic bug in
multipath_busy.  This impacted cable-pull testing and also the ability
   to boot with IPR SCSI on a POWER8 box.
 
 . Fix possible deadlock with deferred device removal by using a new
   dedicated workqueue rather than using the system workqueue.
 
 . Fix NULL pointer crash due to race condition in dm-io's wake up code
   for sync_io by using a completion.
 
 . Update dm-crypt and dm-zero author name following legal name change;
   this is important to Jana so I didn't see any reason to hold it back.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTv+hvAAoJEMUj8QotnQNawwMH/2yQ7AE3dh44jGr1fp0UEP8e
 Vd7HWtUJAm4+lYkPH7AjLCw3YBwWh/ajLXAwMpPBI878o5sgoWTfnq0hbecqoWkt
 5EugETiZ20C3K/llNFpw9xdtlObFwI21WUGqmu8ygYvfSvdbg6THPT5o8BdtEvnb
 MDBrrrpBpUwMCGw3v7jIoYrKZbWmp46iy5KwVqBbXnD3shpOU8KpasyIOrqlrqjJ
 z7BzfprN6ut1zaVs83N4iPMPnSPrIloUisGpPn1r74qRYUv/AXQgiv09WPA3keTN
 erRGFU9Mr0I4MGOLTuqHyCVO0t4tze1pL8jwEk29GUkXXcr9Is4p9I307Cm7WvE=
 =pBlO
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM multipath IO hang regression from 3.15 due to logic bug in
   multipath_busy.  This impacted cable-pull testing and also the
   ability to boot with IPR SCSI on a POWER8 box.

 - Fix possible deadlock with deferred device removal by using a new
   dedicated workqueue rather than using the system workqueue.

 - Fix NULL pointer crash due to race condition in dm-io's wake up code
   for sync_io by using a completion.

 - Update dm-crypt and dm-zero author name following legal name change;
   this is important to Jana so I didn't see any reason to hold it back.

* tag 'dm-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm mpath: fix IO hang due to logic bug in multipath_busy
  dm io: fix a race condition in the wake up code for sync_io
  dm crypt, dm zero: update author name following legal name change
  dm: allocate a special workqueue for deferred device removal
2014-07-11 09:33:36 -07:00
Jun'ichi Nomura
7a7a3b45fe dm mpath: fix IO hang due to logic bug in multipath_busy
Commit e80991773 ("dm mpath: push back requests instead of queueing")
modified multipath_busy() to return true if !pg_ready().  pg_ready()
checks the current state of the multipath device and may return false
even if a new IO is needed to change the state.

Bart Van Assche reported that he had multipath IO lockup when he was
performing cable pull tests.  Analysis showed that the multipath
device had a single path group with both paths active, but that the
path group itself was not active.  During the multipath device state
transitions 'queue_io' got set but nothing could clear it.  Clearing
'queue_io' only happens in __choose_pgpath(), but it won't be called
if multipath_busy() returns true due to pg_ready() returning false
when 'queue_io' is set.

As such the !pg_ready() check in multipath_busy() is wrong because new
IO will not be sent to multipath target and the multipath state change
won't happen.  That results in multipath IO lockup.

The intent of multipath_busy() is to avoid unnecessary cycles of
dequeue + request_fn + requeue if it is known that the multipath
device will requeue.

Such "busy" situations would be:
  - path group is being activated
  - there is no path and the multipath is setup to requeue if no path

Fix multipath_busy() to return "busy" early only for these specific
situations.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.15
2014-07-10 16:44:15 -04:00
Joe Thornber
10f1d5d111 dm io: fix a race condition in the wake up code for sync_io
There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread.  If the thread
is spuriously woken immediately after the decrement it may exit,
making the on stack io struct invalid, yet the dec_count could still
be using it.

Fix this race by using a completion in sync_io() and dec_count().

Reported-by: Minfei Huang <huangminfei@ucloud.cn>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2014-07-10 16:44:14 -04:00
Jana Saout
bf14299f1c dm crypt, dm zero: update author name following legal name change
Signed-off-by: Jana Saout <jana@saout.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-07-10 16:44:14 -04:00
Mikulas Patocka
acfe0ad74d dm: allocate a special workqueue for deferred device removal
The commit 2c140a246dc ("dm: allow remove to be deferred") introduced a
deferred removal feature for the device mapper.  When this feature is
used (by passing a flag DM_DEFERRED_REMOVE to DM_DEV_REMOVE_CMD ioctl)
and the user tries to remove a device that is currently in use, the
device will be removed automatically in the future when the last user
closes it.

Device mapper used the system workqueue to perform deferred removals.
However, some targets (dm-raid1, dm-mpath, dm-stripe) flush work items
scheduled for the system workqueue from their destructor.  If the
destructor itself is called from the system workqueue during deferred
removal, it introduces a possible deadlock - the workqueue tries to flush
itself.

Fix this possible deadlock by introducing a new workqueue for deferred
removals.  We allocate just one workqueue for all dm targets.  The
ability of dm targets to process IOs isn't dependent on deferred removal
of unused targets, so a deadlock due to shared workqueue isn't possible.

Also, cleanup local_init() to eliminate potential for returning success
on failure.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-07-10 16:44:13 -04:00
NeilBrown
133d4527ea md: flush writes before starting a recovery.
When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).

If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.

Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.

We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not.  That
depends on details of the particular raid level and particular write
request.

This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel.  It applied correctly to 3.0 at
least and will minor editing to earlier kernels.

Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
2014-07-03 10:44:45 +10:00
NeilBrown
9bd3592032 md: make sure GET_ARRAY_INFO ioctl reports correct "clean" status
If an array has a bitmap, the when we set the "has bitmap" flag we
incorrectly clear the "is clean" flag.

"is clean" isn't really important when a bitmap is present, but it is
best to get it right anyway.

Reported-by: George Duffield <forumscollective@gmail.com>
Link: http://lkml.kernel.org/CAG__1a4MRV6gJL38XLAurtoSiD3rLBTmWpcS5HYvPpSfPR88UQ@mail.gmail.com
Fixes: 36fa30636fb84b209210299684e1be66d9e58217 (v2.6.14)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-07-03 10:44:31 +10:00
Linus Torvalds
0e04c641b1 . Add dm_accept_partial_bio interface to DM core to allow DM targets
to only process a portion of a bio, the remainder being sent in the
   next bio.  This enables the old dm snapshot-origin target to only
   split write bios on chunk boundaries, read bios are now sent to the
   origin device unchanged.
 
 . Add DM core support for disabling WRITE SAME if the underlying SCSI
   layer disables it due to command failure.
 
 . Reduce lock contention in DM's bio-prison.
 
 . A few small cleanups and fixes to dm-thin and dm-era.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTmavIAAoJEMUj8QotnQNay/EIAJI6lEFlQK3gG830+Yaw1m2U
 7mnnX/Rd/N6RnccyhmFqk7Xu0REM7gJEgWicTQjR58La2DKFi072N0mwgHgHfB8f
 1oOvUKN5Nb/a1CmRcVzSO0sbYcJn9I1r+k0buqfFHivU68wuedG+MrVya3YzOjvC
 63MQiu4+3icDprcToxn+etz75FhrFps5QAsS0cH6t1VZqFCGIzxqgUKgY8zGo1CH
 P9hkYpJhhJe2aDh4vlFvpFYVFXt9zPoR+MkqXFW6Dn9GbR36gGaldTwnyhapla4N
 XYHDEdETcTqm2srOM5wW+jD0p+Id9/Lmd5Ld5J2zyARCYn/EG/SDCbiaVqOxEnc=
 =tv1B
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:
 "This pull request is later than I'd have liked because I was waiting
  for some performance data to help finally justify sending the
  long-standing dm-crypt cpu scalability improvements upstream.

  Unfortunately we came up short, so those dm-crypt changes will
  continue to wait, but it seems we're not far off.

   . Add dm_accept_partial_bio interface to DM core to allow DM targets
     to only process a portion of a bio, the remainder being sent in the
     next bio.  This enables the old dm snapshot-origin target to only
     split write bios on chunk boundaries, read bios are now sent to the
     origin device unchanged.

   . Add DM core support for disabling WRITE SAME if the underlying SCSI
     layer disables it due to command failure.

   . Reduce lock contention in DM's bio-prison.

   . A few small cleanups and fixes to dm-thin and dm-era"

* tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: update discard_granularity to reflect the thin-pool blocksize
  dm bio prison: implement per bucket locking in the dm_bio_prison hash table
  dm: remove symbol export for dm_set_device_limits
  dm: disable WRITE SAME if it fails
  dm era: check for a non-NULL metadata object before closing it
  dm thin: return ENOSPC instead of EIO when error_if_no_space enabled
  dm thin: cleanup noflush_work to use a proper completion
  dm snapshot: do not split read bios sent to snapshot-origin target
  dm snapshot: allocate a per-target structure for snapshot-origin target
  dm: introduce dm_accept_partial_bio
  dm: change sector_count member in clone_info from sector_t to unsigned
2014-06-12 13:33:29 -07:00
Lukas Czerner
09869de57e dm thin: update discard_granularity to reflect the thin-pool blocksize
DM thinp already checks whether the discard_granularity of the data
device is a factor of the thin-pool block size.  But when using the
dm-thin-pool's discard passdown support, DM thinp was not selecting the
max of the underlying data device's discard_granularity and the
thin-pool's block size.

Update set_discard_limits() to set discard_granularity to the max of
these values.  This enables blkdev_issue_discard() to properly align the
discards that are sent to the DM thin device on a full block boundary.
As such each discard will now cover an entire DM thin-pool block and the
block will be reclaimed.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-06-11 16:56:12 -04:00
Heinz Mauelshagen
adcc44472b dm bio prison: implement per bucket locking in the dm_bio_prison hash table
Split the single per bio-prison lock by using per bucket locking.  Per
bucket locking benefits both dm-thin and dm-cache targets by reducing
bio-prison lock contention.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-06-11 16:48:54 -04:00
Linus Torvalds
8d0304e69d Assorted md fixes for 3.16
Mostly performance improvements with a few corner-case bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU5fevznsnt1WYoG5AQKzNxAAgjZPg6LZgl0wEHJi09ykqLK9dSkglz2B
 RY5IEWPEMcsGJQbSX7fEME5WSHkqKBIHx43xmdzijPCHyWLnF4vQRubNAi/Wo8PC
 yfJyI5hMrG8+0rNKnmXWeG+NLGj7l7j+wkExe27VBlu2+6ATvIUCa34kjOWFDJJx
 xAN9/t22XGly776SiuLXTpMhA0pRWH4sqijJhvM5EqPlsEUjGxvQU/URwUPYzhYr
 iCbvY64uKB1Crh+vn7yP1IUWc77aDOoUPIxO50m0GXOQFU/wbQta1NAFmMeu8tIo
 3QRlGzP+jRJFBV7jLii4nHTMSVxBmD2zMppxVtDoq/egWhyZlJhh2Kd5sUKo2r9V
 8bGQm7e1JlsDsRMVXxSlx3TebZDfJhtnFNhYb6JPlpWM5EYjCzXXfJUl3/6LfKX6
 BKJ800xvabr5TFkVubF9H80LTQxaO+pQoS6PSs8hnwneqtybdbt9o7V59yluMGao
 IwaG/BY4jlY+76YnK32Kbr1eofUNYXBLg45mgSkrhUVqv9HLHe0CDnYAzYs68aur
 zG8a1rhA9IJtBmWwql2Gm8lKOnDFM6onvSmaYntsabJ++eahaNDpL5ck+R4kO8mJ
 pyI1N6GQDjtLdONk4NmH36P+CuCZje6KwRbVXvsJIBA1042tkZfU0zdqgt5QhFZj
 Ma9ZCn9wZFo=
 =MaM/
 -----END PGP SIGNATURE-----

Merge tag 'md/3.16' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Assorted md fixes for 3.16

  Mostly performance improvements with a few corner-case bug fixes"

* tag 'md/3.16' of git://neil.brown.name/md:
  raid5: speedup sync_request processing
  md/raid5: deadlock between retry_aligned_read with barrier io
  raid5: add an option to avoid copy data from bio to stripe cache
  md/bitmap: remove confusing code from filemap_get_page.
  raid5: avoid release list until last reference of the stripe
  md: md_clear_badblocks should return an error code on failure.
  md/raid56: Don't perform reads to support writes until stripe is ready.
  md: refuse to change shape of array if it is active but read-only
2014-06-11 08:33:41 -07:00
Eivind Sarto
053f5b6525 raid5: speedup sync_request processing
The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread.  The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
creates some more parallelism between the resync-thread and raid5 kernel daemon.

Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-06-10 11:02:01 +10:00
Linus Torvalds
3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
hui jiao
2844dc32ea md/raid5: deadlock between retry_aligned_read with barrier io
A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs,  the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0

Signed-off-by: NeilBrown <neilb@suse.de>
2014-06-05 17:18:19 +10:00
Mike Snitzer
11f0431be2 dm: remove symbol export for dm_set_device_limits
There is no need for code other than DM core to use dm_set_device_limits
so remove its EXPORT_SYMBOL_GPL.  Also, cleanup a couple whitespace nits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-06-04 09:46:34 -04:00
Mike Snitzer
7eee4ae2db dm: disable WRITE SAME if it fails
Add DM core support for disabling WRITE SAME on first failure to both
request-based and bio-based targets.  The need to disable WRITE SAME
stems from SCSI enabling it by default but then disabling it when it
fails.  When SCSI does this it returns "permanent target failure, do
not retry" using -EREMOTEIO.  Update DM core to only disable WRITE SAME
on failure if the returned error is -EREMOTEIO.

Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails")
implemented multipath specific disabling of WRITE SAME if it fails.
However, as that commit detailed, the multipath-only solution doesn't go
far enough if bio-based DM targets are stacked ontop of the
request-based dm-multipath target (as is commonly done using dm-linear
to support partitions on multipath devices, via kpartx).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Alex Chen <alex.chen@huawei.com>
2014-06-04 09:45:52 -04:00