IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Starting with commit c0820cf5ad095 ("dm: introduce per_bio_data"),
device mapper has the capability to pre-allocate a target-specific
structure with the bio.
This patch changes dm-delay to use this facility instead of a slab cache
and mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.
If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.
The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.
Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.
Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates form Matias
- a marking of static of functions in the stec pci-e driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
- if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+ if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+ KEY_DIRTY(k) != KEY_DIRTY(replace_key))
goto check_failed;
/* skip past gen */
at the beginning (schedule_timout_interuptible) and others
do his on their own
This prevents wrong load average calculation (load of 1 per thread)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-provisioning
and caching targets as a result of increased regression testing using
the device-mapper-test-suite (dmts). The most notable of these are the
reference counting fixes for the space map btree that is used by the
dm-array interface -- without these the dm-cache metadata will leak,
resulting in dm-cache devices running out of metadata blocks. Also,
some important fixes related to the thin-provisioning target's
transition to read-only mode on error.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax
B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW
aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr
7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK
pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg
/SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk=
=0Ix+
-----END PGP SIGNATURE-----
Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"A set of device-mapper fixes for 3.13.
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-
provisioning and caching targets as a result of increased regression
testing using the device-mapper-test-suite (dmts). The most notable
of these are the reference counting fixes for the space map btree that
is used by the dm-array interface -- without these the dm-cache
metadata will leak, resulting in dm-cache devices running out of
metadata blocks. Also, some important fixes related to the
thin-provisioning target's transition to read-only mode on error"
* tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm array: fix a reference counting bug in shadow_ablock
dm space map: disallow decrementing a reference count below zero
dm stats: initialize read-only module parameter
dm bufio: initialize read-only module parameters
dm cache: actually resize cache
dm cache: update Documentation for invalidate_cblocks's range syntax
dm cache policy mq: fix promotions to occur as expected
dm thin: allow pool in read-only mode to transition to read-write mode
dm thin: re-establish read-only state when switching to fail mode
dm thin: always fallback the pool mode if commit fails
dm thin: switch to read-only mode if metadata space is exhausted
dm thin: switch to read only mode if a mapping insert fails
dm space map metadata: return on failure in sm_metadata_new_block
dm table: fail dm_table_create on dm_round_up overflow
dm snapshot: avoid snapshot space leak on crash
dm delay: fix a possible deadlock due to shared workqueue
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
The old behaviour, returning -EINVAL if a ref_count of 0 would be
decremented, was removed in commit f722063 ("dm space map: optimise
sm_ll_dec and sm_ll_inc"). To fix this regression we return an error
code from the mutator function pointer passed to sm_ll_mutate() and have
dec_ref_count() return -EINVAL if the old ref_count is 0.
Add a DMERR to reflect the potential seriousness of this error.
Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path.
With this fix the following dmts regression test now passes:
dmtest run --suite cache -n /metadata_use_kernel/
The next patch fixes the higher-level dm-array code that exposed this
regression.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.
This patch performs the following renames.
* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above
Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.
This patch is strictly rename only and doesn't introduce any
functional difference.
- mic / gpio renames were missing. Spotted by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The module parameter stats_current_allocated_bytes in dm-mod is
read-only. This parameter informs the user about memory
consumption. It is not supposed to be changed by the user.
However, despite being read-only, this parameter can be set on
modprobe or insmod command line:
modprobe dm-mod stats_current_allocated_bytes=12345
The kernel doesn't expect that this variable can be non-zero at module
initialization and if the user sets it, it results in warning.
This patch initializes the variable in the module init routine, so
that user-supplied value is ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
Commit f494a9c6b1b6dd9a9f21bbb75d9210d478eeb498 ("dm cache: cache
shrinking support") broke cache resizing support.
dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op. But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized. Fix this by passing the new_size to dm_cache_resize().
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Micro benchmarks that repeatedly issued IO to a single block were
failing to cause a promotion from the origin device to the cache. Fix
this by not updating the stats during map() if -EWOULDBLOCK will be
returned.
The mq policy will only update stats, consider migration, etc, once per
tick period (a unit of time established between dm-cache core and the
policies).
When the IO thread calls the policy's map method, if it would like to
migrate the associated block it returns -EWOULDBLOCK, the IO then gets
handed over to a worker thread which handles the migration. The worker
thread calls map again, to check the migration is still needed (avoids a
race among other things). *BUT*, before this fix, if we were still in
the same tick period the stats were already updated by the previous map
call -- so the migration would no longer be requested.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A thin-pool may be in read-only mode because the pool's data or metadata
space was exhausted. To allow for recovery, by adding more space to the
pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
mode. Otherwise, running out of space will render the pool permanently
read-only.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
If the thin-pool transitioned to fail mode and the thin-pool's table
were reloaded for some reason: the new table's default pool mode would
be read-write, though it will transition to fail mode during resume.
When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
to re-establish the intermediate read-only state in both the metadata
and persistent-data block manager (as is usually done with the normal
pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Rename commit_or_fallback() to commit(). Now all previous calls to
commit() will trigger the pool mode to fallback if the commit fails.
Also, check the error returned from commit() in alloc_data_block().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode in alloc_data_block() if
dm_pool_alloc_data_block() fails because the pool's metadata space is
exhausted.
Differentiate between data and metadata space in messages about no
free space available.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
<snip ... these repeat for a _very_ long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: commit failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().
This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There is a possible leak of snapshot space in case of crash.
The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.
For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250
Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260
Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.
This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.
Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes the following sparse warning:
drivers/md/bcache/btree.c:2220:5: warning:
symbol 'btree_insert_fn' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
commit 566c09c53455d7c4f1 raid5: relieve lock contention in get_active_stripe()
modified the locking in get_active_stripe() reducing the range
protected by the (highly contended) device_lock.
Unfortunately it reduced the range too much opening up some races.
One race can occur if get_priority_stripe runs between the
test on sh->count and device_lock being taken.
This will mean that sh->lru is not empty while get_active_stripe
thinks ->count is zero resulting in a 'BUG' firing.
Another race happens if __release_stripe is called immediately
after sh->count is tested and found to be non-zero. If STRIPE_HANDLE
is not set, get_active_stripe should increment ->active_stripes
when it increments ->count from 0, but as it didn't think it was 0,
it doesn't.
Extending device_lock to cover the test on sh->count close these
races.
While we are here, fix the two BUG tests:
-If count is zero, then lru really must not be empty, or we've
lock the stripe_head somehow - no other tests are relevant.
-STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
testing it is pointless.
Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Fixes: 566c09c53455d7c4f1
Signed-off-by: NeilBrown <neilb@suse.de>
commit 7a0a5355cbc71efa md: Don't test all of mddev->flags at once.
made most tests on mddev->flags safer, but missed one.
When
commit 260fa034ef7a4ff8b7306 md: avoid deadlock when dirty buffers during md_stop.
added MD_STILL_CLOSED, this caused md_check_recovery to misbehave.
It can think there is something to do but find nothing. This can
lead to the md thread spinning during array shutdown.
https://bugzilla.kernel.org/show_bug.cgi?id=65721
Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 260fa034ef7a4ff8b7306
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: NeilBrown <neilb@suse.de>
In alloc_thread_groups, worker_groups is a pointer to an array,
not an array of pointers.
So
worker_groups[i]
is wrong. It should be
&(*worker_groups)[i]
Found-by: coverity
Fixes: 60aaf9338545
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.
Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.
(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
This is prep work for introducing a more general bio_split().
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Sage Weil <sage@inktank.com>
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.
Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Now that drivers have been converted to the new bvec_iter primitives,
there's no need to trim the bvec before we submit it; and we can't trim
it once we start sharing bvecs.
It used to be that passing a partially completed bio (i.e. one with
nonzero bi_idx) to generic_make_request() was a dangerous thing -
various drivers would choke on such things. But with immutable biovecs
and our new bio splitting that shares the biovecs, submitting partially
completed bios has to work (and should work, now that all the drivers
have been completed to the new primitives)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
We need to convert the dm code to the new bvec_iter primitives which
respect bi_bvec_done; they also allow us to drastically simplify dm's
bio splitting code.
Also, it's no longer necessary to save/restore the bvec array anymore -
driver conversions for immutable bvecs are done, so drivers should never
be modifying it.
Also kill bio_sector_offset(), dm was the only user and it doesn't make
much sense anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
bio_clone() just got more expensive - however, most users of bio_clone()
don't actually need to modify the biovec. If they aren't modifying the
biovec, and they can guarantee that the original bio isn't freed before
the clone (also true in most cases), we can just point the clone at the
original bio's biovec.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
When we start sharing biovecs, keeping bi_vcnt accurate for splits is
going to be error prone - and unnecessary, if we refactor some code.
So bio_segments() has to go - but most of the existing users just needed
to know if the bio had multiple segments, which is easier - add a
bio_multiple_segments() for them.
(Two of the current uses of bio_segments() are going to go away in a
couple patches, but the current implementation of bio_segments() is
unsafe as soon as we start doing driver conversions for immutable
biovecs - so implement a dumb version for bisectability, it'll go away
in a couple patches)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>