linux

iv/linux

Author	SHA1	Message	Date
Mike Snitzer	c46985e211	dm space map metadata: limit errors in sm_metadata_new_block The "unable to allocate new metadata block" error can be a particularly verbose error if there is a systemic issue with the metadata device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:11:46 -05:00
Mikulas Patocka	42065460ae	dm delay: use per-bio data instead of a mempool and slab cache Starting with commit c0820cf5ad095 ("dm: introduce per_bio_data"), device mapper has the capability to pre-allocate a target-specific structure with the bio. This patch changes dm-delay to use this facility instead of a slab cache and mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:11:45 -05:00
Mikulas Patocka	57a2f23856	dm table: remove unused buggy code that extends the targets array A device mapper table is allocated in the following way: * The function dm_table_create is called, it gets the number of targets as an argument -- it allocates a targets array accordingly. * For each target, we call dm_table_add_target. If we add more targets than were specified in dm_table_create, the function dm_table_add_target reallocates the targets array. However, this reallocation code is wrong - it moves the targets array to a new location, while some target constructors hold pointers to the array in the old location. The following DM target drivers save the pointer to the target structure, so they corrupt memory if the target array is moved: multipath, raid, mirror, snapshot, stripe, switch, thin, verity. Under normal circumstances, the reallocation function is not called (because dm_table_create is called with the correct number of targets), so the buggy reallocation code is not used. Prior to the fix "dm table: fail dm_table_create on dm_round_up overflow", the reallocation code could only be used in case the user specifies too large a value in param->target_count, such as 0xffffffff. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:11:44 -05:00
Joe Thornber	19fa1a6756	dm thin: fix discard support to a previously shared block If a snapshot is created and later deleted the origin dm_thin_device's snapshotted_time will have been updated to reflect the snapshot's creation time. The 'shared' flag in the dm_thin_lookup_result struct returned from dm_thin_find_block() is an approximation based on snapshotted_time -- this is done to avoid 0(n), or worse, time complexity. In this case, the shared flag would be true. But because the 'shared' flag reflects an approximation a block can be incorrectly assumed to be shared (e.g. false positive for 'shared' because the snapshot no longer exists). This could result in discards issued to a thin device not being passed down to the pool's underlying data device. To fix this we double check that a thin block is really still in-use after a mapping is removed using dm_pool_block_is_used(). If the reference count for a block is now zero the discard is allowed to be passed down. Also add a 'definitely_not_shared' member to the dm_thin_new_mapping structure -- reflects that the 'shared' flag in the response from dm_thin_find_block() can only be held as definitive if false is returned. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 10:11:43 -05:00
Mike Snitzer	16961b042d	dm thin: initialize dm_thin_new_mapping returned by get_next_mapping As additional members are added to the dm_thin_new_mapping structure care should be taken to make sure they get initialized before use. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 10:10:03 -05:00
Jens Axboe	b28bc9b38c	Linux 3.13-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU= =/4/R -----END PGP SIGNATURE----- Merge tag 'v3.13-rc6' into for-3.14/core Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c	2013-12-31 09:51:02 -07:00
Linus Torvalds	c5fdd531b5	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - fix for a memory leak on certain unplug events - a collection of bcache fixes from Kent and Nicolas - a few null_blk fixes and updates form Matias - a marking of static of functions in the stec pci-e driver * 'for-linus' of git://git.kernel.dk/linux-block: null_blk: support submit_queues on use_per_node_hctx null_blk: set use_per_node_hctx param to false null_blk: corrections to documentation null_blk: warning on ignored submit_queues param null_blk: refactor init and init errors code paths null_blk: documentation null_blk: mem garbage on NUMA systems during init drivers: block: Mark the functions as static in skd_main.c bcache: New writeback PD controller bcache: bugfix for race between moving_gc and bucket_invalidate bcache: fix for gc and writeback race bcache: bugfix - moving_gc now moves only correct buckets bcache: fix for gc crashing when no sectors are used bcache: Fix heap_peek() macro bcache: Fix for can_attach_cache() bcache: Fix dirty_data accounting bcache: Use uninterruptible sleep in writeback bcache: kthread don't set writeback task to INTERUPTIBLE block: fix memory leaks on unplugging block device bcache: fix sparse non static symbol warning	2013-12-24 10:06:03 -08:00
Greg Kroah-Hartman	5bd2010fbe	Merge 3.13-rc5 into staging-next We want these fixes here to handle some merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-12-24 09:43:21 -08:00
Jens Axboe	60e53a6701	Merge branch 'bcache-for-3.13' of git://evilpiepirate.org/~kent/linux-bcache into for-linus Kent writes: Jens - small pile of bcache fixes. I've been slacking on the writeback fixes but those definitely need to get into 3.13.	2013-12-17 12:54:03 -07:00
Kent Overstreet	16749c23c0	bcache: New writeback PD controller The old writeback PD controller could get into states where it had throttled all the way down and take way too long to recover - it was too complicated to really understand what it was doing. This rewrites a good chunk of it to hopefully be simpler and make more sense, and it also pays more attention to units which should make the behaviour a bit easier to understand. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:59 -08:00
Kent Overstreet	6d3d1a9c54	bcache: bugfix for race between moving_gc and bucket_invalidate There is a possibility for a bucket to be invalidated by the allocator while moving_gc was copying it's contents to another bucket, if the bucket only held cached data. To prevent this moving checks for a stale ptr (to an invalidated bucket), before and after reads. It it finds one, it simply ignores moving that data. This only affects bcache if the moving_gc was turned on, note that it's off by default. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	bf0a628a95	bcache: fix for gc and writeback race Garbage collector needs to check keys in the writeback keybuf to make sure it's not invalidating buckets to which the writeback keys point to. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	981aa8c091	bcache: bugfix - moving_gc now moves only correct buckets Removed gc_move_threshold because picking buckets only by threshold could lead moving extra buckets (ei. if there are buckets at the threshold that aren't supposed to be moved do to space considerations). This is replaced by a GC_MOVE bit in the gc_mark bitmask. Now only marked buckets get moved. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	bee63f40cb	bcache: fix for gc crashing when no sectors are used Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Nicholas Swenson	97d11a660f	bcache: Fix heap_peek() macro Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Nicholas Swenson	9eb8ebeb24	bcache: Fix for can_attach_cache() Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Kent Overstreet	d24a6e1087	bcache: Fix dirty_data accounting Dirty data accounting wasn't quite right - firstly, we were adding the key we're inserting after it could have merged with another dirty key already in the btree, and secondly we could sometimes pass the wrong offset to bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is important when tracking dirty data by stripe. NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes necessary that got squashed in with other patches; the full patch against 3.10 is 408cc2f47eeac93a, available at: git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10 diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 2a46036..4a12b2f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree b, struct bkey insert, if (KEY_START(k) > KEY_START(insert) + sectors_found) goto check_failed; - if (KEY_PTRS(replace_key) != KEY_PTRS(k)) + if (KEY_PTRS(k) != KEY_PTRS(replace_key) \|\| + KEY_DIRTY(k) != KEY_DIRTY(replace_key)) goto check_failed; /* skip past gen */	2013-12-16 14:22:16 -08:00
Kent Overstreet	ce2b3f595e	bcache: Use uninterruptible sleep in writeback We're just waiting on kthread_should_stop(), nothing else, so interruptible sleep was wrong here. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:04:57 -08:00
Stefan Priebe	f665c0f852	bcache: kthread don't set writeback task to INTERUPTIBLE at the beginning (schedule_timout_interuptible) and others do his on their own This prevents wrong load average calculation (load of 1 per thread) Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:04:57 -08:00
Linus Torvalds	93e1585e2c	A set of device-mapper fixes for 3.13. A fix for possible memory corruption during DM table load, fix a possible leak of snapshot space in case of a crash, fix a possible deadlock due to a shared workqueue in the delay target, fix to initialize read-only module parameters that are used to export metrics for dm stats and dm bufio. Quite a few stable fixes were identified for both the thin-provisioning and caching targets as a result of increased regression testing using the device-mapper-test-suite (dmts). The most notable of these are the reference counting fixes for the space map btree that is used by the dm-array interface -- without these the dm-cache metadata will leak, resulting in dm-cache devices running out of metadata blocks. Also, some important fixes related to the thin-provisioning target's transition to read-only mode on error. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.15 (GNU/Linux) iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr 7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg /SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk= =0Ix+ -----END PGP SIGNATURE----- Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "A set of device-mapper fixes for 3.13. A fix for possible memory corruption during DM table load, fix a possible leak of snapshot space in case of a crash, fix a possible deadlock due to a shared workqueue in the delay target, fix to initialize read-only module parameters that are used to export metrics for dm stats and dm bufio. Quite a few stable fixes were identified for both the thin- provisioning and caching targets as a result of increased regression testing using the device-mapper-test-suite (dmts). The most notable of these are the reference counting fixes for the space map btree that is used by the dm-array interface -- without these the dm-cache metadata will leak, resulting in dm-cache devices running out of metadata blocks. Also, some important fixes related to the thin-provisioning target's transition to read-only mode on error" * tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm array: fix a reference counting bug in shadow_ablock dm space map: disallow decrementing a reference count below zero dm stats: initialize read-only module parameter dm bufio: initialize read-only module parameters dm cache: actually resize cache dm cache: update Documentation for invalidate_cblocks's range syntax dm cache policy mq: fix promotions to occur as expected dm thin: allow pool in read-only mode to transition to read-write mode dm thin: re-establish read-only state when switching to fail mode dm thin: always fallback the pool mode if commit fails dm thin: switch to read-only mode if metadata space is exhausted dm thin: switch to read only mode if a mapping insert fails dm space map metadata: return on failure in sm_metadata_new_block dm table: fail dm_table_create on dm_round_up overflow dm snapshot: avoid snapshot space leak on crash dm delay: fix a possible deadlock due to shared workqueue	2013-12-13 13:22:22 -08:00
Joe Thornber	ed9571f0cf	dm array: fix a reference counting bug in shadow_ablock An old array block could have its reference count decremented below zero when it is being replaced in the btree by a new array block. The fix is to increment the old ablock's reference count just before inserting a new ablock into the btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.9+	2013-12-13 14:22:10 -05:00
Joe Thornber	5b564d80f8	dm space map: disallow decrementing a reference count below zero The old behaviour, returning -EINVAL if a ref_count of 0 would be decremented, was removed in commit f722063 ("dm space map: optimise sm_ll_dec and sm_ll_inc"). To fix this regression we return an error code from the mutator function pointer passed to sm_ll_mutate() and have dec_ref_count() return -EINVAL if the old ref_count is 0. Add a DMERR to reflect the potential seriousness of this error. Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path. With this fix the following dmts regression test now passes: dmtest run --suite cache -n /metadata_use_kernel/ The next patch fixes the higher-level dm-array code that exposed this regression. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.12+	2013-12-13 14:22:09 -05:00
Tejun Heo	324a56e16e	kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_elem_dir/kernfs_elem_dir/ * s/sysfs_elem_symlink/kernfs_elem_symlink/ * s/sysfs_elem_attr/kernfs_elem_file/ * s/sysfs_dirent/kernfs_node/ * s/sd/kn/ in kernfs proper * s/parent_sd/parent/ * s/target_sd/target/ * s/dir_sd/parent/ * s/to_sysfs_dirent()/rb_to_kn()/ * misc renames of local vars when they conflict with the above Because md, mic and gpio dig into sysfs details, this patch ends up modifying them. All are sysfs_dirent renames and trivial. While we can avoid these by introducing a dummy wrapping struct sysfs_dirent around kernfs_node, given the limited usage outside kernfs and sysfs proper, I don't think such workaround is called for. This patch is strictly rename only and doesn't introduce any functional difference. - mic / gpio renames were missing. Spotted by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-12-11 15:28:36 -08:00
Mikulas Patocka	76f5bee5c3	dm stats: initialize read-only module parameter The module parameter stats_current_allocated_bytes in dm-mod is read-only. This parameter informs the user about memory consumption. It is not supposed to be changed by the user. However, despite being read-only, this parameter can be set on modprobe or insmod command line: modprobe dm-mod stats_current_allocated_bytes=12345 The kernel doesn't expect that this variable can be non-zero at module initialization and if the user sets it, it results in warning. This patch initializes the variable in the module init routine, so that user-supplied value is ignored. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.12+	2013-12-10 19:13:21 -05:00
Mikulas Patocka	4cb57ab4a2	dm bufio: initialize read-only module parameters Some module parameters in dm-bufio are read-only. These parameters inform the user about memory consumption. They are not supposed to be changed by the user. However, despite being read-only, these parameters can be set on modprobe or insmod command line, for example: modprobe dm-bufio current_allocated_bytes=12345 The kernel doesn't expect that these variables can be non-zero at module initialization and if the user sets them, it results in BUG. This patch initializes the variables in the module init routine, so that user-supplied values are ignored. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.2+	2013-12-10 19:13:20 -05:00
Vincent Pelletier	088448007b	dm cache: actually resize cache Commit f494a9c6b1b6dd9a9f21bbb75d9210d478eeb498 ("dm cache: cache shrinking support") broke cache resizing support. dm_cache_resize() is called with cache->cache_size before it gets updated to new_size, so it is a no-op. But the dm-cache superblock is updated with the new_size even though the backing dm-array is not resized. Fix this by passing the new_size to dm_cache_resize(). Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2013-12-10 16:35:15 -05:00
Joe Thornber	af95e7a69b	dm cache policy mq: fix promotions to occur as expected Micro benchmarks that repeatedly issued IO to a single block were failing to cause a promotion from the origin device to the cache. Fix this by not updating the stats during map() if -EWOULDBLOCK will be returned. The mq policy will only update stats, consider migration, etc, once per tick period (a unit of time established between dm-cache core and the policies). When the IO thread calls the policy's map method, if it would like to migrate the associated block it returns -EWOULDBLOCK, the IO then gets handed over to a worker thread which handles the migration. The worker thread calls map again, to check the migration is still needed (avoids a race among other things). BUT, before this fix, if we were still in the same tick period the stats were already updated by the previous map call -- so the migration would no longer be requested. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2013-12-10 16:35:14 -05:00
Joe Thornber	9b7aaa64f9	dm thin: allow pool in read-only mode to transition to read-write mode A thin-pool may be in read-only mode because the pool's data or metadata space was exhausted. To allow for recovery, by adding more space to the pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE mode. Otherwise, running out of space will render the pool permanently read-only. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:13 -05:00
Joe Thornber	5383ef3a92	dm thin: re-establish read-only state when switching to fail mode If the thin-pool transitioned to fail mode and the thin-pool's table were reloaded for some reason: the new table's default pool mode would be read-write, though it will transition to fail mode during resume. When the pool mode transitions directly from PM_WRITE to PM_FAIL we need to re-establish the intermediate read-only state in both the metadata and persistent-data block manager (as is usually done with the normal pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:12 -05:00
Joe Thornber	020cc3b5e2	dm thin: always fallback the pool mode if commit fails Rename commit_or_fallback() to commit(). Now all previous calls to commit() will trigger the pool mode to fallback if the commit fails. Also, check the error returned from commit() in alloc_data_block(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:12 -05:00
Mike Snitzer	4a02b34e0c	dm thin: switch to read-only mode if metadata space is exhausted Switch the thin pool to read-only mode in alloc_data_block() if dm_pool_alloc_data_block() fails because the pool's metadata space is exhausted. Differentiate between data and metadata space in messages about no free space available. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed <snip ... these repeat for a _very_ long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: commit failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:04 -05:00
Joe Thornber	fafc7a815e	dm thin: switch to read only mode if a mapping insert fails Switch the thin pool to read-only mode when dm_thin_insert_block() fails since there is little reason to expect the cause of the failure to be resolved without further action by user space. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block <snip ... these repeat for a long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:29 -05:00
Mike Snitzer	f62b6b8f49	dm space map metadata: return on failure in sm_metadata_new_block Commit 2fc48021f4afdd109b9e52b6eef5db89ca80bac7 ("dm persistent metadata: add space map threshold callback") introduced a regression to the metadata block allocation path that resulted in errors being ignored. This regression was uncovered by running the following device-mapper-test-suite test: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The ignored error codes in sm_metadata_new_block() could crash the kernel through use of either the dm-thin or dm-cache targets, e.g.: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block general protection fault: 0000 [#1] SMP ... Workqueue: dm-thin do_worker [dm_thin_pool] task: ffff880035ce2ab0 ti: ffff88021a054000 task.ti: ffff88021a054000 RIP: 0010:[<ffffffffa0331385>] [<ffffffffa0331385>] metadata_ll_load_ie+0x15/0x30 [dm_persistent_data] RSP: 0018:ffff88021a055a68 EFLAGS: 00010202 RAX: 003fc8243d212ba0 RBX: ffff88021a780070 RCX: ffff88021a055a78 RDX: ffff88021a055a78 RSI: 0040402222a92a80 RDI: ffff88021a780070 RBP: ffff88021a055a68 R08: ffff88021a055ba4 R09: 0000000000000010 R10: 0000000000000000 R11: 00000002a02e1000 R12: ffff88021a055ad4 R13: 0000000000000598 R14: ffffffffa0338470 R15: ffff88021a055ba4 FS: 0000000000000000(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f467c0291b8 CR3: 0000000001a0b000 CR4: 00000000000007e0 Stack: ffff88021a055ab8 ffffffffa0332020 ffff88021a055b30 0000000000000001 ffff88021a055b30 0000000000000000 ffff88021a055b18 0000000000000000 ffff88021a055ba4 ffff88021a055b98 ffff88021a055ae8 ffffffffa033304c Call Trace: [<ffffffffa0332020>] sm_ll_lookup_bitmap+0x40/0xa0 [dm_persistent_data] [<ffffffffa033304c>] sm_metadata_count_is_more_than_one+0x8c/0xc0 [dm_persistent_data] [<ffffffffa0333825>] dm_tm_shadow_block+0x65/0x110 [dm_persistent_data] [<ffffffffa0331b00>] sm_ll_mutate+0x80/0x300 [dm_persistent_data] [<ffffffffa0330e60>] ? set_ref_count+0x10/0x10 [dm_persistent_data] [<ffffffffa0331dba>] sm_ll_inc+0x1a/0x20 [dm_persistent_data] [<ffffffffa0332270>] sm_disk_new_block+0x60/0x80 [dm_persistent_data] [<ffffffff81520036>] ? down_write+0x16/0x40 [<ffffffffa001e5c4>] dm_pool_alloc_data_block+0x54/0x80 [dm_thin_pool] [<ffffffffa001b23c>] alloc_data_block+0x9c/0x130 [dm_thin_pool] [<ffffffffa001c27e>] provision_block+0x4e/0x180 [dm_thin_pool] [<ffffffffa001fe9a>] ? dm_thin_find_block+0x6a/0x110 [dm_thin_pool] [<ffffffffa001c57a>] process_bio+0x1ca/0x1f0 [dm_thin_pool] [<ffffffff8111e2ed>] ? mempool_free+0x8d/0xa0 [<ffffffffa001d755>] process_deferred_bios+0xc5/0x230 [dm_thin_pool] [<ffffffffa001d911>] do_worker+0x51/0x60 [dm_thin_pool] [<ffffffff81067872>] process_one_work+0x182/0x3b0 [<ffffffff81068c90>] worker_thread+0x120/0x3a0 [<ffffffff81068b70>] ? manage_workers+0x160/0x160 [<ffffffff8106eb2e>] kthread+0xce/0xe0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # v3.10+	2013-12-10 16:34:28 -05:00
Mikulas Patocka	5b2d06576c	dm table: fail dm_table_create on dm_round_up overflow The dm_round_up function may overflow to zero. In this case, dm_table_create() must fail rather than go on to allocate an empty array with alloc_targets(). This fixes a possible memory corruption that could be caused by passing too large a number in "param->target_count". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:27 -05:00
Mikulas Patocka	230c83afdd	dm snapshot: avoid snapshot space leak on crash There is a possible leak of snapshot space in case of crash. The reason for space leaking is that chunks in the snapshot device are allocated sequentially, but they are finished (and stored in the metadata) out of order, depending on the order in which copying finished. For example, supposed that the metadata contains the following records SUPERBLOCK METADATA (blocks 0 ... 250) DATA 0 DATA 1 DATA 2 ... DATA 250 Now suppose that you allocate 10 new data blocks 251-260. Suppose that copying of these blocks finish out of order (block 260 finished first and the block 251 finished last). Now, the snapshot device looks like this: SUPERBLOCK METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256) DATA 0 DATA 1 DATA 2 ... DATA 250 DATA 251 DATA 252 DATA 253 DATA 254 DATA 255 METADATA (blocks 255, 254, 253, 252, 251) DATA 256 DATA 257 DATA 258 DATA 259 DATA 260 Now, if the machine crashes after writing the first metadata block but before writing the second metadata block, the space for areas DATA 250-255 is leaked, it contains no valid data and it will never be used in the future. This patch makes dm-snapshot complete exceptions in the same order they were allocated, thus fixing this bug. Note: when backporting this patch to the stable kernel, change the version field in the following way: * if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0} * if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it to {1, 10, 2} Userspace reads the version to determine if the bug was fixed, so the version change is needed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:25 -05:00
Linus Torvalds	5ee540613d	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. It contains: - A fix for a use-after-free of a request in blk-mq. From Ming Lei - A fix for a blk-mq bug that could attempt to dereference a NULL rq if allocation failed - Two xen-blkfront small fixes - Cleanup of submit_bio_wait() type uses in the kernel, unifying that. From Kent - A fix for 32-bit blkg_rwstat reading. I apologize for this one looking mangled in the shortlog, it's entirely my fault for missing an empty line between the description and body of the text" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix use-after-free of request blk-mq: fix dereference of rq->mq_ctx if allocation fails block: xen-blkfront: Fix possible NULL ptr dereference xen-blkfront: Silence pfn maybe-uninitialized warning block: submit_bio_wait() conversions Update of blkg_stat and blkg_rwstat may happen in bh context	2013-12-05 15:33:27 -08:00
Mike Snitzer	8d30726912	dm cache: increment bi_remaining when bi_end_io is restored Move the bio->bi_remaining increment into dm_unhook_bio() so the overwrite_endio() handler works as expected. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-03 19:16:04 -07:00
Wei Yongjun	08239ca2a0	bcache: fix sparse non static symbol warning Fixes the following sparse warning: drivers/md/bcache/btree.c:2220:5: warning: symbol 'btree_insert_fn' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-11-28 17:05:58 -08:00
NeilBrown	6d183de407	md/raid5: fix newly-broken locking in get_active_stripe. commit 566c09c53455d7c4f1 raid5: relieve lock contention in get_active_stripe() modified the locking in get_active_stripe() reducing the range protected by the (highly contended) device_lock. Unfortunately it reduced the range too much opening up some races. One race can occur if get_priority_stripe runs between the test on sh->count and device_lock being taken. This will mean that sh->lru is not empty while get_active_stripe thinks ->count is zero resulting in a 'BUG' firing. Another race happens if __release_stripe is called immediately after sh->count is tested and found to be non-zero. If STRIPE_HANDLE is not set, get_active_stripe should increment ->active_stripes when it increments ->count from 0, but as it didn't think it was 0, it doesn't. Extending device_lock to cover the test on sh->count close these races. While we are here, fix the two BUG tests: -If count is zero, then lru really must not be empty, or we've lock the stripe_head somehow - no other tests are relevant. -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so testing it is pointless. Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com> Reviewed-by: Shaohua Li <shli@kernel.org> Fixes: 566c09c53455d7c4f1 Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:15 +11:00
NeilBrown	142d44c310	md: test mddev->flags more safely in md_check_recovery. commit 7a0a5355cbc71efa md: Don't test all of mddev->flags at once. made most tests on mddev->flags safer, but missed one. When commit 260fa034ef7a4ff8b7306 md: avoid deadlock when dirty buffers during md_stop. added MD_STILL_CLOSED, this caused md_check_recovery to misbehave. It can think there is something to do but find nothing. This can lead to the md thread spinning during array shutdown. https://bugzilla.kernel.org/show_bug.cgi?id=65721 Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com> Fixes: 260fa034ef7a4ff8b7306 Cc: stable@vger.kernel.org (3.12) Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:08 +11:00
NeilBrown	0c775d5208	md/raid5: fix new memory-reference bug in alloc_thread_groups. In alloc_thread_groups, worker_groups is a pointer to an array, not an array of pointers. So worker_groups[i] is wrong. It should be &(*worker_groups)[i] Found-by: coverity Fixes: 60aaf9338545 Reported-by: Ben Hutchings <bhutchings@solarflare.com> Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:04 +11:00
Kent Overstreet	c170bbb45f	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-24 16:33:41 -07:00
Kent Overstreet	20d0189b10	block: Introduce new bio_split() The new bio_split() can split arbitrary bios - it's not restricted to single page bios, like the old bio_split() (previously renamed to bio_pair_split()). It also has different semantics - it doesn't allocate a struct bio_pair, leaving it up to the caller to handle completions. Then convert the existing bio_pair_split() users to the new bio_split() - and also nvme, which was open coding bio splitting. (We have to take that BUG_ON() out of bio_integrity_trim() because this bio_split() needs to use it, and there's no reason it has to be used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly documented semantics anyways.) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Neil Brown <neilb@suse.de>	2013-11-23 22:33:57 -08:00
Kent Overstreet	ee67891bf1	block: Rename bio_split() -> bio_pair_split() This is prep work for introducing a more general bio_split(). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Sage Weil <sage@inktank.com>	2013-11-23 22:33:56 -08:00
Kent Overstreet	196d38bccf	block: Generic bio chaining This adds a generic mechanism for chaining bio completions. This is going to be used for a bio_split() replacement, and it turns out to be very useful in a fair amount of driver code - a fair number of drivers were implementing this in their own roundabout ways, often painfully. Note that this means it's no longer to call bio_endio() more than once on the same bio! This can cause problems for drivers that save/restore bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all - in all but the simplest cases they'd be better off just cloning the bio, and immutable biovecs is making bio cloning cheaper. But for now, we add a bio_endio_nodec() for these cases. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-11-23 22:33:56 -08:00
Kent Overstreet	e90abc8ec3	block: Remove bi_idx hacks Now that drivers have been converted to the new bvec_iter primitives, there's no need to trim the bvec before we submit it; and we can't trim it once we start sharing bvecs. It used to be that passing a partially completed bio (i.e. one with nonzero bi_idx) to generic_make_request() was a dangerous thing - various drivers would choke on such things. But with immutable biovecs and our new bio splitting that shares the biovecs, submitting partially completed bios has to work (and should work, now that all the drivers have been completed to the new primitives) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-11-23 22:33:55 -08:00
Kent Overstreet	1c3b13e64c	dm: Refactor for new bio cloning/splitting We need to convert the dm code to the new bvec_iter primitives which respect bi_bvec_done; they also allow us to drastically simplify dm's bio splitting code. Also, it's no longer necessary to save/restore the bvec array anymore - driver conversions for immutable bvecs are done, so drivers should never be modifying it. Also kill bio_sector_offset(), dm was the only user and it doesn't make much sense anymore. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Reviewed-by: Mike Snitzer <snitzer@redhat.com>	2013-11-23 22:33:55 -08:00
Kent Overstreet	59d276fe02	block: Add bio_clone_fast() bio_clone() just got more expensive - however, most users of bio_clone() don't actually need to modify the biovec. If they aren't modifying the biovec, and they can guarantee that the original bio isn't freed before the clone (also true in most cases), we can just point the clone at the original bio's biovec. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-11-23 22:33:54 -08:00
Kent Overstreet	003b5c5719	block: Convert drivers to immutable biovecs Now that we've got a mechanism for immutable biovecs - bi_iter.bi_bvec_done - we need to convert drivers to use primitives that respect it instead of using the bvec array directly. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com	2013-11-23 22:33:51 -08:00
Kent Overstreet	458b76ed2f	block: Kill bio_segments()/bi_vcnt usage When we start sharing biovecs, keeping bi_vcnt accurate for splits is going to be error prone - and unnecessary, if we refactor some code. So bio_segments() has to go - but most of the existing users just needed to know if the bio had multiple segments, which is easier - add a bio_multiple_segments() for them. (Two of the current uses of bio_segments() are going to go away in a couple patches, but the current implementation of bio_segments() is unsafe as soon as we start doing driver conversions for immutable biovecs - so implement a dumb version for bisectability, it'll go away in a couple patches) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com>	2013-11-23 22:33:51 -08:00

... 5 6 7 8 9 ...

3353 Commits