Commit Graph

142 Commits

Author SHA1 Message Date
Hou Tao
b6e58b5466 dm btree remove: assign new_root only when removal succeeds
remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().

For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:

  general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
  CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
  Hardware name: QEMU Standard PC
  RIP: 0010:metadata_ll_load_ie+0x14/0x30
  Call Trace:
   sm_metadata_count_is_more_than_one+0xb9/0xe0
   dm_tm_shadow_block+0x52/0x1c0
   shadow_step+0x59/0xf0
   remove_raw+0xb2/0x170
   dm_btree_remove+0xf4/0x1c0
   dm_pool_delete_thin_device+0xc3/0x140
   pool_message+0x218/0x2b0
   target_message+0x251/0x290
   ctl_ioctl+0x1c4/0x4d0
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0x7b/0xb0
   do_syscall_64+0x40/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixing it by only assign new_root when removal succeeds

Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-25 15:25:24 -04:00
Joe Thornber
6b06dd5a97 dm space map disk: cache a small number of index entries
The disk space map stores it's index entries in a btree, these are
accessed very frequently, so having a few cached makes a big difference
to performance.

With this change provisioning a new block takes roughly 20% less cpu.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:23 -04:00
Joe Thornber
be500ed721 dm space maps: improve performance with inc/dec on ranges of blocks
When we break sharing on btree nodes we typically need to increment
the reference counts to every value held in the node.  This can
cause a lot of repeated calls to the space maps.  Fix this by changing
the interface to the space map inc/dec methods to take ranges of
adjacent blocks to be operated on.

For installations that are using a lot of snapshots this will reduce
cpu overhead of fundamental operations such as provisioning a new block,
or deleting a snapshot, by as much as 10 times.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:22 -04:00
Joe Thornber
5faafc77f7 dm space maps: don't reset space map allocation cursor when committing
Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device.  There are a couple
of repercussions to this:

- The first allocation after the commit is likely to take longer than
  normal as it searches for a free block in an area that is likely to
  have very few free blocks (if any).

- Any free blocks it finds will have been recently freed.  Reusing them
  means we have fewer old copies of the metadata to aid recovery from
  hardware error.

Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:21 -04:00
Joe Thornber
4eafdb1515 dm btree: improve btree residency
This commit improves the residency of btrees built in the metadata for
dm-thin and dm-cache.

When inserting a new entry into a full btree node the current code
splits the node into two.  This can result in very many half full nodes,
particularly if the insertions are occurring in an ascending order (as
happens in dm-thin with large writes).

With this commit, when we insert into a full node we first try and move
some entries to a neighbouring node that has space, failing that it
tries to split two neighbouring nodes into three.

Results are given below.  'Residency' is how full nodes are on average
as a percentage.  Average instruction counts for the operations
are given to show the extra processing has little overhead.

                         +--------------------------+--------------------------+
                         |         Before           |         After            |
+------------+-----------+-----------+--------------+-----------+--------------+
|    Test    |   Phase   | Residency | Instructions | Residency | Instructions |
+------------+-----------+-----------+--------------+-----------+--------------+
| Ascending  | insert    |        50 |         1876 |        96 |         1930 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Descending | insert    |        50 |         3024 |        96 |         3181 |
|            | overwrite |        50 |         1789 |        96 |         1746 |
|            | lookup    |        50 |          778 |        96 |          778 |
| Random     | insert    |        68 |         3800 |        84 |         3736 |
|            | overwrite |        68 |         4254 |        84 |         3911 |
|            | lookup    |        68 |          779 |        84 |          779 |
| Runs       | insert    |        63 |         2546 |        82 |         2815 |
|            | overwrite |        63 |         2013 |        82 |         1986 |
|            | lookup    |        63 |          778 |        82 |          779 |
+------------+-----------+-----------+--------------+-----------+--------------+

   Ascending - keys are inserted in ascending order.
   Descending - keys are inserted in descending order.
   Random - keys are inserted in random order.
   Runs - keys are split into ascending runs of ~20 length.  Then
          the runs are shuffled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04 12:07:20 -04:00
Joe Thornber
5208692e80 dm space map common: fix division bug in sm_ll_find_free_block()
This division bug meant the search for free metadata space could skip
the final allocation bitmap's worth of entries. Fix affects DM thinp,
cache and era targets.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 12:48:13 -04:00
Joe Thornber
a88b2358f1 dm persistent data: packed struct should have an aligned() attribute too
Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
byte-by-byte access.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 12:48:12 -04:00
Joe Thornber
f73e2e70ec dm btree spine: remove paranoid node_check call in node_prep_for_write()
Remove this extra BUG_ON() that calls node_check() -- which avoids extra crc checking.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 12:47:57 -04:00
Joe Thornber
d6db294fd8 dm space map disk: remove redundant calls to sm_disk_get_nr_free()
Both sm_disk_new_block and sm_disk_commit are needlessly calling
sm_disk_get_nr_free(). Looks like old queries used for some
debugging.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19 12:36:46 -04:00
Jiapeng Chong
ece2577388 dm persistent data: remove unused return from exit_shadow_spine()
Fix the following coccicheck warnings:

./drivers/md/persistent-data/dm-btree-spine.c:188:5-6: Unneeded
variable: "r". Return "0" on line 194.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-03-26 14:53:42 -04:00
Jinoh Kang
4c9e9883c2 dm persistent data: fix return type of shadow_root()
shadow_root() truncates 64-bit dm_block_t into 32-bit int.  This is
not an issue in practice, since dm metadata as of v5.11 can only hold at
most 4161600 blocks (255 index entries * ~16k metadata blocks).

Nevertheless, this can confuse users debugging some specific data
corruption scenarios.  Also, DM_SM_METADATA_MAX_BLOCKS may be bumped in
the future, or persistent-data may find its use in other places.

Therefore, switch the return type of shadow_root from int to dm_block_t.

Signed-off-by: Jinoh Kang <jinoh.kang.kr@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-02-03 10:10:05 -05:00
Huaisheng Ye
399c9bdbd6 dm thin metadata: Remove unused local variable when create thin and snap
The local variable disk details is not used during the creating of thin & snap
devices. Remove them from dm-thin-metadata, and add pointer validity check for
pointer value in btree_lookup_raw. Skip memory copy when the caller doesn't need
the value.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:11 -04:00
Ye Bin
3a653b205f dm thin metadata: Fix use-after-free in dm_bm_set_read_only
The following error ocurred when testing disk online/offline:

[  301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
[  301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
[  301.849206] Aborting journal on device dm-26-8.
[  301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
[  301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
[  301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]

Reason is:

 metadata_operation_failed
    abort_transaction
        dm_pool_abort_metadata
	    __create_persistent_data_objects
	        r = __open_or_format_metadata
	        if (r) --> If failed will free pmd->bm but pmd->bm not set NULL
		    dm_block_manager_destroy(pmd->bm);
    set_pool_mode
	dm_pool_metadata_read_only(pool->pmd);
	dm_bm_set_read_only(pmd->bm);  --> use-after-free

Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
dm_bm_set_read_write functions.  If bm is NULL it means creating the
bm failed and so dm_bm_is_read_only must return true.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02 13:38:40 -04:00
Masahiro Yamada
a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Gustavo A. R. Silva
b18ae8dd9d dm: replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:44 -04:00
Zhiqiang Liu
9431cf6efc dm persistent data: switch exit_ro_spine to return void
In commit 4c7da06f5a ("dm persistent data: eliminate unnecessary
return values"), r value in exit_ro_spine will not change, so
exit_ro_spine doesn't need a return value.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Joe Thornber
4feaef830d dm space map common: fix to ensure new block isn't already in use
The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets.  There are variants for
tracking metadata blocks and data blocks.

Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).

This fix keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free.  Much simpler.

This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.

Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:15:53 -05:00
Hou Tao
474e559567 dm btree: increase rebalance threshold in __rebalance2()
We got the following warnings from thin_check during thin-pool setup:

  $ thin_check /dev/vdb
  examining superblock
  examining devices tree
    missing devices: [1, 84]
      too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
  examining mapping tree

The phenomenon is the number of entries in one node of details_info tree is
less than (max_entries / 3). And it can be easily reproduced by the following
procedures:

  $ new a thin pool
  $ presume the max entries of details_info tree is 126
  $ new 127 thin devices (e.g. 1~127) to make the root node being full
    and then split
  $ remove the first 43 (e.g. 1~43) thin devices to make the children
    reblance repeatedly
  $ stop the thin pool
  $ thin_check

The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariance: the minimal number of entries in
non-root node should be >= (max_entries / 3).

Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 15:27:52 -05:00
ZhangXiaoxu
c1499a044d dm space map common: remove check for impossible sm_find_free() return value
The function sm_find_free() just returns -ENOSPC and 0.
So remove lone caller's check for some other error.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 15:39:53 -04:00
ZhangXiaoxu
ae148243d3 dm space map metadata: fix missing store of apply_bops() return value
In commit 6096d91af0 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'.  But when that logic was replaced in out() the
return value was not stored.  This may lead out() returning a wrong
value to the caller.

Fixes: 6096d91af0 ("dm space map metadata: fix occasional leak of a metadata block on resize")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:24 -04:00
ZhangXiaoxu
e4f9d60138 dm btree: fix order of block initialization in btree_split_beneath
When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right.  If right block's allocation
failed, the left block will be unlocked and marked dirty.  If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block.  Upon return, when flushing the left block to disk, the
validator will fail when check this block.  Then a BUG_ON is raised.

Fix this by completely initializing the left block before allocating and
initializing the right block.

Fixes: 4dcb8b57df ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:23 -04:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Linus Torvalds
311f71281f - Improve DM snapshot target's scalability by using finer grained
locking.  Requires some list_bl interface improvements.
 
 - Add ability for DM integrity to use a bitmap mode, that tracks regions
   where data and metadata are out of sync, instead of using a journal.
 
 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used.  This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering has
   proven a very common configuration mistake)
 
 - Fix missing call to path selector type->end_io in DM multipath.  This
   fixes reported performance problems due to inaccurate path selector IO
   accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).
 
 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first cache
   block was discarded (thankfully in practice the first cache block is
   generally in use; be it FS superblock, partition table, disk label,
   etc).
 
 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.
 
 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.
 
 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.
 
 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzdcCgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWsxZB/9idHl8LmwwL1JzBfi/XX7bWxwqDQLo
 j1b3ycQ14AKVau4VCkmgDuRIfMDuU6PIAVvsMeVbF3aCE0fZ7zbEV1qHefbtJuCL
 MMm//KbrhIT8oMKYUWtlOj7XI9MT6ErFzfActBZ6UF6r21m1N3bohhVGN7kvCnJm
 wgmSlnz/m2GLKK8gQx+OisnAh0nlje3PIdIYPu7uWN6t0FF2XRz3UwWTuyw7lYhC
 Rx2J+sOIL02CtadhHKLMCG8OutRXWP01cBSohUVJIMGihWfbe6aqvhG5afbqb4bG
 UQrXl477ry5zyQ4fAU2JKZ+8qFvc1FoLLknKrZQu+uYPRokUPw/AwiL7
 =mOH3
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve DM snapshot target's scalability by using finer grained
   locking. Requires some list_bl interface improvements.

 - Add ability for DM integrity to use a bitmap mode, that tracks
   regions where data and metadata are out of sync, instead of using a
   journal.

 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used. This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering
   has proven a very common configuration mistake)

 - Fix missing call to path selector type->end_io in DM multipath. This
   fixes reported performance problems due to inaccurate path selector
   IO accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).

 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first
   cache block was discarded (thankfully in practice the first cache
   block is generally in use; be it FS superblock, partition table, disk
   label, etc).

 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.

 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.

 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.

 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.

* tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (42 commits)
  dm: fix a couple brace coding style issues
  dm crypt: print device name in integrity error message
  dm crypt: move detailed message into debug level
  dm ioctl: fix hang in early create error condition
  dm integrity: whitespace, coding style and dead code cleanup
  dm integrity: implement synchronous mode for reboot handling
  dm integrity: handle machine reboot in bitmap mode
  dm integrity: add a bitmap mode
  dm integrity: introduce a function add_new_range_and_wait()
  dm integrity: allow large ranges to be described
  dm ingerity: pass size to dm_integrity_alloc_page_list()
  dm integrity: introduce rw_journal_sectors()
  dm integrity: update documentation
  dm integrity: don't report unused options
  dm integrity: don't check null pointer before kvfree and vfree
  dm integrity: correctly calculate the size of metadata area
  dm dust: Make dm_dust_init and dm_dust_exit static
  dm dust: remove redundant unsigned comparison to less than zero
  dm mpath: always free attached_handler_name in parse_path()
  dm init: fix max devices/targets checks
  ...
2019-05-16 15:55:48 -07:00
Thomas Gleixner
be9c52ed84 dm persistent data: Simplify stack trace handling
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface. This results in less storage space and
indirection.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.533968922@linutronix.de
2019-04-29 12:37:52 +02:00
Mike Snitzer
c6e086e0c9 dm space map common: zero entire ll_disk
Otherwise, memory that is allocated (and potentially not previously
zeroed) will get written to disk as part of the space maps.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:32 -04:00
Chengguang Xu
5941c621dc dm block manager: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:48 -05:00
Andy Shevchenko
5cc9cdf631 dm: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand DM bitmap API is special case.
Adding 'dm' prefix to it to avoid potential name space collision.

No functional changes intended.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:38 -07:00
Mikulas Patocka
afa53df869 dm bufio: move dm-bufio.h to include/linux/
Move dm-bufio.h to include/linux/ so that external GPL'd DM target
modules can use it.

It is better to allow the use of dm-bufio than force external modules
to implement the equivalent buffered IO mechanism in some new way.  The
hope is this will encourage the use of dm-bufio; which will then make it
easier for a GPL'd external DM target module to be included upstream.

A couple dm-bufio EXPORT_SYMBOL exports have also been updated to use
EXPORT_SYMBOL_GPL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:23 -04:00
Joe Thornber
bc68d0a435 dm btree: fix serious bug in btree_split_beneath()
When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistibutes the root nodes entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period; until another low valued key is inserted that will uncover
it.

This is a serious bug, and quite hard to make trigger in normal use.  A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provision-tools project:
  https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
neccessary adjustments.

Cc: stable@vger.kernel.org
Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:07:55 -05:00
Linus Torvalds
b91593fa85 - A few conversions from atomic_t to ref_count_t
- A DM core fix for a race during device destruction that could result
   in a BUG_ON.
 
 - A stable@ fix for a DM cache race condition that could lead to data
   corruption when operating in writeback mode (writethrough is default)
 
 - Various DM cache cleanups and improvements
 
 - Add DAX support to the DM log-writes target
 
 - A fix for the DM zoned target's ability to deal with the last zone of
   the drive being smaller than all others.
 
 - A stable@ DM crypt and DM integrity fix for a negative check that was
   to restrictive (prevented slab debug with XFS ontop of DM crypt from
   working).
 
 - A DM raid target fix for a panic that can occur when forcing a raid to
   sync.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaCdOnAAoJEMUj8QotnQNaEYIIANZ2wyrvrJ/6xeOu2qNII07o
 FYnvVvm0D4rDnNVgYbf/FHWRkFYzeNPkKH6Kp38XC+Ag5xeLjkepQG/ivxXrp9eg
 2t6rjUDnUdjgqIQlmysbla+DgphampTVlPMpnafxKiSLItSjf+2tu1mLqtITVjT1
 mo81ZRbKRSYBPvaUzHWUJ910ap+WPCpwTpO98uPQE1wogLEKTAf90U2hfsy51Gd6
 4xStLahdiiGst7zs67uWG5l6g3kR3RnfNVN38oERrq67oxG4GAU1xUPRwlCnJmbx
 waDhlhVjguVDFJh/HYAyBIVls38iGrroox70MmtpmitDYnMs8twrgWcsI6Ozo1c=
 =ZfYD
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - a few conversions from atomic_t to ref_count_t

 - a DM core fix for a race during device destruction that could result
   in a BUG_ON

 - a stable@ fix for a DM cache race condition that could lead to data
   corruption when operating in writeback mode (writethrough is default)

 - various DM cache cleanups and improvements

 - add DAX support to the DM log-writes target

 - a fix for the DM zoned target's ability to deal with the last zone of
   the drive being smaller than all others

 - a stable@ DM crypt and DM integrity fix for a negative check that was
   to restrictive (prevented slab debug with XFS ontop of DM crypt from
   working)

 - a DM raid target fix for a panic that can occur when forcing a raid
   to sync

* tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (25 commits)
  dm cache: lift common migration preparation code to alloc_migration()
  dm cache: remove usused deferred_cells member from struct cache
  dm cache policy smq: allocate cache blocks in order
  dm cache policy smq: change max background work from 10240 to 4096 blocks
  dm cache background tracker: limit amount of background work that may be issued at once
  dm cache policy smq: take origin idle status into account when queuing writebacks
  dm cache policy smq: handle races with queuing background_work
  dm raid: fix panic when attempting to force a raid to sync
  dm integrity: allow unaligned bv_offset
  dm crypt: allow unaligned bv_offset
  dm: small cleanup in dm_get_md()
  dm: fix race between dm_get_from_kobject() and __dm_destroy()
  dm: allocate struct mapped_device with kvzalloc
  dm zoned: ignore last smaller runt zone
  dm space map metadata: use ARRAY_SIZE
  dm log writes: add support for DAX
  dm log writes: add support for inline data buffers
  dm cache: simplify get_per_bio_data() by removing data_size argument
  dm cache: remove all obsolete writethrough-specific code
  dm cache: submit writethrough writes in parallel to origin and cache
  ...
2017-11-14 15:50:56 -08:00
Jérémy Lefaure
fbc61291d7 dm space map metadata: use ARRAY_SIZE
Using the ARRAY_SIZE macro improves the readability of the code.

Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
 (sizeof(E)@p /sizeof(*E))
|
 (sizeof(E)@p /sizeof(E[...]))
|
 (sizeof(E)@p /sizeof(T))
)

Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-11-10 15:44:52 -05:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Joe Thornber
0377a07c7a dm space map disk: fix some book keeping in the disk space map
When decrementing the reference count for a block, the free count wasn't
being updated if the reference count went to zero.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:50 -04:00
Bart Van Assche
73cbca6a63 dm block manager: remove an unused argument from dm_block_manager_create()
The 'cache_size' argument of dm_block_manager_create() has never been
used.  Remove it along with the definitions of the constants passed as
the 'cache_size' argument.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:41 -04:00
Vinothkumar Raja
7d1fedb6e9 dm btree: fix for dm_btree_find_lowest_key()
dm_btree_find_lowest_key() is giving incorrect results.  find_key()
traverses the btree correctly for finding the highest key, but there is
an error in the way it traverses the btree for retrieving the lowest
key.  dm_btree_find_lowest_key() fetches the first key of the rightmost
block of the btree instead of fetching the first key from the leftmost
block.

Fix this by conditionally passing the correct parameter to value64()
based on the @find_highest flag.

Cc: stable@vger.kernel.org
Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu>
Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu>
Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 14:47:49 -04:00
Ingo Molnar
0881e7bd34 sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h>
But first update usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Linus Torvalds
7a771ceac7 - Fix dm-raid transient device failure processing and other smaller
tweaks.
 
 - Add journal support to the DM raid target to close the 'write hole' on
   raid 4/5/6.
 
 - Fix dm-cache corruption, due to rounding bug, when cache exceeds 2TB.
 
 - Add 'metadata2' feature to dm-cache to separate the dirty bitset out
   from other cache metadata.  This improves speed of shutting down
   a large cache device (which implies writing out dirty bits).
 
 - Fix a memory leak during dm-stats data structure destruction.
 
 - Fix a DM multipath round-robin path selector performance regression
   that was caused by less precise balancing across all paths.
 
 - Lastly, introduce a DM core fix for a long-standing DM snapshot
   deadlock that is rooted in the complexity of the device stack used in
   conjunction with block core maintaining bios on current->bio_list to
   manage recursion in generic_make_request().  A more comprehensive fix
   to block core (and its hook in the cpu scheduler) would be wonderful
   but this DM-specific fix is pragmatic considering how difficult it has
   been to make progress on a generic fix.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJYrJJeAAoJEMUj8QotnQNaDskIAIJeMX3Dc8Skt00tZ6vEj3p6
 9juDpOrBKH3RYdqPmrYy9lVhhpFs6OoDfTQZaW/SmjDjHboJ3skKMjO+/NWav4nN
 39LoDfxLbDi06fC7Y4H7FHUPjb5sKSzw4W5IttFEKmHOwkz+iwVFL1R0dihBqv7G
 Lq0Ta6xffW8jHrzpmmSDY1I6FSmZ9LlHPCL00qQ5Z7WkMS5oDk0GzZoLFasdNfvm
 fP9N13+uel2/R7hclpxE6J+IZPN5ARG3HAQ5POS+2gMlIzaH4AlMh7yf5q0sSGwq
 uQsmdps8c+LOtAakOzVScykEZvwBh+ci8VqE1X1zol+fl8ijeWqgWtz4XXYECC0=
 =saD8
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Fix dm-raid transient device failure processing and other smaller
   tweaks.

 - Add journal support to the DM raid target to close the 'write hole'
   on raid 4/5/6.

 - Fix dm-cache corruption, due to rounding bug, when cache exceeds 2TB.

 - Add 'metadata2' feature to dm-cache to separate the dirty bitset out
   from other cache metadata. This improves speed of shutting down a
   large cache device (which implies writing out dirty bits).

 - Fix a memory leak during dm-stats data structure destruction.

 - Fix a DM multipath round-robin path selector performance regression
   that was caused by less precise balancing across all paths.

 - Lastly, introduce a DM core fix for a long-standing DM snapshot
   deadlock that is rooted in the complexity of the device stack used in
   conjunction with block core maintaining bios on current->bio_list to
   manage recursion in generic_make_request(). A more comprehensive fix
   to block core (and its hook in the cpu scheduler) would be wonderful
   but this DM-specific fix is pragmatic considering how difficult it
   has been to make progress on a generic fix.

* tag 'dm-4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
  dm: flush queued bios when process blocks to avoid deadlock
  dm round robin: revert "use percpu 'repeat_count' and 'current_path'"
  dm stats: fix a leaked s->histogram_boundaries array
  dm space map metadata: constify dm_space_map structures
  dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()
  dm persistent data: add cursor skip functions to the cursor APIs
  dm cache metadata: use dm_bitset_new() to create the dirty bitset in format 2
  dm bitset: add dm_bitset_new()
  dm cache metadata: name the cache block that couldn't be loaded
  dm cache metadata: add "metadata2" feature
  dm cache metadata: use bitset cursor api to load discard bitset
  dm bitset: introduce cursor api
  dm btree: use GFP_NOFS in dm_btree_del()
  dm space map common: memcpy the disk root to ensure it's arch aligned
  dm block manager: add unlikely() annotations on dm_bufio error paths
  dm cache: fix corruption seen when using cache > 2TB
  dm raid: cleanup awkward branching in raid_message() option processing
  dm raid: use mddev rather than rdev->mddev
  dm raid: use read_disk_sb() throughout
  dm raid: add raid4/5/6 journaling support
  ...
2017-02-21 12:11:41 -08:00
Bhumika Goyal
b79af13efd dm space map metadata: constify dm_space_map structures
Declare dm_space_map structures as const as they are only passed as an
argument to the function memcpy. This argument is of type const void *,
so dm_space_map structures having this property can be declared as
const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   4889	    240	      0	   5129	   1409 dm-space-map-metadata.o

File size after:
   text	   data	    bss	    dec	    hex	filename
   5139	      0	      0	   5139	   1413 dm-space-map-metadata.o

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 14:14:36 -05:00
Joe Thornber
9b696229aa dm persistent data: add cursor skip functions to the cursor APIs
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:12:50 -05:00
Joe Thornber
2151249eaa dm bitset: add dm_bitset_new()
A more efficient way of creating a populated bitset.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:12:48 -05:00
Joe Thornber
6fe28dbf05 dm bitset: introduce cursor api
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:12:45 -05:00
Joe Thornber
9f9ef0657d dm btree: use GFP_NOFS in dm_btree_del()
dm_btree_del() is called from an ioctl so don't recurse into FS.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:09:10 -05:00
Joe Thornber
3ba3ba1e84 dm space map common: memcpy the disk root to ensure it's arch aligned
The metadata_space_map_root passed to sm_ll_open_metadata() may or may
not be arch aligned, use memcpy to ensure it is.  This is not a fast
path so the extra memcpy doesn't hurt us.

Long-term it'd be better to use the kernel's alignment infrastructure to
remove the memcpy()s that are littered across persistent-data (btree,
array, space-maps, etc).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:02:54 -05:00
Joe Thornber
602548bdd5 dm block manager: add unlikely() annotations on dm_bufio error paths
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16 13:01:07 -05:00
Davidlohr Bueso
642fa448ae sched/core: Remove set_task_state()
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:

  be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")

... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.

However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.

Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().

Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.

[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com

== 1. x86-64 ==

Avg runtime set_task_state():    601 msecs
Avg runtime set_current_state(): 552 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)

== 2. ppc64le ==

Avg runtime set_task_state():  938 msecs
Avg runtime set_current_state: 940 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)

x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:16 +01:00
Benjamin Marzinski
b446396b74 dm space map: always set ev if sm_ll_mutate() succeeds
If no block was allocated or freed, sm_ll_mutate() wasn't setting
*ev, leaving the variable unitialized. sm_ll_insert(),
sm_disk_inc_block(), and sm_disk_new_block() all check ev to see
if there was an allocation event in sm_ll_mutate(), possibly
reading unitialized data.

If no allocation event occured, sm_ll_mutate() should set *ev
to SM_NONE.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:15 -05:00
Benjamin Marzinski
0c79ce0b75 dm space map metadata: skip useless memcpy in metadata_ll_init_index()
When metadata_ll_init_index() is called by sm_ll_new_metadata(),
ll->mi_le hasn't been initialized yet. So, when
metadata_ll_init_index() copies the contents of ll->mi_le into the
newly allocated bitmap_root, it is just copying garbage. ll->mi_le
will be allocated later in sm_ll_extend() and copied into the
bitmap_root, in sm_ll_commit().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:15 -05:00
Benjamin Marzinski
314c25c56c dm space map metadata: fix 'struct sm_metadata' leak on failed create
In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).

If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').

Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-12-08 14:13:14 -05:00
Bart Van Assche
0637018dff dm array: remove a dead assignment in populate_ablock_with_values()
A value is assigned to 'nr_entries' but is never used, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:09 -05:00
Joe Thornber
2e8ed71102 dm block manager: make block locking optional
The block manager's locking is useful for catching cycles that may
result from certain btree metadata corruption.  But in general it serves
as a developer tool to catch bugs in code.  Unless you're finding that
DM thin provisioning is hanging due to infinite loops within the block
manager's access to btree nodes you can safely disable this feature.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de> # do/while(0) macro fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-14 15:17:47 -05:00