Commit Graph

6164 Commits

Author SHA1 Message Date
Khazhismel Kumykov
2613eab119 dm mpath: add Historical Service Time Path Selector
This new selector keeps an exponential moving average of the service
time for each path (losely defined as delta between start_io and
end_io), and uses this along with the number of inflight requests to
estimate future service time for a path.  Since we don't have a prober
to account for temporally slow paths, re-try "slow" paths every once in
a while (num_paths * historical_service_time). To account for fast paths
transitioning to slow, if a path has not completed any request within
(num_paths * historical_service_time), limit the number of outstanding
requests.  To account for low volume situations where number of
inflight IOs would be zero, the last finish time of each path is
factored in.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Gabriel Krisman Bertazi
087615bf3a dm mpath: pass IO start time to path selector
The HST path selector needs this information to perform path
prediction. For request-based mpath, struct request's io_start_time_ns
is used, while for bio-based, use the start_time stored in dm_io.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Mikulas Patocka
48338daaa0 dm writecache: improve performance on DDR persistent memory (Optane)
When testing the dm-writecache target on a real DDR persistent memory
(Intel Optane), it turned out that explicit cache flushing using the
clflushopt instruction performs better than non-temporal stores for
block sizes 1k, 2k and 4k.

The dm-writecache target is singlethreaded (all the copying is done
while holding the writecache lock), so it benefits from clwb, see:
http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com

Add a new function memcpy_flushcache_optimized() that tests if
clflushopt is present - and if it is, we use it instead of
memcpy_flushcache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Mikulas Patocka
499c18045e dm writecache: remove superfluous test in persistent_memory_claim
Remove superfluous test if dax_dev is NULL - dax_direct_access already
does this test.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Zhiqiang Liu
9431cf6efc dm persistent data: switch exit_ro_spine to return void
In commit 4c7da06f5a ("dm persistent data: eliminate unnecessary
return values"), r value in exit_ro_spine will not change, so
exit_ro_spine doesn't need a return value.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
YueHaibing
a86fe8be51 dm integrity: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/dm-integrity.c: In function 'integrity_metadata':
drivers/md/dm-integrity.c:1557:12: warning:
 variable 'save_metadata_offset' set but not used [-Wunused-but-set-variable]
drivers/md/dm-integrity.c:1556:12: warning:
 variable 'save_metadata_block' set but not used [-Wunused-but-set-variable]

They are never used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Heinz Mauelshagen
a5089a95d8 dm ebs: pass discards down to underlying device
Make use of dm_bufio_issue_discard() to pass discards down to the
underlying device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Mikulas Patocka
6fbeb0048e dm bufio: implement discard
Add functions dm_bufio_issue_discard and dm_bufio_discard_buffers.
dm_bufio_issue_discard sends discard request to the underlying device.
dm_bufio_discard_buffers frees buffers in the range and then calls
dm_bufio_issue_discard.

Also, factor out block_to_sector for reuse in dm_bufio_issue_discard.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Heinz Mauelshagen
d3c7b35c20 dm: add emulated block size target
This new target is similar to the linear target except that it emulates
a smaller logical block size on a device with a larger logical block
size.  Its main purpose is to emulate 512 byte sectors on 4K native
disks (i.e. 512e).

See Documentation/admin-guide/device-mapper/dm-ebs.rst for details.

Reviewed-by: Damien Le Moal <DamienLeMoal@wdc.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> [Kconfig fixes]
Signed-off-by: Zheng Bin <zhengbin13@huawei.com> [static fixes]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Martin Wilck
2361ae5953 dm mpath: switch paths in dm_blk_ioctl() code path
SCSI LUN passthrough code such as qemu's "scsi-block" device model
pass every IO to the host via SG_IO ioctls. Currently, dm-multipath
calls choose_pgpath() only in the block IO code path, not in the ioctl
code path (unless current_pgpath is NULL). This has the effect that no
path switching and thus no load balancing is done for SCSI-passthrough
IO, unless the active path fails.

Fix this by using the same logic in multipath_prepare_ioctl() as in
multipath_clone_and_map().

Note: The allegedly best path selection algorithm, service-time,
still wouldn't work perfectly, because the io size of the current
request is always set to 0. Changing that for the IO passthrough
case would require the ioctl cmd and arg to be passed to dm's
prepare_ioctl() method.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Dmitry Baryshkov
27f5411a71 dm crypt: support using encrypted keys
Allow one to use "encrypted" in addition to "user" and "logon" key
types for device encryption.

Signed-off-by: Dmitry Baryshkov <dmitry_baryshkov@mentor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:34 -04:00
Linus Torvalds
c45e8bccec - Document DM integrity allow_discard feature that was added during
5.7 merge window.
 
 - Fix potential for DM writecache data corruption during DM table
   reloads.
 
 - Fix DM verity's FEC support's hash block number calculation in
   verity_fec_decode().
 
 - Fix bio-based DM multipath crash due to use of stale copy of
   MPATHF_QUEUE_IO flag state in __map_bio().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6rSn4THHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWjmFB/9kwi4L5SQTg81ak0jmiO4lNSxqvc8s
 ZJUyT8m/Iheh6FWo6kZYyfN+YVPPwXRFDWk4TGSrKyoqElTbpxkkmVxE7nhp9Z+O
 1f7cfD8vCsTexHxYjxx3DxW529YVjvccifhaJFQgCA3II8+0to9PFzc7v6JpFBGS
 1CWd169OUDHe2XGNmah0lbgwEgb7ZQRn9MNrkdYu3L9HihNLs8h4uin390FlfhSj
 +/rS89jkA9X5MhFnspKrX0wGl3qoQBzFFvvn4KKPcH8EPt5zsUPVo+/rTOBL8Tr8
 VXDYKco4CDfuqzB5LdHeTi3mKwGq+fpPmwzRTk+/gosDZHA49m7mLKGB
 =cr8K
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Document DM integrity allow_discard feature that was added during 5.7
   merge window.

 - Fix potential for DM writecache data corruption during DM table
   reloads.

 - Fix DM verity's FEC support's hash block number calculation in
   verity_fec_decode().

 - Fix bio-based DM multipath crash due to use of stale copy of
   MPATHF_QUEUE_IO flag state in __map_bio().

* tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath
  dm verity fec: fix hash block number in verity_fec_decode
  dm writecache: fix data corruption when reloading the target
  dm integrity: document allow_discard option
2020-04-30 16:45:08 -07:00
Gabriel Krisman Bertazi
5686dee34d dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath
When adding devices that don't have a scsi_dh on a BIO based multipath,
I was able to consistently hit the warning below and lock-up the system.

The problem is that __map_bio reads the flag before it potentially being
modified by choose_pgpath, and ends up using the older value.

The WARN_ON below is not trivially linked to the issue. It goes like
this: The activate_path delayed_work is not initialized for non-scsi_dh
devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
Nevertheless, only for BIO-based mpath, we cache the flag before calling
choose_pgpath, and use the older version when deciding if we should
initialize the path.  Therefore, we end up trying to initialize the
paths, and calling the non-initialized activate_path work.

[   82.437100] ------------[ cut here ]------------
[   82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
  __queue_delayed_work+0x71/0x90
[   82.438436] Modules linked in:
[   82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
[   82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
[   82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
[   82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
[   82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
[   82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
[   82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
[   82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
[   82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
[   82.445427] FS:  00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
[   82.446040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
[   82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   82.448012] Call Trace:
[   82.448164]  queue_delayed_work_on+0x6d/0x80
[   82.448472]  __pg_init_all_paths+0x7b/0xf0
[   82.448714]  pg_init_all_paths+0x26/0x40
[   82.448980]  __multipath_map_bio.isra.0+0x84/0x210
[   82.449267]  __map_bio+0x3c/0x1f0
[   82.449468]  __split_and_process_non_flush+0x14a/0x1b0
[   82.449775]  __split_and_process_bio+0xde/0x340
[   82.450045]  ? dm_get_live_table+0x5/0xb0
[   82.450278]  dm_process_bio+0x98/0x290
[   82.450518]  dm_make_request+0x54/0x120
[   82.450778]  generic_make_request+0xd2/0x3e0
[   82.451038]  ? submit_bio+0x3c/0x150
[   82.451278]  submit_bio+0x3c/0x150
[   82.451492]  mpage_readpages+0x129/0x160
[   82.451756]  ? bdev_evict_inode+0x1d0/0x1d0
[   82.452033]  read_pages+0x72/0x170
[   82.452260]  __do_page_cache_readahead+0x1ba/0x1d0
[   82.452624]  force_page_cache_readahead+0x96/0x110
[   82.452903]  generic_file_read_iter+0x84f/0xae0
[   82.453192]  ? __seccomp_filter+0x7c/0x670
[   82.453547]  new_sync_read+0x10e/0x190
[   82.453883]  vfs_read+0x9d/0x150
[   82.454172]  ksys_read+0x65/0xe0
[   82.454466]  do_syscall_64+0x4e/0x210
[   82.454828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[...]
[   82.462501] ---[ end trace bb39975e9cf45daa ]---

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-28 19:51:46 -04:00
Sunwook Eom
ad4e80a639 dm verity fec: fix hash block number in verity_fec_decode
The error correction data is computed as if data and hash blocks
were concatenated. But hash block number starts from v->hash_start.
So, we have to calculate hash block number based on that.

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org
Signed-off-by: Sunwook Eom <speed.eom@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-16 16:16:38 -04:00
Mikulas Patocka
31b2212019 dm writecache: fix data corruption when reloading the target
The dm-writecache reads metadata in the target constructor. However, when
we reload the target, there could be another active instance running on
the same device. This is the sequence of operations when doing a reload:

1. construct new target
2. suspend old target
3. resume new target
4. destroy old target

Metadata that were written by the old target between steps 1 and 2 would
not be visible by the new target.

Fix the data corruption by loading the metadata in the resume handler.

Also, validate block_size is at least as large as both the devices'
logical block size and only read 1 block from the metadata during
target constructor -- no need to read entirety of metadata now that it
is done during resume.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-16 16:04:13 -04:00
Linus Torvalds
9b06860d7c libnvdimm for 5.7
- Add support for region alignment configuration and enforcement to
   fix compatibility across architectures and PowerPC page size
   configurations.
 
 - Introduce 'zero_page_range' as a dax operation. This facilitates
   filesystem-dax operation without a block-device.
 
 - Introduce phys_to_target_node() to facilitate drivers that want to
   know resulting numa node if a given reserved address range was
   onlined.
 
 - Advertise a persistence-domain for of_pmem and papr_scm. The
   persistence domain indicates where cpu-store cycles need to reach in
   the platform-memory subsystem before the platform will consider them
   power-fail protected.
 
 - Promote numa_map_to_online_node() to a cross-kernel generic facility.
 
 - Save x86 numa information to allow for node-id lookups for reserved
   memory ranges, deploy that capability for the e820-pmem driver.
 
 - Pick up some miscellaneous minor fixes, that missed v5.6-final,
   including a some smatch reports in the ioctl path and some unit test
   compilation fixups.
 
 - Fixup some flexible-array declarations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
 iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
 5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
 BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
 gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
 G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
 5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
 e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
 x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
 5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
 7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
 qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
 =sY4N
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and dax updates from Dan Williams:
 "There were multiple touches outside of drivers/nvdimm/ this round to
  add cross arch compatibility to the devm_memremap_pages() interface,
  enhance numa information for persistent memory ranges, and add a
  zero_page_range() dax operation.

  This cycle I switched from the patchwork api to Konstantin's b4 script
  for collecting tags (from x86, PowerPC, filesystem, and device-mapper
  folks), and everything looks to have gone ok there. This has all
  appeared in -next with no reported issues.

  Summary:

   - Add support for region alignment configuration and enforcement to
     fix compatibility across architectures and PowerPC page size
     configurations.

   - Introduce 'zero_page_range' as a dax operation. This facilitates
     filesystem-dax operation without a block-device.

   - Introduce phys_to_target_node() to facilitate drivers that want to
     know resulting numa node if a given reserved address range was
     onlined.

   - Advertise a persistence-domain for of_pmem and papr_scm. The
     persistence domain indicates where cpu-store cycles need to reach
     in the platform-memory subsystem before the platform will consider
     them power-fail protected.

   - Promote numa_map_to_online_node() to a cross-kernel generic
     facility.

   - Save x86 numa information to allow for node-id lookups for reserved
     memory ranges, deploy that capability for the e820-pmem driver.

   - Pick up some miscellaneous minor fixes, that missed v5.6-final,
     including a some smatch reports in the ioctl path and some unit
     test compilation fixups.

   - Fixup some flexible-array declarations"

* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
  dax: Move mandatory ->zero_page_range() check in alloc_dax()
  dax,iomap: Add helper dax_iomap_zero() to zero a range
  dax: Use new dax zero page method for zeroing a page
  dm,dax: Add dax zero_page_range operation
  s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
  dax, pmem: Add a dax operation zero_page_range
  pmem: Add functions for reading/writing page to/from pmem
  libnvdimm: Update persistence domain value for of_pmem and papr_scm device
  tools/test/nvdimm: Fix out of tree build
  libnvdimm/region: Fix build error
  libnvdimm/region: Replace zero-length array with flexible-array member
  libnvdimm/label: Replace zero-length array with flexible-array member
  ACPI: NFIT: Replace zero-length array with flexible-array member
  libnvdimm/region: Introduce an 'align' attribute
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm: Out of bounds read in __nd_ioctl()
  acpi/nfit: improve bounds checking for 'func'
  mm/memremap_pages: Introduce memremap_compat_align()
  ...
2020-04-08 21:03:40 -07:00
Linus Torvalds
de3c913c6e - Fix excessive bio splitting that caused performance regressions.
- Fix logic bug in DM integrity discard support's integrity tag
   testing.
 
 - Fix DM integrity warning on ppc64le due to missing cast.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6HbiETHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiRFCACoiewv5jBvqzYvINy0FpgwLyU0SOOY
 11UCIzeLgQUPMZ4a5CUpPQuqSxKR3g7RAoD1EJZ1cyOnAJuk6A+VPkOLDioa4BEC
 uS7ifenihclFjcpWjQaKTNEhuTURYSjIwk1wqSwb7Fv+L2Uo+4XB8DvazBZdMq8+
 0EAO2CNl3r9Tkut0MRGr7DKrZa4QauaPsl7BvtzlLZbC3Nj2VncpwePYv9z7c7Ra
 g0zAq8IJlOPgDBUt4szbEAKiwGWQEiG6yxcyj5J3keU6mrHAg0mr08rxaapFkfSK
 F++3gTYjjLi8rQmczufvJ440t2kjK+b4dOfxMyjZccgjcoPxrxxkgaWq
 =Cj9X
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix excessive bio splitting that caused performance regressions

 - Fix logic bug in DM integrity discard support's integrity tag testing

 - Fix DM integrity warning on ppc64le due to missing cast

* tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: fix logic bug in integrity tag testing
  Revert "dm: always call blk_queue_split() in dm_process_bio()"
  dm integrity: fix ppc64le warning
2020-04-03 14:44:48 -07:00
Mikulas Patocka
8267d8fb48 dm integrity: fix logic bug in integrity tag testing
If all the bytes are equal to DISCARD_FILLER, we want to accept the
buffer. If any of the bytes are different, we must do thorough
tag-by-tag checking.

The condition was inverted.

Fixes: 84597a44a9 ("dm integrity: add optional discard support")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 13:07:41 -04:00
Mike Snitzer
120c9257f5 Revert "dm: always call blk_queue_split() in dm_process_bio()"
This reverts commit effd58c95f.

blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.

"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes).  Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).

Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before this revert:

253,0   21        1     0.000000000  2206  Q   R 224 + 4072 [xfs_io]
253,0   21        2     0.000008267  2206  X   R 224 / 480 [xfs_io]
253,0   21        3     0.000010530  2206  X   R 224 / 256 [xfs_io]
253,0   21        4     0.000027022  2206  X   R 480 / 736 [xfs_io]
253,0   21        5     0.000028751  2206  X   R 480 / 512 [xfs_io]
253,0   21        6     0.000033323  2206  X   R 736 / 992 [xfs_io]
253,0   21        7     0.000035130  2206  X   R 736 / 768 [xfs_io]
253,0   21        8     0.000039146  2206  X   R 992 / 1248 [xfs_io]
253,0   21        9     0.000040734  2206  X   R 992 / 1024 [xfs_io]
253,0   21       10     0.000044694  2206  X   R 1248 / 1504 [xfs_io]
253,0   21       11     0.000046422  2206  X   R 1248 / 1280 [xfs_io]
253,0   21       12     0.000050376  2206  X   R 1504 / 1760 [xfs_io]
253,0   21       13     0.000051974  2206  X   R 1504 / 1536 [xfs_io]
253,0   21       14     0.000055881  2206  X   R 1760 / 2016 [xfs_io]
253,0   21       15     0.000057462  2206  X   R 1760 / 1792 [xfs_io]
253,0   21       16     0.000060999  2206  X   R 2016 / 2272 [xfs_io]
253,0   21       17     0.000062489  2206  X   R 2016 / 2048 [xfs_io]
253,0   21       18     0.000066133  2206  X   R 2272 / 2528 [xfs_io]
253,0   21       19     0.000067507  2206  X   R 2272 / 2304 [xfs_io]
253,0   21       20     0.000071136  2206  X   R 2528 / 2784 [xfs_io]
253,0   21       21     0.000072764  2206  X   R 2528 / 2560 [xfs_io]
253,0   21       22     0.000076185  2206  X   R 2784 / 3040 [xfs_io]
253,0   21       23     0.000077486  2206  X   R 2784 / 2816 [xfs_io]
253,0   21       24     0.000080885  2206  X   R 3040 / 3296 [xfs_io]
253,0   21       25     0.000082316  2206  X   R 3040 / 3072 [xfs_io]
253,0   21       26     0.000085788  2206  X   R 3296 / 3552 [xfs_io]
253,0   21       27     0.000087096  2206  X   R 3296 / 3328 [xfs_io]
253,0   21       28     0.000093469  2206  X   R 3552 / 3808 [xfs_io]
253,0   21       29     0.000095186  2206  X   R 3552 / 3584 [xfs_io]
253,0   21       30     0.000099228  2206  X   R 3808 / 4064 [xfs_io]
253,0   21       31     0.000101062  2206  X   R 3808 / 3840 [xfs_io]
253,0   21       32     0.000104956  2206  X   R 4064 / 4096 [xfs_io]
253,0   21       33     0.001138823     0  C   R 4096 + 200 [0]

after this revert:

253,0   18        1     0.000000000  4430  Q   R 224 + 3896 [xfs_io]
253,0   18        2     0.000018359  4430  X   R 224 / 256 [xfs_io]
253,0   18        3     0.000028898  4430  X   R 256 / 512 [xfs_io]
253,0   18        4     0.000033535  4430  X   R 512 / 768 [xfs_io]
253,0   18        5     0.000065684  4430  X   R 768 / 1024 [xfs_io]
253,0   18        6     0.000091695  4430  X   R 1024 / 1280 [xfs_io]
253,0   18        7     0.000098494  4430  X   R 1280 / 1536 [xfs_io]
253,0   18        8     0.000114069  4430  X   R 1536 / 1792 [xfs_io]
253,0   18        9     0.000129483  4430  X   R 1792 / 2048 [xfs_io]
253,0   18       10     0.000136759  4430  X   R 2048 / 2304 [xfs_io]
253,0   18       11     0.000152412  4430  X   R 2304 / 2560 [xfs_io]
253,0   18       12     0.000160758  4430  X   R 2560 / 2816 [xfs_io]
253,0   18       13     0.000183385  4430  X   R 2816 / 3072 [xfs_io]
253,0   18       14     0.000190797  4430  X   R 3072 / 3328 [xfs_io]
253,0   18       15     0.000197667  4430  X   R 3328 / 3584 [xfs_io]
253,0   18       16     0.000218751  4430  X   R 3584 / 3840 [xfs_io]
253,0   18       17     0.000226005  4430  X   R 3840 / 4096 [xfs_io]
253,0   18       18     0.000250404  4430  Q   R 4120 + 176 [xfs_io]
253,0   18       19     0.000847708     0  C   R 4096 + 24 [0]
253,0   18       20     0.000855783     0  C   R 4120 + 176 [0]

Fixes: effd58c95f ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 11:32:19 -04:00
Mike Snitzer
e7fc1e57d9 dm integrity: fix ppc64le warning
Otherwise:

In file included from drivers/md/dm-integrity.c:13:
drivers/md/dm-integrity.c: In function 'dm_integrity_status':
drivers/md/dm-integrity.c:3061:10: error: format '%llu' expects
argument of type 'long long unsigned int', but argument 4 has type
'long int' [-Werror=format=]
   DMEMIT("%llu %llu",
          ^~~~~~~~~~~
    atomic64_read(&ic->number_of_mismatches),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/device-mapper.h:550:46: note: in definition of macro 'DMEMIT'
      0 : scnprintf(result + sz, maxlen - sz, x))
                                              ^
cc1: all warnings being treated as errors

Fixes: 7649194a16 ("dm integrity: remove sector type casts")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 10:44:24 -04:00
Vivek Goyal
4e4ced9379 dax: Move mandatory ->zero_page_range() check in alloc_dax()
zero_page_range() dax operation is mandatory for dax devices. Right now
that check happens in dax_zero_page_range() function. Dan thinks that's
too late and its better to do the check earlier in alloc_dax().

I also modified alloc_dax() to return pointer with error code in it in
case of failure. Right now it returns NULL and caller assumes failure
happened due to -ENOMEM. But with this ->zero_page_range() check, I
need to return -EINVAL instead.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00
Vivek Goyal
cdf6cdcd3b dm,dax: Add dax zero_page_range operation
This patch adds support for dax zero_page_range operation to dm targets.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20200228163456.1587-5-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00
Linus Torvalds
ffc1c20c46 - Add DM writecache "cleaner" policy feature that allows cache to be
flushed while userspace monitors for completion to then discommision
   use of caching.
 
 - Optimize DM writecache superblock writing and also yield CPU while
   initializing writecache on large PMEM devices to avoid CPU stalls.
 
 - Various fixes to DM integrity target while preparing for the
   ability to resize a DM integrity device.  In addition to resize
   support, add optional discard support with the "allow_discards"
   feature.
 
 - Fix DM clone target's discard handling and overflow bugs which could
   cause data corruption.
 
 - Fix memory leak in destructor for DM verity FEC support.
 
 - Fix DM zoned target's redundant increment of nr_rnd_zones.
 
 - Small cleanup in DM crypt to use crypt_integrity_aead() helper.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6ExcsTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWj6OB/4n/EmXRI3x9uFTyaFFEjaALTUx7gye
 hIlOLtRTFmU6yit/uqAARLDBMDElhL/ze8RKs/TSmi/FH37u8d6DscG5dPatCsF1
 dZ7z77uxhc0RQ+WkyMBtYqxO1OGzULt8434Pos0x1aoPrK+wUEpJOcZAJompAVfj
 nD3AbJ92zcv7DEdJGiCbViIrgrkAXkUWByXmn/l0AIJEjyxeCLhJdx76I+9PesOJ
 JKbbjLu1w19Yyo807CRBQLhC9fXhDUJO19jIlKRfGZ9Xa6V2xVuB45VVYiM/jGqI
 L9z6SsXBquX+8x9HgARuw82EkBp5DeazjrHQ2jHMHjMBhz+NFSK0MMnD
 =27Eq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add DM writecache "cleaner" policy feature that allows cache to be
   flushed while userspace monitors for completion to then discommision
   use of caching.

 - Optimize DM writecache superblock writing and also yield CPU while
   initializing writecache on large PMEM devices to avoid CPU stalls.

 - Various fixes to DM integrity target while preparing for the ability
   to resize a DM integrity device. In addition to resize support, add
   optional discard support with the "allow_discards" feature.

 - Fix DM clone target's discard handling and overflow bugs which could
   cause data corruption.

 - Fix memory leak in destructor for DM verity FEC support.

 - Fix DM zoned target's redundant increment of nr_rnd_zones.

 - Small cleanup in DM crypt to use crypt_integrity_aead() helper.

* tag 'for-5.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm clone metadata: Fix return type of dm_clone_nr_of_hydrated_regions()
  dm clone: Add missing casts to prevent overflows and data corruption
  dm clone: Add overflow check for number of regions
  dm clone: Fix handling of partial region discards
  dm writecache: add cond_resched to avoid CPU hangs
  dm integrity: improve discard in journal mode
  dm integrity: add optional discard support
  dm integrity: allow resize of the integrity device
  dm integrity: factor out get_provided_data_sectors()
  dm integrity: don't replay journal data past the end of the device
  dm integrity: remove sector type casts
  dm integrity: fix a crash with unusually large tag size
  dm zoned: remove duplicate nr_rnd_zones increase in dmz_init_zone()
  dm verity fec: fix memory leak in verity_fec_dtr
  dm writecache: optimize superblock write
  dm writecache: implement gradual cleanup
  dm writecache: implement the "cleaner" policy
  dm writecache: do direct write if the cache is full
  dm integrity: print device name in integrity_metadata() error message
  dm crypt: use crypt_integrity_aead() helper
2020-04-01 15:33:12 -07:00
Linus Torvalds
1592614838 for-5.7/drivers-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJDYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplhMD/95jd4nlVetHAo54z+Zk2ExE13+yDamRKyh
 vc7t2tz1reqFOimtVr5aVuTXCTgOx4CpiIox5qcn6qAExN4JtCChOBRGize/0u8S
 ckxnhHbN2C0rfnGldvrYYeNRonFI+7QKimnurWUSYYGN0xqbo21BxJ7dFaohMseo
 q4K8sIW0ctE6AOlw28Jerkg614s2NDGZ7q1laheXnYHn5c9f1m0NaKN/jyTGgr0X
 TLBiLbX2yRrAuvpctBj6Fna6YN7Vdd9jsf2Bt6ipUI1XgHQoVUGMxQNhWPyjsbSv
 GzRQUNAfVcasLzCP/Mj/47144OkUtDDpn2mjeXDaFljLDGFULD+jp/SsOmLCxkPC
 gI7G2yfBvF96/SOyT0JXrLyMcBd1R2vRoASbc5tPu82mZhx7YJZH5WYtOB9h2gra
 RTYo3xcm0EoN6yeMaH+xOuXxTWWInIrgKPONW4H8s7hxEiMt5oFNVBI7vqPr4LVp
 tpfxiKZDavKOofKXogNV4W7mSMP/Ir5Q9Ha4g5SXHBGp0z/PHmnQ0xDGNq0KDnU4
 eNO0UYCFNCNa+0AOhpNxaVuVm9LjrgvyXRjePgOZQ4akhohwHO6DLrHK1f8Hb1vD
 8Ih6uR+F5zZlKsouWro8HLGYm5w40Wq9tbCI8QbPYH6nkGoDmzpPv9jbAeWgJU5c
 KqP/5TBSLA==
 =Bs4E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - floppy driver cleanup series from Willy

 - NVMe updates and fixes (Various)

 - null_blk trace improvements (Chaitanya)

 - bcache fixes (Coly)

 - md fixes (via Song)

 - loop block size change optimizations (Martijn)

 - scnprintf() use (Takashi)

* tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
  null_blk: add trace in null_blk_zoned.c
  null_blk: add tracepoint helpers for zoned mode
  block: add a zone condition debug helper
  nvme: cleanup namespace identifier reporting in nvme_init_ns_head
  nvme: rename __nvme_find_ns_head to nvme_find_ns_head
  nvme: refactor nvme_identify_ns_descs error handling
  nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
  nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
  nvme: Fix controller creation races with teardown flow
  nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
  nvme: Fix ctrl use-after-free during sysfs deletion
  nvme-pci: Re-order nvme_pci_free_ctrl
  nvme: Remove unused return code from nvme_delete_ctrl_sync
  nvme: Use nvme_state_terminal helper
  nvme: release ida resources
  nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
  nvmet-tcp: optimize tcp stack TX when data digest is used
  nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
  nvme-multipath: do not reset on unknown status
  nvmet-rdma: allocate RW ctxs according to mdts
  ...
2020-03-30 11:43:51 -07:00
Nikos Tsironis
81d5553d12 dm clone metadata: Fix return type of dm_clone_nr_of_hydrated_regions()
dm_clone_nr_of_hydrated_regions() returns the number of regions that
have been hydrated so far. In order to do so it employs bitmap_weight().

Until now, the return type of dm_clone_nr_of_hydrated_regions() was
unsigned long.

Because bitmap_weight() returns an int, in case BITS_PER_LONG == 64 and
the return value of bitmap_weight() is 2^31 (the maximum allowed number
of regions for a device), the result is sign extended from 32 bits to 64
bits and an incorrect value is displayed, in the status output of
dm-clone, as the number of hydrated regions.

Fix this by having dm_clone_nr_of_hydrated_regions() return an unsigned
int.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-27 14:42:51 -04:00
Nikos Tsironis
9fc06ff568 dm clone: Add missing casts to prevent overflows and data corruption
Add missing casts when converting from regions to sectors.

In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
to overflows and miscalculation of the device sector.

As a result, we could end up discarding and/or copying the wrong parts
of the device, thus corrupting the device's data.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-27 14:42:25 -04:00
Nikos Tsironis
cd481c1226 dm clone: Add overflow check for number of regions
Add overflow check for clone->nr_regions variable, which holds the
number of regions of the target.

The overflow can occur with sufficiently large devices, if BITS_PER_LONG
== 32. E.g., if the region size is 8 sectors (4K), the overflow would
occur for device sizes > 34359738360 sectors (~16TB).

This could result in multiple device sectors wrongly mapping to the same
region number, due to the truncation from 64 bits to 32 bits, which
would lead to data corruption.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-27 14:41:46 -04:00
Nikos Tsironis
4b5142905d dm clone: Fix handling of partial region discards
There is a bug in the way dm-clone handles discards, which can lead to
discarding the wrong blocks or trying to discard blocks beyond the end
of the device.

This could lead to data corruption, if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block to be lost.

The root of the problem is the code that calculates the range of regions
covered by a discard request and decides which regions to discard.

Since dm-clone handles the device in units of regions, we don't discard
parts of a region, only whole regions.

The range is calculated as:

    rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
    re = bio_end_sector(bio) >> clone->region_shift;

, where 'rs' is the first region to discard and (re - rs) is the number
of regions to discard.

The bug manifests when we try to discard part of a single region, i.e.,
when we try to discard a block with size < region_size, and the discard
request both starts at an offset with respect to the beginning of that
region and ends before the end of the region.

The root cause is the following comparison:

  if (rs == re)
    // skip discard and complete original bio immediately

, which doesn't take into account that 'rs' might be greater than 're'.

Thus, we then issue a discard request for the wrong blocks, instead of
skipping the discard all together.

Fix the check to also take into account the above case, so we don't end
up discarding the wrong blocks.

Also, add some range checks to dm_clone_set_region_hydrated() and
dm_clone_cond_set_range(), which update dm-clone's region bitmap.

Note that the aforementioned bug doesn't cause invalid memory accesses,
because dm_clone_is_range_hydrated() returns True for this case, so the
checks are just precautionary.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-27 14:41:21 -04:00
Mikulas Patocka
1edaa447d9 dm writecache: add cond_resched to avoid CPU hangs
Initializing a dm-writecache device can take a long time when the
persistent memory device is large.  Add cond_resched() to a few loops
to avoid warnings that the CPU is stuck.

Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-27 14:36:50 -04:00
Christoph Hellwig
3d745ea5b0 block: simplify queue allocation
Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper.  Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 10:23:43 -06:00
Christoph Hellwig
ff27668ce8 bcache: pass the make_request methods to blk_queue_make_request
bcache is the only driver not actually passing its make_request
methods to blk_queue_make_request, but instead just sets them up
manually a little later.  Make bcache follow the common way of
setting up make_request based queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 10:23:43 -06:00
Christoph Hellwig
c6a564ffad block: move the part_stat* helpers from genhd.h to a new header
These macros are just used by a few files.  Move them out of genhd.h,
which is included everywhere into a new standalone header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:09 -06:00
Coly Li
5ae3a2c03d bcache: remove dupplicated declaration from btree.h
Commit 253a99d95d ("bcache: move macro btree() and btree_root()
into btree.h") makes two duplicated declaration into btree.h,
	typedef int (btree_map_keys_fn)();
	int bch_btree_map_keys();

The kbuild test robot <lkp@intel.com> detects and reports this
problem and this patch fixes it by removing the duplicated ones.

Fixes: 253a99d95d ("bcache: move macro btree() and btree_root() into btree.h")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 19:56:42 -06:00
Mikulas Patocka
31843edab7 dm integrity: improve discard in journal mode
When we discard something that is present in the journal, we flush the
journal first, so that discarded blocks are not overwritten by the journal
content.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 13:09:49 -04:00
Mikulas Patocka
84597a44a9 dm integrity: add optional discard support
Add an argument "allow_discards" that enables discard processing on
dm-integrity device. Discards are only allowed to devices using
internal hash.

When a block is discarded the integrity tag is filled with
DISCARD_FILLER (0xf6) bytes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 13:05:25 -04:00
Mikulas Patocka
1ac2c15a7b dm integrity: allow resize of the integrity device
If the size of the underlying device changes, change the size of the
integrity device too.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:52:53 -04:00
Mikulas Patocka
87fb177b4c dm integrity: factor out get_provided_data_sectors()
Move code to a new function get_provided_data_sectors().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:46:35 -04:00
Mikulas Patocka
f6f72f32c2 dm integrity: don't replay journal data past the end of the device
Following commits will make it possible to shrink or extend the device. If
the device was shrunk, we don't want to replay journal data pointing past
the end of the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:43:21 -04:00
Mikulas Patocka
7649194a16 dm integrity: remove sector type casts
Since the commit 72deb455b5 ("block:
remove CONFIG_LBDAF") sector_t is always defined as unsigned long
long.

Delete the needless type casts in printk and avoids some warnings if
DEBUG_PRINT is defined.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:40:18 -04:00
Mikulas Patocka
b93b6643e9 dm integrity: fix a crash with unusually large tag size
If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
there's a crash in integrity_metadata().

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:34:45 -04:00
Bob Liu
b8fdd09037 dm zoned: remove duplicate nr_rnd_zones increase in dmz_init_zone()
zmd->nr_rnd_zones was increased twice by mistake. The other place it
is increased in dmz_init_zone() is the only one needed:

1131                 zmd->nr_useable_zones++;
1132                 if (dmz_is_rnd(zone)) {
1133                         zmd->nr_rnd_zones++;
					^^^
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:21:48 -04:00
Shetty, Harshini X (EXT-Sony Mobile)
75fa601934 dm verity fec: fix memory leak in verity_fec_dtr
Fix below kmemleak detected in verity_fec_ctr. output_pool is
allocated for each dm-verity-fec device. But it is not freed when
dm-table for the verity target is removed. Hence free the output
mempool in destructor function verity_fec_dtr.

unreferenced object 0xffffffffa574d000 (size 4096):
  comm "init", pid 1667, jiffies 4294894890 (age 307.168s)
  hex dump (first 32 bytes):
    8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00  .6..f...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000060e82407>] __kmalloc+0x2b4/0x340
    [<00000000dd99488f>] mempool_kmalloc+0x18/0x20
    [<000000002560172b>] mempool_init_node+0x98/0x118
    [<000000006c3574d2>] mempool_init+0x14/0x20
    [<0000000008cb266e>] verity_fec_ctr+0x388/0x3b0
    [<000000000887261b>] verity_ctr+0x87c/0x8d0
    [<000000002b1e1c62>] dm_table_add_target+0x174/0x348
    [<000000002ad89eda>] table_load+0xe4/0x328
    [<000000001f06f5e9>] dm_ctl_ioctl+0x3b4/0x5a0
    [<00000000bee5fbb7>] do_vfs_ioctl+0x5dc/0x928
    [<00000000b475b8f5>] __arm64_sys_ioctl+0x70/0x98
    [<000000005361e2e8>] el0_svc_common+0xa0/0x158
    [<000000001374818f>] el0_svc_handler+0x6c/0x88
    [<000000003364e9f4>] el0_svc+0x8/0xc
    [<000000009d84cec9>] 0xffffffffffffffff

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Depends-on: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Harshini Shetty <harshini.x.shetty@sony.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 12:00:29 -04:00
Mikulas Patocka
dc8a01ae1d dm writecache: optimize superblock write
If we write a superblock in writecache_flush, we don't need to set bit and
scan the bitmap for it - we can just write the superblock directly. Also,
we can set the flag REQ_FUA on the write bio, so that we don't need to
submit a flush bio afterwards.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:55:09 -04:00
Mikulas Patocka
3923d4854e dm writecache: implement gradual cleanup
If a block is stored in the cache for too long, it will now be
written to the underlying device and cleaned up.

Add a new option "max_age" that specifies the maximum age of a block
in milliseconds.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:55:08 -04:00
Mikulas Patocka
93de44eb3f dm writecache: implement the "cleaner" policy
The "flush" or "flush_on_suspend" messages flush the whole cache. However,
these flushing methods can take some time and the process is left in
an interruptible state during the flush.

Implement a "cleaner" option that offers an alternate flushing method.
When this option is activated (either by a message or in the constructor
arguments), the cache will not promote new writes (however, writes to
already cached blocks are promoted, to avoid data corruption due to
misordered writes) and it will gradually writeback any cached data. The
userspace can then monitor the cleaning process with "dmsetup status".
When the number of cached bloks drops to zero, the userspace can unload
the dm-writecache target and replace it with dm-linear or other targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:50:30 -04:00
Mikulas Patocka
d53f1fafec dm writecache: do direct write if the cache is full
If the cache device is full, we do a direct write to the origin device.
Note that we must not do it if the written block is already in the cache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:26:18 -04:00
Erich Eckner
eaab4bde6e dm integrity: print device name in integrity_metadata() error message
Similar to f710126cfc ("dm crypt: print
device name in integrity error message"), this message should also
better identify the device with the integrity failure.

Signed-off-by: Erich Eckner <git@eckner.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:25:11 -04:00
Yang Yingliang
3fd53533a8 dm crypt: use crypt_integrity_aead() helper
Replace test_bit(CRYPT_MODE_INTEGRITY_AEAD, XXX) with
crypt_integrity_aead().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-24 11:17:33 -04:00
Christoph Hellwig
74cc979c3c block: cleanup how md_autodetect_dev is called
Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code.  Then use IS_BUILTIN to call it instead of the
ifdef magic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:08 -06:00
Christoph Hellwig
ea3edd4dc2 block: remove __bdevname
There is no good reason for __bdevname to exist.  Just open code
printing the string in the callers.  For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.

Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00