Commit Graph

8114 Commits

Author SHA1 Message Date
Alexander Lobakin
a37fbe666c bitmap: introduce generic optimized bitmap_size()
The number of times yet another open coded
`BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge.
Some generic helper is long overdue.

Add one, bitmap_size(), but with one detail.
BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both
divident and divisor are compile-time constants or when the divisor
is not a pow-of-2. When it is however, the compilers sometimes tend
to generate suboptimal code (GCC 13):

48 83 c0 3f          	add    $0x3f,%rax
48 c1 e8 06          	shr    $0x6,%rax
48 8d 14 c5 00 00 00 00	lea    0x0(,%rax,8),%rdx

%BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does
full division of `nbits + 63` by it and then multiplication by 8.
Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC:

8d 50 3f             	lea    0x3f(%rax),%edx
c1 ea 03             	shr    $0x3,%edx
81 e2 f8 ff ff 1f    	and    $0x1ffffff8,%edx

Now it shifts `nbits + 63` by 3 positions (IOW performs fast division
by 8) and then masks bits[2:0]. bloat-o-meter:

add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617)

Clang does it better and generates the same code before/after starting
from -O1, except that with the ALIGN() approach it uses %edx and thus
still saves some bytes:

add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520)

Note that we can't expand DIV_ROUND_UP() by adding a check and using
this approach there, as it's used in array declarations where
expressions are not allowed.
Add this helper to tools/ as well.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:28 +01:00
Arnd Bergmann
8e91c23423 dm integrity: fix out-of-range warning
Depending on the value of CONFIG_HZ, clang complains about a pointless
comparison:

drivers/md/dm-integrity.c:4085:12: error: result of comparison of
                        constant 42949672950 with expression of type
                        'unsigned int' is always false
                        [-Werror,-Wtautological-constant-out-of-range-compare]
                        if (val >= (uint64_t)UINT_MAX * 1000 / HZ) {

As the check remains useful for other configurations, shut up the
warning by adding a second type cast to uint64_t.

Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-29 09:48:07 -04:00
Ken Raeburn
d7e1201443 dm vdo murmurhash3: use kernel byteswapping routines instead of GCC ones
Also open-code the calls.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-29 09:45:54 -04:00
Linus Torvalds
64f799ffb4 - Fix a memory leak in DM integrity recheck code that was added during
the 6.9 merge. Also fix the recheck code to ensure it issues bios
   with proper alignment.
 
 - Fix DM snapshot's dm_exception_table_exit() to schedule while
   handling an large exception table during snapshot device shutdown.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmX9seYACgkQxSPxCi2d
 A1oKMwgA2xYlw4JooN7PIvGb7IiIfzXfdVmmtTiHBzuvhAOg7f014UbUoubnNz0z
 DogzvWVX4Svieh60EYVaaHdWWJWBQnNLO5qJCEmt4sMYXU9pnpJnfWCQiP3EHKL5
 8fVsoCeEK1Eh+eYghG7731zSaqqavwTYm3rlzcuq5hPa+cW6b2OgxzUJybEfsXhi
 9Y5lc7LUEuLuSRAnL/Msg4Rmo6eR45GOmY9EcULx0hf/plJrho9J5cp07KJ5ViXK
 ciYTGy3bkoVxKufepdINaKZyI2/3PEDZm2cTfidubhSj33Zgce2c7oJz+1F9Q9wf
 bixaB+mI2pMVwFzllrkxnl8yIGgoAw==
 =D74z
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix a memory leak in DM integrity recheck code that was added during
   the 6.9 merge. Also fix the recheck code to ensure it issues bios
   with proper alignment.

 - Fix DM snapshot's dm_exception_table_exit() to schedule while
   handling an large exception table during snapshot device shutdown.

* tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-integrity: align the outgoing bio in integrity_recheck
  dm snapshot: fix lockup in dm_exception_table_exit
  dm-integrity: fix a memory leak when rechecking the data
2024-03-22 12:34:26 -07:00
Linus Torvalds
1d35aae78f Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
 
  - Use more threads when building Debian packages in parallel
 
  - Fix warnings shown during the RPM kernel package uninstallation
 
  - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
    Makefile
 
  - Support GCC's -fmin-function-alignment flag
 
  - Fix a null pointer dereference bug in modpost
 
  - Add the DTB support to the RPM package
 
  - Various fixes and cleanups in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM
 VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC
 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz
 vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG
 ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP
 OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff
 LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p
 +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ
 FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ
 c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0
 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V
 dWH7BPecS44h4KXx
 =tFdl
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)

 - Use more threads when building Debian packages in parallel

 - Fix warnings shown during the RPM kernel package uninstallation

 - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
   Makefile

 - Support GCC's -fmin-function-alignment flag

 - Fix a null pointer dereference bug in modpost

 - Add the DTB support to the RPM package

 - Various fixes and cleanups in Kconfig

* tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits)
  kconfig: tests: test dependency after shuffling choices
  kconfig: tests: add a test for randconfig with dependent choices
  kconfig: tests: support KCONFIG_SEED for the randconfig runner
  kbuild: rpm-pkg: add dtb files in kernel rpm
  kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig()
  kconfig: check prompt for choice while parsing
  kconfig: lxdialog: remove unused dialog colors
  kconfig: lxdialog: fix button color for blackbg theme
  modpost: fix null pointer dereference
  kbuild: remove GCC's default -Wpacked-bitfield-compat flag
  kbuild: unexport abs_srctree and abs_objtree
  kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
  kconfig: remove named choice support
  kconfig: use linked list in get_symbol_str() to iterate over menus
  kconfig: link menus to a symbol
  kbuild: fix inconsistent indentation in top Makefile
  kbuild: Use -fmin-function-alignment when available
  alpha: merge two entries for CONFIG_ALPHA_GAMMA
  alpha: merge two entries for CONFIG_ALPHA_EV4
  kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj)
  ...
2024-03-21 14:41:00 -07:00
Mikulas Patocka
b4d78cfeb3 dm-integrity: align the outgoing bio in integrity_recheck
It is possible to set up dm-integrity with smaller sector size than
the logical sector size of the underlying device. In this situation,
dm-integrity guarantees that the outgoing bios have the same alignment as
incoming bios (so, if you create a filesystem with 4k block size,
dm-integrity would send 4k-aligned bios to the underlying device).

This guarantee was broken when integrity_recheck was implemented.
integrity_recheck sends bio that is aligned to ic->sectors_per_block. So
if we set up integrity with 512-byte sector size on a device with logical
block size 4k, we would be sending unaligned bio. This triggered a bug in
one of our internal tests.

This commit fixes it by determining the actual alignment of the
incoming bio and then makes sure that the outgoing bio in
integrity_recheck has the same alignment.

Fixes: c88f5e553f ("dm-integrity: recheck the integrity tag after a failure")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-21 13:19:10 -04:00
Mikulas Patocka
6e7132ed3c dm snapshot: fix lockup in dm_exception_table_exit
There was reported lockup when we exit a snapshot with many exceptions.
Fix this by adding "cond_resched" to the loop that frees the exceptions.

Reported-by: John Pittman <jpittman@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-20 14:31:24 -04:00
Mikulas Patocka
55e565c42d dm-integrity: fix a memory leak when rechecking the data
Memory for the "checksums" pointer will leak if the data is rechecked
after checksum failure (because the associated kfree won't happen due
to 'goto skip_io').

Fix this by freeing the checksums memory before recheck, and just use
the "checksum_onstack" memory for storing checksum during recheck.

Fixes: c88f5e553f ("dm-integrity: recheck the integrity tag after a failure")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-19 11:51:37 -04:00
Linus Torvalds
e5eb28f6d1 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
 
 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".
 
 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits.  The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".
 
 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".
 
 - Ryusuke Konishi continues nilfs2 maintenance work in the series
 
 	"nilfs2: eliminate kmap and kmap_atomic calls"
 	"nilfs2: fix kernel bug at submit_bh_wbc()"
 
 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".
 
 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".
 
 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".
 
 Plus the usual shower of singleton patches in various parts of the tree.
 Please see the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA
 jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh
 tKVmvihFxfAhdDthseXcIf1nBjMALwY=
 =8rVc
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
   heap optimizations".

 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".

 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits. The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".

 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".

 - Ryusuke Konishi continues nilfs2 maintenance work in the series

	"nilfs2: eliminate kmap and kmap_atomic calls"
	"nilfs2: fix kernel bug at submit_bh_wbc()"

 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".

 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".

 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".

Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.

* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  nilfs2: prevent kernel bug at submit_bh_wbc()
  nilfs2: fix failure to detect DAT corruption in btree and direct mappings
  ocfs2: enable ocfs2_listxattr for special files
  ocfs2: remove SLAB_MEM_SPREAD flag usage
  assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
  buildid: use kmap_local_page()
  watchdog/core: remove sysctl handlers from public header
  nilfs2: use div64_ul() instead of do_div()
  mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
  kexec: copy only happens before uchunk goes to zero
  get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
  get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
  get_signal: don't abuse ksig->info.si_signo and ksig->sig
  const_structs.checkpatch: add device_type
  Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
  dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
  list: leverage list_is_head() for list_entry_is_head()
  nilfs2: MAINTAINERS: drop unreachable project mirror site
  smp: make __smp_processor_id() 0-argument macro
  fat: fix uninitialized field in nostale filehandles
  ...
2024-03-14 18:03:09 -07:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
61387b8dcf - Introduce the DM vdo target which provides block-level
deduplication, compression, and thin provisioning. Please see both:
   Documentation/admin-guide/device-mapper/vdo.rst and
   Documentation/admin-guide/device-mapper/vdo-design.rst
 
 - The DM vdo target handles its concurrency by pinning an IO, and
   subsequent stages of handling that IO, to a particular VDO thread.
   This aspect of VDO is "unique" but its overall implementation is
   very tightly coupled to its mostly lockless threading model.
   As such, VDO is not easily changed to use more traditional
   finer-grained locking and Linux workqueues. Please see the "Zones
   and Threading" section of vdo-design.rst
 
 - The DM vdo target has been used in production for many years but has
   seen significant changes over the past ~6 years to prepare it for
   upstream inclusion. The codebase is still large but it is isolated
   to drivers/md/dm-vdo/ and has been made considerably more
   approachable and maintainable.
 
 - Matt Sakai has been added to the MAINTAINERS file to reflect that he
   will send VDO changes upstream through the DM subsystem maintainers.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXqYlEACgkQxSPxCi2d
 A1oOAggAlPdoDfU2DWoFxJcCGR4sYrk6n8GOGXZIi5bZeJFaPUoH9EQZUVgsMiEo
 TSpJugk6KddsXNsKHCuc3ctejNT4FAs3f5pcqWQ+VTR6xXD7xw67hN78arhu4VZH
 OWzAyKu8fs4NUxjMThLdwMlt2+vOfcuOHAXxcil3kvG6SPcPw1FDde3Jbb5OmgcF
 z1D9kkUuqH+d46P/dwOsNcr7cMIUviZW+plFnVMKsdJGH/SCyu6TCLYWrwBR1VXW
 pNylxk4AcsffQdu2Oor6+0ALaH7Wq3kjymNLQlYWt0EibGbBakrVs1A/DIXUZDiX
 gYfcemQpl74x1vifNtIsUWW9vO1XPg==
 =2Oof
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/dm-vdo' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper VDO target from Mike Snitzer:
 "Introduce the DM vdo target which provides block-level deduplication,
  compression, and thin provisioning. Please see:

      Documentation/admin-guide/device-mapper/vdo.rst
      Documentation/admin-guide/device-mapper/vdo-design.rst

  The DM vdo target handles its concurrency by pinning an IO, and
  subsequent stages of handling that IO, to a particular VDO thread.
  This aspect of VDO is "unique" but its overall implementation is very
  tightly coupled to its mostly lockless threading model. As such, VDO
  is not easily changed to use more traditional finer-grained locking
  and Linux workqueues. Please see the "Zones and Threading" section of
  vdo-design.rst

  The DM vdo target has been used in production for many years but has
  seen significant changes over the past ~6 years to prepare it for
  upstream inclusion. The codebase is still large but it is isolated to
  drivers/md/dm-vdo/ and has been made considerably more approachable
  and maintainable.

  Matt Sakai has been added to the MAINTAINERS file to reflect that he
  will send VDO changes upstream through the DM subsystem maintainers"

* tag 'for-6.9/dm-vdo' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (142 commits)
  dm vdo: document minimum metadata size requirements
  dm vdo: remove meaningless version number constant
  dm vdo: remove vdo_perform_once
  dm vdo block-map: Remove stray semicolon
  dm vdo string-utils: change from uds_ to vdo_ namespace
  dm vdo logger: change from uds_ to vdo_ namespace
  dm vdo funnel-queue: change from uds_ to vdo_ namespace
  dm vdo indexer: fix use after free
  dm vdo logger: remove log level to string conversion code
  dm vdo: document log_level parameter
  dm vdo: add 'log_level' module parameter
  dm vdo: remove all sysfs interfaces
  dm vdo target: eliminate inappropriate uses of UDS_SUCCESS
  dm vdo indexer: update ASSERT and ASSERT_LOG_ONLY usage
  dm vdo encodings: update some stale comments
  dm vdo permassert: audit all of ASSERT to test for VDO_SUCCESS
  dm-vdo funnel-workqueue: return VDO_SUCCESS from make_simple_work_queue
  dm vdo thread-utils: return VDO_SUCCESS on vdo_create_thread success
  dm vdo int-map: return VDO_SUCCESS on success
  dm vdo: check for VDO_SUCCESS return value from memory-alloc functions
  ...
2024-03-13 09:57:23 -07:00
Linus Torvalds
c0499a0812 - Convert the DM verity and crypt targets from (ab)using tasklets to
using BH workqueues.  These changes were coordinated with Tejun and
   are based ontop of DM's 6.9 changes and Tejun's 6.9 workqueue tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXqBRAACgkQxSPxCi2d
 A1pAIAf+PKFi4KlNvJlJehKzCMBxofFUK+DMxePqbhHudN1v6z2Q0lMbsa0wpCly
 eql2SQQPRhqI4j7WhhpY6uXPpxZ9TKrDsIFiOTka6kZBwIlVafMDwtb+16/PVSHa
 le7ngtCaY4EFu0aPpVV8oW5TGJSgmGsv+KvgZ/Op+ugmbOKAraMvVaGb89xh6WQN
 ngUoLqEgF0br79UWKYJ2YtjsgK1h2Ra+Nu8eh/FV3dAOtnJtNJ5Ph0MLz35dGQvv
 0djeDy/DGmWKeNxUdqQPzgguPG84P5Pfg6xbZa7bqqv4ujQOp2YIX4dxkjM2w2Aj
 III+WJ5XHaLOE1JutwZCVyuabNApMA==
 =MSmO
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/dm-bh-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper BH workqueue conversion from Mike Snitzer:
 "Convert the DM verity and crypt targets from (ab)using tasklets to
  using BH workqueues.

  These changes were coordinated with Tejun and are based ontop of DM's
  6.9 changes and Tejun's 6.9 workqueue tree"

* tag 'for-6.9/dm-bh-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-verity: Convert from tasklet to BH workqueue
  dm-crypt: Convert from tasklet to BH workqueue
2024-03-13 09:46:51 -07:00
Linus Torvalds
d2bac0823d - Fix DM core's IO submission (which include dm-io and dm-bufio) such
that a bio's IO priority is propagated. Work focused on enabling
   both DM crypt and verity targets to retain the appropriate IO
   priority.
 
 - Fix DM raid reshape logic to not allow an empty flush bio to be
   requeued due to false concern about the bio, which doesn't have a
   data payload, accessing beyond the end of the device.
 
 - Fix DM core's internal resume so that it properly calls both presume
   and resume methods, which fixes the potential for a postsuspend and
   resume imbalance.
 
 - Update DM verity target to set DM_TARGET_SINGLETON flag because it
   doesn't make sense to have a DM table with a mix of targets that
   include dm-verity.
 
 - Small cleanups in DM crypt, thin, and integrity targets.
 
 - Fix references to dm-devel mailing list to use latest list address.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXwWHkACgkQxSPxCi2d
 A1oavggAtmftaoAXUJNdiYgIDm6/gKPa0CDyQEoD5cTyQ32R7c6SYNN7u2wrltHX
 af36lzoksfsOra8uMa4+Q46/XFeqlRzvfu29bhmfAWMkMT3MMq7iGtTFG/SluD7b
 NWjXhsUT8Fv2Q7BKglUc6cXUIbCZNwNUs5cxx2QobdrD57qxVKDz5HkH5/EggptA
 cA6dpD7DbpWQhHWm0UN6cOmYNh4kGLiQs4S50N5hc7zlbXfJhhVzflecsVPY+MVN
 wS/iv/hNenlLuJ7gzIPBwmxRgBbjHHbrbFNVa2yFrPQ8m/AuRbAUqYQO1sOXNOMZ
 Mkk2G0IB7sJtpnEm+6l0x7A4VmSWvg==
 =Eoxd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Fix DM core's IO submission (which include dm-io and dm-bufio) such
   that a bio's IO priority is propagated. Work focused on enabling both
   DM crypt and verity targets to retain the appropriate IO priority

 - Fix DM raid reshape logic to not allow an empty flush bio to be
   requeued due to false concern about the bio, which doesn't have a
   data payload, accessing beyond the end of the device

 - Fix DM core's internal resume so that it properly calls both presume
   and resume methods, which fixes the potential for a postsuspend and
   resume imbalance

 - Update DM verity target to set DM_TARGET_SINGLETON flag because it
   doesn't make sense to have a DM table with a mix of targets that
   include dm-verity

 - Small cleanups in DM crypt, thin, and integrity targets

 - Fix references to dm-devel mailing list to use latest list address

* tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: call the resume method on internal suspend
  dm raid: fix false positive for requeue needed during reshape
  dm-integrity: set max_integrity_segments in dm_integrity_io_hints
  dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list
  dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list
  dm verity: set DM_TARGET_SINGLETON feature flag
  dm crypt: Fix IO priority lost when queuing write bios
  dm verity: Fix IO priority lost when reading FEC and hash
  dm bufio: Support IO priority
  dm io: Support IO priority
  dm crypt: remove redundant state settings after waking up
  dm thin: add braces around conditional code that spans lines
2024-03-13 09:38:10 -07:00
Mikulas Patocka
65e8fbde64 dm: call the resume method on internal suspend
There is this reported crash when experimenting with the lvm2 testsuite.
The list corruption is caused by the fact that the postsuspend and resume
methods were not paired correctly; there were two consecutive calls to the
origin_postsuspend function. The second call attempts to remove the
"hash_list" entry from a list, while it was already removed by the first
call.

Fix __dm_internal_resume so that it calls the preresume and resume
methods of the table's targets.

If a preresume method of some target fails, we are in a tricky situation.
We can't return an error because dm_internal_resume isn't supposed to
return errors. We can't return success, because then the "resume" and
"postsuspend" methods would not be paired correctly. So, we set the
DMF_SUSPENDED flag and we fake normal suspend - it may confuse userspace
tools, but it won't cause a kernel crash.

------------[ cut here ]------------
kernel BUG at lib/list_debug.c:56!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 8343 Comm: dmsetup Not tainted 6.8.0-rc6 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:__list_del_entry_valid_or_report+0x77/0xc0
<snip>
RSP: 0018:ffff8881b831bcc0 EFLAGS: 00010282
RAX: 000000000000004e RBX: ffff888143b6eb80 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff819053d0 RDI: 00000000ffffffff
RBP: ffff8881b83a3400 R08: 00000000fffeffff R09: 0000000000000058
R10: 0000000000000000 R11: ffffffff81a24080 R12: 0000000000000001
R13: ffff88814538e000 R14: ffff888143bc6dc0 R15: ffffffffa02e4bb0
FS:  00000000f7c0f780(0000) GS:ffff8893f0a40000(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000057fb5000 CR3: 0000000143474000 CR4: 00000000000006b0
Call Trace:
 <TASK>
 ? die+0x2d/0x80
 ? do_trap+0xeb/0xf0
 ? __list_del_entry_valid_or_report+0x77/0xc0
 ? do_error_trap+0x60/0x80
 ? __list_del_entry_valid_or_report+0x77/0xc0
 ? exc_invalid_op+0x49/0x60
 ? __list_del_entry_valid_or_report+0x77/0xc0
 ? asm_exc_invalid_op+0x16/0x20
 ? table_deps+0x1b0/0x1b0 [dm_mod]
 ? __list_del_entry_valid_or_report+0x77/0xc0
 origin_postsuspend+0x1a/0x50 [dm_snapshot]
 dm_table_postsuspend_targets+0x34/0x50 [dm_mod]
 dm_suspend+0xd8/0xf0 [dm_mod]
 dev_suspend+0x1f2/0x2f0 [dm_mod]
 ? table_deps+0x1b0/0x1b0 [dm_mod]
 ctl_ioctl+0x300/0x5f0 [dm_mod]
 dm_compat_ctl_ioctl+0x7/0x10 [dm_mod]
 __x64_compat_sys_ioctl+0x104/0x170
 do_syscall_64+0x184/0x1b0
 entry_SYSCALL_64_after_hwframe+0x46/0x4e
RIP: 0033:0xf7e6aead
<snip>
---[ end trace 0000000000000000 ]---

Fixes: ffcc393641 ("dm: enhance internal suspend and resume interface")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-12 09:27:42 -04:00
Ming Lei
b25b8f4b8e dm raid: fix false positive for requeue needed during reshape
An empty flush doesn't have a payload, so it should never be looked at
when considering to possibly requeue a bio for the case when a reshape
is in progress.

Fixes: 9dbd1aa3a8 ("dm raid: add reshaping support to the target")
Reported-by: Patrick Plenefisch <simonpatp@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-12 09:27:42 -04:00
Linus Torvalds
bff4b74625 Revert "dm: use queue_limits_set"
This reverts commit 8e0ef41286.

It's broken, and causes the boot to fail on encrypted volumes.

Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11 17:11:28 -07:00
Linus Torvalds
1ddeeb2a05 for-6.9/block-20240310
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuFO4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq33D/9hyNyBce2A9iyo026eK8EqLDoed6BPzuvB
 kLKj5tsGvX4YlfuswvP86M5dgibTASXclnfUK394TijW/JPOfJ3mNhi9gMnHzRoK
 ZaR1di0Lum56dY1FkpMmWiGmE4fB79PAtXYKtajOkuoIcNzylncEAAACUY4/Ouhg
 Cm+LMg2prcc+m9g8rKDNQ51pUFg4U21KAUTl35XLMUAaQk1ahW3EDEVYhweC/zwE
 V/5hJsv8UY72+oQGY2Dc/YgQk/Zj4ZDh7C+oHR9XeB/ro99kr3/Vopagu0gBMLZi
 Rq6qqz6PVMhVcuz8uN2rsTQKXmXhsBn9/adsl4AKtdxcW5D5moWb5BLq1P0WQylc
 nzMxa1d6cVcTKZpaUQQv3Rj6ZMrLuDwP277UYHfn5x1oPWYRZCG7FtHuOo1gNcpG
 DrSNwVG6BSDcbABqI+MIS2oD1JoUMyevjwT7e2hOXukZhc6GLO5F3ODWE5j3KnCR
 S/aGSAmcdR4fTcgavULqWdQVt7SYl4f1IxT8KrUirJGVhc2LgahaWj69ooklVHoU
 fPDFRiruwJ5YkH4RWCSDm9mi4kAz6eUf+f4yE06wZOFOb2fT8/1ZK2Snpz2KeXuZ
 INO0RejtFzT8L0OUlu7dBmF20y6rgAYt87lR8mIt71yuuATIrVhzlX1VdsvhdrAo
 VLHGV1Ncgw==
 =WlVL
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull requests via Song:
      - Cleanup redundant checks (Yu Kuai)
      - Remove deprecated headers (Marc Zyngier, Song Liu)
      - Concurrency fixes (Li Lingfeng)
      - Memory leak fix (Li Nan)
      - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
      - Clean up and fix for md_ioctl (Li Nan)
      - Other small fixes (Gui-Dong Han, Heming Zhao)
      - MD atomic limits (Christoph)

 - NVMe pull request via Keith:
      - RDMA target enhancements (Max)
      - Fabrics fixes (Max, Guixin, Hannes)
      - Atomic queue_limits usage (Christoph)
      - Const use for class_register (Ricardo)
      - Identification error handling fixes (Shin'ichiro, Keith)

 - Improvement and cleanup for cached request handling (Christoph)

 - Moving towards atomic queue limits. Core changes and driver bits so
   far (Christoph)

 - Fix UAF issues in aoeblk (Chun-Yi)

 - Zoned fix and cleanups (Damien)

 - s390 dasd cleanups and fixes (Jan, Miroslav)

 - Block issue timestamp caching (me)

 - noio scope guarding for zoned IO (Johannes)

 - block/nvme PI improvements (Kanchan)

 - Ability to terminate long running discard loop (Keith)

 - bdev revalidation fix (Li)

 - Get rid of old nr_queues hack for kdump kernels (Ming)

 - Support for async deletion of ublk (Ming)

 - Improve IRQ bio recycling (Pavel)

 - Factor in CPU capacity for remote vs local completion (Qais)

 - Add shared_tags configfs entry for null_blk (Shin'ichiro

 - Fix for a regression in page refcounts introduced by the folio
   unification (Tony)

 - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
   Ricardo, Roman, Tang, Uwe)

* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
  block/swim: Convert to platform remove callback returning void
  cdrom: gdrom: Convert to platform remove callback returning void
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
  bcache: move calculation of stripe_size and io_opt into bcache_device_init
  virtio_blk: Do not use disk_set_max_open/active_zones()
  aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
  block: move capacity validation to blkpg_do_ioctl()
  block: prevent division by zero in blk_rq_stat_sum()
  drbd: atomically update queue limits in drbd_reconsider_queue_parameters
  ...
2024-03-11 11:43:44 -07:00
Linus Torvalds
910202f00a vfs-6.9.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
 ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
 9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
 =Y1qX
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed build on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Yu Kuai
fcf3f7e2fc raid1: fix use-after-free for original bio in raid1_write_request()
r1_bio->bios[] is used to record new bios that will be issued to
underlying disks, however, in raid1_write_request(), r1_bio->bios[]
will set to the original bio temporarily. Meanwhile, if blocked rdev
is set, free_r1bio() will be called causing that all r1_bio->bios[]
to be freed:

raid1_write_request()
 r1_bio = alloc_r1bio(mddev, bio); -> r1_bio->bios[] is NULL
 for (i = 0;  i < disks; i++) -> for each rdev in conf
  // first rdev is normal
  r1_bio->bios[0] = bio; -> set to original bio
  // second rdev is blocked
  if (test_bit(Blocked, &rdev->flags))
   break

 if (blocked_rdev)
  free_r1bio()
   put_all_bios()
    bio_put(r1_bio->bios[0]) -> original bio is freed

Test scripts:

mdadm -CR /dev/md0 -l1 -n4 /dev/sd[abcd] --assume-clean
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 \
    -iodepth=128 -name=test -direct=1
echo blocked > /sys/block/md0/md/rd2/state

Test result:

BUG bio-264 (Not tainted): Object already free
-----------------------------------------------------------------------------

Allocated in mempool_alloc_slab+0x24/0x50 age=1 cpu=1 pid=869
 kmem_cache_alloc+0x324/0x480
 mempool_alloc_slab+0x24/0x50
 mempool_alloc+0x6e/0x220
 bio_alloc_bioset+0x1af/0x4d0
 blkdev_direct_IO+0x164/0x8a0
 blkdev_write_iter+0x309/0x440
 aio_write+0x139/0x2f0
 io_submit_one+0x5ca/0xb70
 __do_sys_io_submit+0x86/0x270
 __x64_sys_io_submit+0x22/0x30
 do_syscall_64+0xb1/0x210
 entry_SYSCALL_64_after_hwframe+0x6c/0x74
Freed in mempool_free_slab+0x1f/0x30 age=1 cpu=1 pid=869
 kmem_cache_free+0x28c/0x550
 mempool_free_slab+0x1f/0x30
 mempool_free+0x40/0x100
 bio_free+0x59/0x80
 bio_put+0xf0/0x220
 free_r1bio+0x74/0xb0
 raid1_make_request+0xadf/0x1150
 md_handle_request+0xc7/0x3b0
 md_submit_bio+0x76/0x130
 __submit_bio+0xd8/0x1d0
 submit_bio_noacct_nocheck+0x1eb/0x5c0
 submit_bio_noacct+0x169/0xd40
 submit_bio+0xee/0x1d0
 blkdev_direct_IO+0x322/0x8a0
 blkdev_write_iter+0x309/0x440
 aio_write+0x139/0x2f0

Since that bios for underlying disks are not allocated yet, fix this
problem by using mempool_free() directly to free the r1_bio.

Fixes: 992db13a4a ("md/raid1: free the r1bio before waiting for blocked rdev")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240308093726.1047420-1-yukuai1@huaweicloud.com
2024-03-08 09:44:22 -08:00
Jens Axboe
d37977f0af Merge tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
Pull MD atomic queue limits changes from Song.

* tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
2024-03-06 11:15:24 -07:00
Christoph Hellwig
f30e5ed130 dm-integrity: set max_integrity_segments in dm_integrity_io_hints
Set max_integrity_segments with the other queue limits instead
of updating it later.  This also uncovered that the driver is trying
to set the limit to UINT_MAX while max_integrity_segments is an
unsigned short, so fix it up to use USHRT_MAX instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-06 12:21:27 -05:00
Christoph Hellwig
396799eb5b md: remove mddev->queue
Just use the request_queue from the gendisk pointer in the relatively
few places that sill need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
81a16e19d5 md: don't initialize queue limits
Initial queue limits are now set from ->run.  Remove the superfluous
initialization in md_alloc and level_store.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-10-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
3d8466ba68 md/raid10: use the atomic queue limit update APIs
Build the queue limits outside the queue and apply them using
queue_limits_set.   To make the code more obvious also split the queue
limits handling into separate helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-9-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
f63f17350e md/raid5: use the atomic queue limit update APIs
Build the queue limits outside the queue and apply them using
queue_limits_set.  To make the code more obvious also split the queue
limits handling into separate helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-8-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
97894f7d3c md/raid1: use the atomic queue limit update APIs
Build the queue limits outside the queue and apply them using
queue_limits_set.  To make the code more obvious also split the queue
limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-7-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
56cf22d6f6 md/raid0: use the atomic queue limit update APIs
Build the queue limits outside the queue and apply them using
queue_limits_set.  To make the code more obvious also split the queue
limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-6-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
e305fce188 md: add queue limit helpers
Add a few helpers that wrap the block queue limits API for use in MD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
176df894d7 md: add a mddev_is_dm helper
Add a helper to check for a DM-mapped MD device instead of using
the obfuscated ->gendisk or ->queue NULL checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de
2024-03-06 08:59:53 -08:00
Christoph Hellwig
28be4fd310 md: add a mddev_add_trace_msg helper
Add a small wrapper around blk_add_trace_msg that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de
2024-03-06 08:59:52 -08:00
Christoph Hellwig
c396b90e50 md: add a mddev_trace_remap helper
Add a helper to trace bio remapping that hides some argument
dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de
2024-03-06 08:59:52 -08:00
Christoph Hellwig
34a2cf3fbe bcache: move calculation of stripe_size and io_opt into bcache_device_init
bcache currently calculates the stripe size for the non-cached_dev
case directly in bcache_device_init, but for the cached_dev case it does
it in the caller.  Consolidate it in one places, which also enables
setting the io_opt queue_limit before allocating the gendisk so that it
can be passed in instead of changing the limit just after the allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20240226104826.283067-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:34:19 -07:00
Song Liu
3a889fdce7 Merge branch 'dmraid-fix-6.9' into md-6.9
This is the second half of fixes for dmraid. The first half is available
at [1].

This set contains fixes:
 - reshape can start unexpected, cause data corruption, patch 1,5,6;
 - deadlocks that reshape concurrent with IO, patch 8;
 - a lockdep warning, patch 9;

For all the dmraid related tests in lvm2 suite, there is no new
regressions compared against 6.6 kernels (which is good baseline before
recent regressions).

[1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/

* dmraid-fix-6.9:
  dm-raid: fix lockdep waring in "pers->hot_add_disk"
  dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
  dm-raid: add a new helper prepare_suspend() in md_personality
  md/dm-raid: don't call md_reap_sync_thread() directly
  dm-raid: really frozen sync_thread during suspend
  md: add a new helper reshape_interrupted()
  md: export helper md_is_rdwr()
  md: export helpers to stop sync_thread
  md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
2024-03-05 12:53:55 -08:00
Yu Kuai
95009ae904 dm-raid: fix lockdep waring in "pers->hot_add_disk"
The lockdep assert is added by commit a448af25be ("md/raid10: remove
rcu protection to access rdev from conf") in print_conf(). And I didn't
notice that dm-raid is calling "pers->hot_add_disk" without holding
'reconfig_mutex'.

"pers->hot_add_disk" read and write many fields that is protected by
'reconfig_mutex', and raid_resume() already grab the lock in other
contex. Hence fix this problem by protecting "pers->host_add_disk"
with the lock.

Fixes: 9092c02d94 ("DM RAID: Add ability to restore transiently failed devices on resume")
Fixes: a448af25be ("md/raid10: remove rcu protection to access rdev from conf")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-10-yukuai1@huaweicloud.com
2024-03-05 12:53:33 -08:00
Yu Kuai
41425f96d7 dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
For raid456, if reshape is still in progress, then IO across reshape
position will wait for reshape to make progress. However, for dm-raid,
in following cases reshape will never make progress hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f07 ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fix the problem that IO across
reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh start to hang:

[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress.

For md/raid, the problem doesn't exist because register new sync_thread
doesn't rely on the IO to be done any more:

1) If array is read-only, it can switch to read-write by ioctl/sysfs;
2) md/raid never set MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
   'reconfig_mutex', hence it can be cleared and reshape can continue by
   sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid yet. This
patch on the one hand make sure raid_message() can't change
sync_thread() through raid_message() after presuspend(), on the other
hand detect the above 3 cases before wait for IO do be done in
dm_suspend(), and let dm-raid requeue those IO.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
2024-03-05 12:53:33 -08:00
Yu Kuai
5625ff8b72 dm-raid: add a new helper prepare_suspend() in md_personality
There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com
2024-03-05 12:53:33 -08:00
Yu Kuai
cd32b27a66 md/dm-raid: don't call md_reap_sync_thread() directly
Currently md_reap_sync_thread() is called from raid_message() directly
without holding 'reconfig_mutex', this is definitely unsafe because
md_reap_sync_thread() can change many fields that is protected by
'reconfig_mutex'.

However, hold 'reconfig_mutex' here is still problematic because this
will cause deadlock, for example, commit 130443d60b ("md: refactor
idle/frozen_sync_thread() to fix deadlock").

Fix this problem by using stop_sync_thread() to unregister sync_thread,
like md/raid did.

Fixes: be83651f00 ("DM RAID: Add message/status support for changing sync action")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-7-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Yu Kuai
16c4770c75 dm-raid: really frozen sync_thread during suspend
1) commit f52f5c71f3 ("md: fix stopping sync thread") remove
   MD_RECOVERY_FROZEN from __md_stop_writes() and doesn't realize that
   dm-raid relies on __md_stop_writes() to frozen sync_thread
   indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in
   md_stop_writes(), and since stop_sync_thread() is only used for
   dm-raid in this case, also move stop_sync_thread() to
   md_stop_writes().
2) The flag MD_RECOVERY_FROZEN doesn't mean that sync thread is frozen,
   it only prevent new sync_thread to start, and it can't stop the
   running sync thread; In order to frozen sync_thread, after seting the
   flag, stop_sync_thread() should be used.
3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, use
   it as condition for md_stop_writes() in raid_postsuspend() doesn't
   look correct. Consider that reentrant stop_sync_thread() do nothing,
   always call md_stop_writes() in raid_postsuspend().
4) raid_message can set/clear the flag MD_RECOVERY_FROZEN at anytime,
   and if MD_RECOVERY_FROZEN is cleared while the array is suspended,
   new sync_thread can start unexpected. Fix this by disallow
   raid_message() to change sync_thread status during suspend.

Note that after commit f52f5c71f3 ("md: fix stopping sync thread"), the
test shell/lvconvert-raid-reshape.sh start to hang in stop_sync_thread(),
and with previous fixes, the test won't hang there anymore, however, the
test will still fail and complain that ext4 is corrupted. And with this
patch, the test won't hang due to stop_sync_thread() or fail due to ext4
is corrupted anymore. However, there is still a deadlock related to
dm-raid456 that will be fixed in following patches.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/
Fixes: 1af2048a3e ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3a8 ("dm raid: add reshaping support to the target")
Fixes: f52f5c71f3 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Yu Kuai
503f9d4379 md: add a new helper reshape_interrupted()
The helper will be used for dm-raid456 later to detect the case that
reshape can't make progress.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Yu Kuai
314e9af065 md: export helper md_is_rdwr()
There are no functional changes for now, prepare to fix a deadlock for
dm-raid456.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Yu Kuai
7a2347e284 md: export helpers to stop sync_thread
Add new helpers:

  void md_idle_sync_thread(struct mddev *mddev);
  void md_frozen_sync_thread(struct mddev *mddev);
  void md_unfrozen_sync_thread(struct mddev *mddev);

The helpers will be used in dm-raid in later patches to fix regressions
and prevent calling md_reap_sync_thread() directly.

Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Yu Kuai
2f03d0c2cd md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
After commit 9dbd1aa3a8 ("dm raid: add reshaping support to the
target") raid_ctr() will set MD_RECOVERY_FROZEN before md_run() and
expect to keep array frozen until resume. However, md_run() will clear
the flag by setting mddev->recovery to 0.

Before commit 1baae052cc ("md: Don't ignore suspended array in
md_check_recovery()"), dm-raid actually relied on suspending to prevent
starting new sync_thread.

Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in
md_run().

Fixes: 1baae052cc ("md: Don't ignore suspended array in md_check_recovery()")
Fixes: 9dbd1aa3a8 ("dm raid: add reshaping support to the target")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com
2024-03-05 12:53:32 -08:00
Song Liu
3445139e3a Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""
This reverts commit bed9e27baf.

The original set [1][2] was expected to undo a suboptimal fix in [2], and
replace it with a better fix [1]. However, as reported by Dan Moulding [2]
causes an issue with raid5 with journal device.

Revert [2] for now to close the issue. We will follow up on another issue
reported by Juxiao Bi, as [2] is expected to fix it. We believe this is a
good trade-off, because the latter issue happens less freqently.

In the meanwhile, we will NOT revert [1], as it contains the right logic.

[1] commit d6e035aad6 ("md: bypass block throttle for superblock update")
[2] commit bed9e27baf ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")

Reported-by: Dan Moulding <dan@danm.net>
Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/
Fixes: bed9e27baf ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Cc: stable@vger.kernel.org # v5.19+
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240125082131.788600-1-song@kernel.org
2024-03-05 12:52:59 -08:00
Matthew Sakai
2a7f925bc2 dm vdo: remove meaningless version number constant
Also remove related log messages.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 23:11:11 -05:00
Matthew Sakai
7eb30fe18f dm vdo: remove vdo_perform_once
Remove obsolete function vdo_perform_once. Instead, initialize
necessary module state when loading the module.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 23:11:11 -05:00
Yang Li
d0464d8287 dm vdo block-map: Remove stray semicolon
Remove the unnecessary semicolon at the end of the for statement.

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:57 -05:00
Mike Snitzer
900d337b46 dm vdo string-utils: change from uds_ to vdo_ namespace
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chung Chung <cchung@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:57 -05:00
Mike Snitzer
3584240b9c dm vdo logger: change from uds_ to vdo_ namespace
Rename all uds_log_* to vdo_log_*.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chung Chung <cchung@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:57 -05:00
Mike Snitzer
66214ed000 dm vdo funnel-queue: change from uds_ to vdo_ namespace
Also return VDO_SUCCESS from vdo_make_funnel_queue.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chung Chung <cchung@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:57 -05:00
Matthew Sakai
41c58a36e2 dm vdo indexer: fix use after free
Fixes: b46d79bdb8 ("dm vdo: add deduplication index storage interface")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:57 -05:00
Mike Snitzer
7979d90757 dm vdo logger: remove log level to string conversion code
Was only used by sysfs code, can be reinstated if/when needed.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:57 -05:00
Mike Snitzer
25315e967a dm vdo: add 'log_level' module parameter
Expose control over dm-vdo's log-level in terms of a module param. It
can be read and written via /sys/module/dm_vdo/parameters/log_level.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
a9da0fb6d8 dm vdo: remove all sysfs interfaces
Also update target major version number.

All info is (or will be) accessible through alternative interfaces
(e.g. "dmsetup message", module params, etc).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
4e4152482b dm vdo target: eliminate inappropriate uses of UDS_SUCCESS
Most should be VDO_SUCCESS.  But comparing the return from
kstrtouint() with UDS_SUCCESS (happens to be 0) makes no sense.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Matthew Sakai
e60167367e dm vdo indexer: update ASSERT and ASSERT_LOG_ONLY usage
Update indexer uses of ASSERT and ASSERT_LOG_ONLY to
VDO_ASSERT and VDO_ASSERT_LOG_ONLY, respectively. Remove
ASSERT and ASSERT_LOG_ONLY. Also rename uds_assertion_failed
to vdo_assertion_failed.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:56 -05:00
Mike Snitzer
fc03f73760 dm vdo encodings: update some stale comments
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
6a79248b42 dm vdo permassert: audit all of ASSERT to test for VDO_SUCCESS
Also rename ASSERT to VDO_ASSERT and ASSERT_LOG_ONLY to
VDO_ASSERT_LOG_ONLY.

But re-introduce ASSERT and ASSERT_LOG_ONLY as a placeholder
for the benefit of dm-vdo/indexer.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
a958c53af7 dm-vdo funnel-workqueue: return VDO_SUCCESS from make_simple_work_queue
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
34edf9e28c dm vdo thread-utils: return VDO_SUCCESS on vdo_create_thread success
Update all callers to check for VDO_SUCCESS.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
6c43cf2488 dm vdo int-map: return VDO_SUCCESS on success
Update all callers to check for VDO_SUCCESS (most already did).
Also fix whitespace for update_mapping() parameters.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
2de70388b3 dm vdo: check for VDO_SUCCESS return value from memory-alloc functions
VDO_SUCCESS and UDS_SUCCESS were used interchangably, update all
callers of VDO's memory-alloc functions to consistently check for
VDO_SUCCESS.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
97d3380396 dm vdo memory-alloc: return VDO_SUCCESS on success
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Matthew Sakai
ee8f6ec1b1 dm vdo errors: remove unused error codes
Also define VDO_SUCCESS in a more central location, and
rename error block constants for clarity.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:56 -05:00
Mike Snitzer
8f89115efc dm vdo memory-alloc: rename vdo_do_allocation to __vdo_do_allocation
__vdo_do_allocation shouldn't be used outside of memory-alloc.h, so
add hidden prefix.

Also, tabify the vdo_allocate_extended macro.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Mike Snitzer
0eea6b6e78 dm vdo memory-alloc: change from uds_ to vdo_ namespace
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:56 -05:00
Bruce Johnston
6008d526b0 dm-vdo: change unnamed enums to defines
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:56 -05:00
Matthew Sakai
04530b487b dm vdo: remove outdated pointer_map reference
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:56 -05:00
Matthew Sakai
e1e510fcad dm vdo: update module comments
Update outdated comments referring to separate VDO and UDS
modules.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Matthew Sakai
bbe434d94e dm vdo indexer delta-index: fix typos in comments
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Jiapeng Chong
eebd4e1630 dm vdo: fix various function names referenced in comment blocks
No functional modification involved.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Mike Snitzer
17b1a73fea dm vdo: move indexer files into sub-directory
The goal is to assist high-level understanding of which code is
conceptually specific to VDO's indexer.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:55 -05:00
Matthew Sakai
61234f0bda dm vdo: remove unnecessary indexer.h includes
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Chung Chung
81c751ad1b dm vdo: clean up scnprintf usage
Ignore scnprintf return status since it is not necessary. Change
write_* functions type from int to void since we no longer return
any result. Also, clean up any code that checks or uses any scnprintf
return results.
Check uds_allocate return code which was previous ignored, return
and log error when uds_allocate failed.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Chung Chung <cchung@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Mike Snitzer
20be466c7a dm vdo: include <asm/current.h> to resolve current being undeclared
Reported when building on loongarch.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:55 -05:00
Mike Snitzer
444d3f0bfd dm vdo indexer-volume: fix missing mutex_lock in process_entry
Must mutex_lock after dm_bufio_read, before dm_bufio_read error
handling, otherwise process_entry error path will return without
volume->read_threads_mutex held. This fixes potential double
mutex_unlock.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:55 -05:00
Mike Snitzer
b259c1a60c dm vdo flush: initialize return to NULL in allocate_flush
Otherwise, error path could result in allocate_flush's subsequent
check for flush being non-NULL leading to false positive.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Dan Carpenter
672fc9b8c0 dm vdo slab-depot: delete unnecessary check in allocate_components
This is a duplicate check so it can't be true.  Delete it.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Mike Snitzer
924553644a dm vdo memory-alloc: simplify allocations_allowed()
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-04 15:07:55 -05:00
Susan LeGendre-McGhee
dcd1332bb5 dm vdo: remove internal ticket references
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-04 15:07:55 -05:00
Tejun Heo
c375b22333 dm-verity: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This commit converts dm-verity from tasklet to BH workqueue. It
backfills tasklet code that was removed with commit 0a9bab391e
("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH
workqueue (and does some renaming).

This is a minimal conversion which doesn't rename the related names
including the "try_verify_in_tasklet" option. If this patch is applied, a
follow-up patch would be necessary. I couldn't decide whether the option
name would need to be updated too.

Signed-off-by: Tejun Heo <tj@kernel.org>
[snitzer: rename 'use_tasklet' to 'use_bh_wq' and 'in_tasklet' to 'in_bh']
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-02 10:30:36 -05:00
Tejun Heo
fb6ad4aec1 dm-crypt: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This commit converts dm-crypt from tasklet to BH workqueue.  It
backfills tasklet code that was removed with commit 0a9bab391e
("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH
workqueue.

Like a regular workqueue, a BH workqueue allows freeing the currently
executing work item. Converting from tasklet to BH workqueue removes the
need for deferring bio_endio() again to a work item, which was buggy anyway.

I tested this lightly with "--perf-no_read_workqueue
--perf-no_write_workqueue" + some code modifications, but would really
-appreciate if someone who knows the code base better could take a look.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/82b964f0-c2c8-a2c6-5b1f-f3145dc2c8e5@redhat.com
[snitzer: rebase ontop of commit 0a9bab391e reduced this commit's changes]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-02 10:30:36 -05:00
Christoph Hellwig
8e0ef41286 dm: use queue_limits_set
Use queue_limits_set which validates the limits and takes care of
updating the readahead settings instead of directly assigning them to
the queue.  For that make sure all limits are actually updated before
the assignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 08:54:42 -07:00
Mike Snitzer
6a87a8a258 dm vdo thread-device: rename all methods to reflect vdo-only use
Also moved vdo_init()'s call to vdo_initialize_thread_device_registry
next to other registry initialization.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:26:24 -05:00
Mike Snitzer
82b354ffe2 dm vdo thread-registry: rename all methods to reflect vdo-only use
Otherwise, uds_ prefix is misleading (vdo_ is the new catch-all for
code that is used by vdo-only or _both_ vdo and the indexer code).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:26:20 -05:00
Mike Snitzer
cb6f8b7500 dm vdo thread-utils: cleanup included headers
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:26:11 -05:00
Mike Snitzer
650e3107bc dm vdo thread-utils: further cleanup of thread functions
Change thread function prefix from "uds_" to "vdo_" and fix
vdo_join_threads() to return void.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:26:07 -05:00
Mike Snitzer
fe6e4ccbe8 dm vdo thread-utils: remove all uds_*_mutex wrappers
Just use mutex_init, mutex_lock and mutex_unlock.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:26:03 -05:00
Mike Snitzer
7f2e494ddd dm vdo thread-utils: push uds_*_cond interface down to indexer
Only used by indexer components. Also return void from
uds_init_cond(), remove uds_destroy_cond(), and fix up
all callers.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:58 -05:00
Mike Snitzer
877f36b764 dm vdo: fold thread-cond-var.c into thread-utils
Further cleanup is needed for thread-utils interfaces given many
functions should return void or be removed entirely because they
amount to obfuscation via wrappers.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:54 -05:00
Mike Snitzer
8e6333af19 dm vdo indexer: rename uds.h to indexer.h
Also remove unnecessary include from funnel-queue.c.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:49 -05:00
Mike Snitzer
c2f54aa2b2 dm vdo: rename uds-threads.[ch] to thread-utils.[ch]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:45 -05:00
Mike Snitzer
eef7cf5e22 dm vdo indexer sparse-cache: cleanup threads_barrier code
Rename 'barrier' to 'threads_barrier', remove useless
uds_destroy_barrier(), return void from remaining methods and
clean up uds_make_sparse_cache() accordingly.

Also remove uds_ prefix from the 2 remaining threads_barrier
functions.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:41 -05:00
Mike Snitzer
0593855a83 dm vdo uds-threads: push 'barrier' down to sparse-cache
The sparse-cache is the only user of the 'barrier' data structure,
so just move it private to it.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:36 -05:00
Mike Snitzer
2d98aa1780 dm vdo uds-threads: eliminate uds_*_semaphore interfaces
The implementation of thread 'barrier' data structure does not require
overdone private semaphore wrappers.  Also rename the barrier
structure's 'mutex' member (a semaphore) to 'lock'.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:32 -05:00
Mike Snitzer
9d87418945 dm vdo: make uds_*_semaphore interface private to uds-threads.c
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:23 -05:00
Mike Snitzer
50944062f7 dm vdo block-map: rename page state name from "UDS_FREE" to "FREE"
Only used for log message, but no need for "UDS_" prefix.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
2024-03-01 09:25:16 -05:00
Harshit Mogalapalli
f304f6b443 dm vdo volume-index: fix an assert statement in start_restoring_volume_sub_index()
Use "==" instead of "=" in ASSERT() statement.

Fixes: ef074a31e88e ("dm vdo: implement the volume index")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-01 09:25:09 -05:00
Yu Kuai
0091c5a269 md/raid1: factor out helpers to choose the best rdev from read_balance()
The way that best rdev is chosen:

1) If the read is sequential from one rdev:
 - if rdev is rotational, use this rdev;
 - if rdev is non-rotational, use this rdev until total read length
   exceed disk opt io size;

2) If the read is not sequential:
 - if there is idle disk, use it, otherwise:
 - if the array has non-rotational disk, choose the rdev with minimal
   inflight IO;
 - if all the underlaying disks are rotational disk, choose the rdev
   with closest IO;

There are no functional changes, just to make code cleaner and prepare
for following refactor.

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com
2024-02-29 22:49:46 -08:00
Yu Kuai
ba58f57fdf md/raid1: factor out the code to manage sequential IO
There is no functional change for now, make read_balance() cleaner and
prepare to fix problems and refactor the handler of sequential IO.

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com
2024-02-29 22:49:46 -08:00
Yu Kuai
9f3ced7922 md/raid1: factor out choose_bb_rdev() from read_balance()
read_balance() is hard to understand because there are too many status
and branches, and it's overlong.

This patch factor out the case to read the rdev with bad blocks from
read_balance(), there are no functional changes.

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com
2024-02-29 22:49:46 -08:00