Commit Graph

1215433 Commits

Author SHA1 Message Date
Denis Plotnikov
1bbe254e43 md-cluster: check for timeout while a new disk adding
A new disk adding may end up with timeout and a new disk won't be added.
Add returning the error in that case.

Found by Linux Verification Center (linuxtesting.org) with SVACE

Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru
2023-10-12 09:16:19 -07:00
Song Liu
9164e4a5af Merge branch 'md-suspend-rewrite' into md-next
From Yu Kuai, written by Song Liu

Recent tests with raid10 revealed many issues with the following scenarios:

- add or remove disks to the array
- issue io to the array

At first, we fixed each problem independently respect that io can
concurrent with array reconfiguration. However, with more issues reported
continuously, I am hoping to fix these problems thoroughly.

Refer to how block layer protect io with queue reconfiguration (for
example, change elevator):

blk_mq_freeze_queue
-> wait for all io to be done, and prevent new io to be dispatched
// reconfiguration
blk_mq_unfreeze_queue

I think we can do something similar to synchronize io with array
reconfiguration.

Current synchronization works as the following. For the reconfiguration
operation:

1. Hold 'reconfig_mutex';
2. Check that rdev can be added/removed, one condition is that there is no
   IO (for example, check nr_pending).
3. Do the actual operations to add/remove a rdev, one procedure is
   set/clear a pointer to rdev.
4. Check if there is still no IO on this rdev, if not, revert the
   change.

IO path uses rcu_read_lock/unlock() to access rdev.

- rcu is used wrongly;
- There are lots of places involved that old rdev can be read, however,
many places doesn't handle old value correctly;
- Between step 3 and 4, if new io is dispatched, NULL will be read for
the rdev, and data will be lost if step 4 failed.

The new synchronization is similar to blk_mq_freeze_queue(). To add or
remove disk:

1. Suspend the array, that is, stop new IO from being dispatched
   and wait for inflight IO to finish.
2. Add or remove rdevs to array;
3. Resume the array;

IO path doesn't need to change for now, and all rcu implementation can
be removed.

Then main work is divided into 3 steps:

First, first make sure new apis to suspend the array is general:

- make sure suspend array will wait for io to be done(Done by [1]);
- make sure suspend array can be called for all personalities(Done by [2]);
- make sure suspend array can be called at any time(Done by [3]);
- make sure suspend array doesn't rely on 'reconfig_mutex'(PATCH 3-5);

Second replace old apis with new apis(PATCH 6-16). Specifically, the
synchronization is changed from:

  lock reconfig_mutex
  suspend array
  make changes
  resume array
  unlock reconfig_mutex

to:
   suspend array
   lock reconfig_mutex
   make changes
   unlock reconfig_mutex
   resume array

Finally, for the remain path that involved reconfiguration, suspend the
array first(PATCH 11,12, [4] and PATCH 17):

Preparatory work:
[1] https://lore.kernel.org/all/20230621165110.1498313-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/20230628012931.88911-2-yukuai1@huaweicloud.com/
[3] https://lore.kernel.org/all/20230825030956.1527023-1-yukuai1@huaweicloud.com/
[4] https://lore.kernel.org/all/20230825031622.1530464-1-yukuai1@huaweicloud.com/

* md-suspend-rewrite:
  md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
  md: remove old apis to suspend the array
  md: suspend array in md_start_sync() if array need reconfiguration
  md/raid5: replace suspend with quiesce() callback
  md/md-linear: cleanup linear_add()
  md: cleanup mddev_create/destroy_serial_pool()
  md: use new apis to suspend array before mddev_create/destroy_serial_pool
  md: use new apis to suspend array for ioctls involed array reconfiguration
  md: use new apis to suspend array for adding/removing rdev from state_store()
  md: use new apis to suspend array for sysfs apis
  md/raid5: use new apis to suspend array
  md/raid5-cache: use new apis to suspend array
  md/md-bitmap: use new apis to suspend array for location_store()
  md/dm-raid: use new apis to suspend array
  md: add new helpers to suspend/resume and lock/unlock array
  md: add new helpers to suspend/resume array
  md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
  md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
  md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
2023-10-10 19:23:32 -07:00
Yu Kuai
2b16a52549 md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
Now that the old apis are removed, __mddev_suspend/resume() can be
renamed to their original names.

This is done by:

sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch]
sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
4717c02875 md: remove old apis to suspend the array
Now that mddev_suspend() and mddev_resume() is not used anywhere, remove
them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
bc08041b32 md: suspend array in md_start_sync() if array need reconfiguration
So that io won't concurrent with array reconfiguration, and it's safe to
suspend the array directly because normal io won't rely on
md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-18-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
b42cd7b3a2 md/raid5: replace suspend with quiesce() callback
raid5 is the only personality to suspend array in check_reshape() and
start_reshape() callback, suspend and quiesce() callback can both wait
for all normal io to be done, and prevent new io to be dispatched, the
difference is that suspend is implemented in common layer, and quiesce()
callback is implemented in raid5.

In order to cleanup all the usage of mddev_suspend(), the new apis
__mddev_suspend() need to be called before 'reconfig_mutex' is held,
and it's not good to affect all the personalities in common layer just
for raid5. Hence replace suspend with quiesce() callaback, prepare to
reomove all the users of mddev_suspend().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-17-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
1978c742f3 md/md-linear: cleanup linear_add()
Now that caller already suspend the array, there is no need to suspend
array in liner_add().

Note that mddev_suspend/resume() is not used anymore.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-16-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
b4128c00a6 md: cleanup mddev_create/destroy_serial_pool()
Now that except for stopping the array, all the callers already suspend
the array, there is no need to suspend anymore, hence remove the second
parameter.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com
2023-10-10 18:49:51 -07:00
Yu Kuai
58226942ad md: use new apis to suspend array before mddev_create/destroy_serial_pool
mddev_create/destroy_serial_pool() will be called from several places
where mddev_suspend() will be called later.

Prepare to remove the mddev_suspend() from
mddev_create/destroy_serial_pool().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
1b0a2d950e md: use new apis to suspend array for ioctls involed array reconfiguration
'reconfig_mutex' will be grabbed before these ioctls, suspend array
before holding the lock, so that io won't concurrent with array
reconfiguration through ioctls.

This is not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-13-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
cfa078c8b8 md: use new apis to suspend array for adding/removing rdev from state_store()
User can write 'remove' and 're-add' to trigger array reconfiguration
through sysfs, suspend array in this case so that io won't concurrent
with array reconfiguration.

And now that all the caller of add_bound_rdev() alread suspend the
array, remove mddev_suspend/resume() from add_bound_rdev() as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
205669f377 md: use new apis to suspend array for sysfs apis
Convert to use new apis in following sysfs apis:
 - level_store
 - suspend_lo_store
 - suspend_hi_store
 - serialize_policy_store

These are not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-11-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
e28ca92fbb md/raid5: use new apis to suspend array
Convert to use new apis, the old apis will be removed eventually.

These are not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-10-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
1b172e0b11 md/raid5-cache: use new apis to suspend array
Convert to use new apis, the old apis will be removed eventually.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-9-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
3cddf86a57 md/md-bitmap: use new apis to suspend array for location_store()
Convert to use new apis, the old apis will be removed eventually.

This is not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-8-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
4eb3327aa2 md/dm-raid: use new apis to suspend array
Convert to use new apis, the old apis will be removed eventually.

These are not hot path, so performance is not concerned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-7-yukuai1@huaweicloud.com
2023-10-10 18:49:50 -07:00
Yu Kuai
f45461e24f md: add new helpers to suspend/resume and lock/unlock array
The new helpers suspend the array first and then lock the array,

Prepare to refactor from:

mddev_lock/lock_nointr
mddev_suspend
...
mddev_resuem
mddev_lock

With:

mddev_suspend_and_lock/lock_nointr
...
mddev_unlock_and_resume

After all the use cases is refactored, mddev_suspend/resume() will be
removed.

And mddev_suspend_and_lock() will also replace mddev_lock() for the case
that the array will be reconfigured, in order to synchronize with io to
prevent problems in many corner cases.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com
2023-10-10 18:49:49 -07:00
Yu Kuai
714d20150e md: add new helpers to suspend/resume array
Advantages for new apis:
 - reconfig_mutex is not required;
 - the weird logical that suspend array hold 'reconfig_mutex' for
   mddev_check_recovery() to update superblock is not needed;
 - the specail handling, 'pers->prepare_suspend', for raid456 is not
   needed;
 - It's safe to be called at any time once mddev is allocated, and it's
   designed to be used from slow path where array configuration is changed;
 - the new helpers is designed to be called before mddev_lock(), hence
   it support to be interrupted by user as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com
2023-10-10 18:49:49 -07:00
Yu Kuai
2e82248b70 md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
Prepare to cleanup pers->prepare_suspend(), which is used to fix a
deadlock in raid456 by returning error for io that is waiting for
reshape to make progress in mddev_suspend().

This change will allow reshape to make progress while waiting for io to
be done in mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com
2023-10-10 18:49:49 -07:00
Yu Kuai
06a4d0d8c6 md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
'conf->log' is set with 'reconfig_mutex' grabbed, however, readers are
not procted, hence protect it with READ_ONCE/WRITE_ONCE to prevent
reading abnormal values.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-3-yukuai1@huaweicloud.com
2023-10-10 18:49:49 -07:00
Yu Kuai
617787f138 md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
Protect 'suspend_lo' and 'suspend_hi' with READ_ONCE/WRITE_ONCE to prevent
reading abnormal values.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231010151958.145896-2-yukuai1@huaweicloud.com
2023-10-10 18:49:49 -07:00
Yu Kuai
9e55a22fce md/raid1: don't split discard io for write behind
Currently, discad io is treated the same as normal write io, and for
write behind case, io size is limited to:

BIO_MAX_VECS * (PAGE_SIZE >> 9)

For 0.5KB sector size and 4KB PAGE_SIZE, this is just 1MB. For
consequence, if 'WriteMostly' is set to one of the underlying disks,
then diskcard io will be splited into 1MB and it will take a long time
for the diskcard to finish.

Fix this problem by disable write behind for discard io.

Reported-by: Roman Mamedov <rm@romanrm.net>
Closes: https://lore.kernel.org/all/6a1165f7-c792-c054-b8f0-1ad4f7b8ae01@ultracoder.org/
Reported-and-tested-by: Kirill Kirilenko <kirill@ultracoder.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231007112105.407449-1-yukuai1@huaweicloud.com
2023-10-09 16:32:02 -07:00
Mariusz Tkaczyk
09f894affc md: do not require mddev_lock() for all options in array_state_store()
We don't need to lock device to reject not supported request
in array_state_store(). No functional changes intended.

There are differences between ioctl and sysfs handling during stopping.
With this change, it will be easier to add additional steps which needs
to be completed before mddev_lock() is taken.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230928125517.12356-1-mariusz.tkaczyk@linux.intel.com
2023-09-28 13:34:57 -07:00
Yu Kuai
cf1b6d4441 md: simplify md_seq_ops
Before this patch, the implementation is hacky and hard to understand:

1) md_seq_start set pos to 1;
2) md_seq_show found pos is 1, then print Personalities;
3) md_seq_next found pos is 1, then it update pos to the first mddev;
4) md_seq_show found pos is not 1 or 2, show mddev;
5) md_seq_next found pos is not 1 or 2, update pos to next mddev;
6) loop 4-5 until the last mddev, then md_seq_next update pos to 2;
7) md_seq_show found pos is 2, then print unused devices;
8) md_seq_next found pos is 2, stop;

This patch remove the magic value and use seq_list_start/next/stop()
directly, and move printing "Personalities" to md_seq_start(),
"unsed devices" to md_seq_stop():

1) md_seq_start print Personalities, and then set pos to first mddev;
2) md_seq_show show mddev;
3) md_seq_next update pos to next mddev;
4) loop 2-3 until the last mddev;
5) md_seq_stop print unsed devices;

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com
2023-09-27 13:54:26 -07:00
Yu Kuai
3d8d32873c md: factor out a helper from mddev_put()
There are no functional changes, prepare to simplify md_seq_ops in next
patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com
2023-09-27 13:54:26 -07:00
Justin Stitt
ceb0416383 md: replace deprecated strncpy with memcpy
`strncpy` is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

There are three such strncpy uses that this patch addresses:

The respective destination buffers are:
1) mddev->clevel
2) clevel
3) mddev->metadata_type

We expect mddev->clevel to be NUL-terminated due to its use with format
strings:
|       ret = sprintf(page, "%s\n", mddev->clevel);

Furthermore, we can see that mddev->clevel is not expected to be
NUL-padded as `md_clean()` merely set's its first byte to NULL -- not
the entire buffer:
|       static void md_clean(struct mddev *mddev)
|       {
|       	mddev->array_sectors = 0;
|       	mddev->external_size = 0;
|               ...
|       	mddev->level = LEVEL_NONE;
|       	mddev->clevel[0] = 0;
|               ...

A suitable replacement for this instance is `memcpy` as we know the
number of bytes to copy and perform manual NUL-termination at a
specified offset. This really decays to just a byte copy from one buffer
to another. `strscpy` is also a considerable replacement but using
`slen` as the length argument would result in truncation of the last
byte unless something like `slen + 1` was provided which isn't the most
idiomatic strscpy usage.

For the next case, the destination buffer `clevel` is expected to be
NUL-terminated based on its usage within kstrtol() which expects
NUL-terminated strings. Note that, in context, this code removes a
trailing newline which is seemingly not required as kstrtol() can handle
trailing newlines implicitly. However, there exists further usage of
clevel (or buf) that would also like to have the newline removed. All in
all, with similar reasoning to the first case, let's just use memcpy as
this is just a byte copy and NUL-termination is handled manually.

The third and final case concerning `mddev->metadata_type` is more or
less the same as the other two. We expect that it be NUL-terminated
based on its usage with seq_printf:
|       seq_printf(seq, " super external:%s",
|       	   mddev->metadata_type);
... and we can surmise that NUL-padding isn't required either due to how
it is handled in md_clean():
|       static void md_clean(struct mddev *mddev)
|       {
|       ...
|       mddev->metadata_type[0] = 0;
|       ...

So really, all these instances have precisely calculated lengths and
purposeful NUL-termination so we can just use memcpy to remove ambiguity
surrounding strncpy.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b@google.com
2023-09-25 14:36:41 -07:00
Kees Cook
e887544d76 md/md-linear: Annotate struct linear_conf with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct linear_conf.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230915200328.never.064-kees@kernel.org
2023-09-22 11:23:01 -07:00
Yu Kuai
54d21eb6ad md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store()
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to remove such checking.

This will also allow the array to be suspended even before the array
is ran.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com
2023-09-22 10:28:27 -07:00
Yu Kuai
a2a9f16838 md: don't check 'mddev->pers' from suspend_hi_store()
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to remove such checking.

This will also allow the array to be suspended even before the array
is ran.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com
2023-09-22 10:28:27 -07:00
Yu Kuai
158d32af87 md-bitmap: suspend array earlier in location_store()
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to call mddev_suspend() earlier.

This will also be helper to refactor mddev_suspend() later.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com
2023-09-22 10:28:27 -07:00
Yu Kuai
b71fe4ac75 md-bitmap: remove the checking of 'pers->quiesce' from location_store()
After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"),
there is no need to check 'pers->quiesce' before calling
mddev_suspend().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com
2023-09-22 10:28:27 -07:00
Yu Kuai
b721e7885e md: don't rely on 'mddev->pers' to be set in mddev_suspend()
'active_io' used to be initialized while the array is running, and
'mddev->pers' is set while the array is running as well. Hence caller
must hold 'reconfig_mutex' and guarantee 'mddev->pers' is set before
calling mddev_suspend().

Now that 'active_io' is initialized when mddev is allocated, such
restriction doesn't exist anymore. In the meantime, follow up patches
will refactor mddev_suspend(), hence add checking for 'mddev->pers' to
prevent null-ptr-deref.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-4-yukuai1@huaweicloud.com
2023-09-22 10:28:26 -07:00
Yu Kuai
b8494823e2 md: initialize 'writes_pending' while allocating mddev
Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleing mddev, instead of pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' to be
initialized multiple times, this usage is safe but a litter weird.

On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in common layer, for example:

array_state_store
 set_in_sync
  if (!mddev->in_sync) -> in_sync is used for all levels
   // access writes_pending

There might be some implicit dependency that I don't recognized to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.

By the way, it make sense to initialize 'writes_pending' in common layer
because there are already three levels use it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
2023-09-22 10:28:26 -07:00
Yu Kuai
d58eff83bd md: initialize 'active_io' while allocating mddev
'active_io' is used for mddev_suspend() and it's initialized in
md_run(), this restrict that 'reconfig_mutex' must be held and
"mddev->pers" must be set before calling mddev_suspend().

Initialize 'active_io' early so that mddev_suspend() is safe to call
once mddev is allocated, this will be helpful to refactor
mddev_suspend() in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com
2023-09-22 10:28:26 -07:00
Yu Kuai
81e2ce1b3d md: delay remove_and_add_spares() for read only array to md_start_sync()
Before this patch, for read-only array:

md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, then it will
call remove_and_add_spares() directly to try to remove and add rdevs
from array.

After this patch:

1) md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, and the
   worker 'sync_work' is not pending, and there are rdevs can be added
   or removed, then it will queue new work md_start_sync();
2) md_start_sync() will call remove_and_add_spares() and exist;

This change make sure that array reconfiguration is independent from
daemon, and it'll be much easier to synchronize it with io, consier
that io may rely on daemon thread to be done.

Also fix a problem that 'pers->spars_active' is called after
remove_and_add_spares(), which order is wrong, because spares must
active first, and then remove_and_add_spares() can add spares to the
array, like what read-write case does:

1) daemon set 'MD_RECOVERY_RUNNING', register new sync thread to do
   recovery;
2) recovery is done, md_do_sync() set 'MD_RECOVERY_DONE' before return;
3) daemon call 'pers->spars_active', and clear 'MD_RECOVERY_RUNNING';
4) in the next round of daemon, call remove_and_add_spares() to add
   spares to the array.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-09-22 10:28:26 -07:00
Yu Kuai
a0ae7e4e0b md: factor out a helper rdev_addable() from remove_and_add_spares()
There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-7-yukuai1@huaweicloud.com
2023-09-22 10:28:25 -07:00
Yu Kuai
b172a0704d md: factor out a helper rdev_is_spare() from remove_and_add_spares()
There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com
2023-09-22 10:28:25 -07:00
Yu Kuai
3389d57f97 md: factor out a helper rdev_removeable() from remove_and_add_spares()
There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-5-yukuai1@huaweicloud.com
2023-09-22 10:28:25 -07:00
Yu Kuai
db5e653d7c md: delay choosing sync action to md_start_sync()
Before this patch, for read-write array:

1) md_check_recover() found that something need to be done, and it'll
   try to grab 'reconfig_mutex'. The case that md_check_recover() need
   to do something:
   - array is not suspend;
   - super_block need to be updated;
   - 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set;
   - unusual case related to safemode;

2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
   md_check_recover() will try to choose a sync action, and then queue a
   work md_start_sync().

3) md_start_sync() register sync_thread;

After this patch,

1) is the same;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
   queue a work md_start_sync() directly;
3) md_start_sync() will try to choose a sync action, and then register
   sync_thread();

Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2)
and 3) and md_do_sync() is always ran in serial and they can never
concurrent, this change should not introduce any behavior change for now.

Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING'
without protection in error path, which might affect the logical in
md_check_recovery().

The advantage to change this is that array reconfiguration is
independent from daemon now, and it'll be much easier to synchronize it
with io, consider that io may rely on daemon thread to be done.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com
2023-09-22 10:28:25 -07:00
Yu Kuai
897c62a1ca md: factor out a helper to choose sync action from md_check_recovery()
There are no functional changes, on the one hand make the code cleaner,
on the other hand prevent following checkpatch error in the next patch to
delay choosing sync action to md_start_sync().

ERROR: do not use assignment in if condition
+       } else if ((spares = remove_and_add_spares(mddev, NULL))) {

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com
2023-09-22 10:28:25 -07:00
Yu Kuai
ac61978196 md: use separate work_struct for md_start_sync()
It's a little weird to borrow 'del_work' for md_start_sync(), declare
a new work_struct 'sync_work' for md_start_sync().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com
2023-09-22 10:28:24 -07:00
Chengming Zhou
d78bfa1346 block/null_blk: add queue_rqs() support
Add batched mq_ops.queue_rqs() support in null_blk for testing. The
implementation is much easy since null_blk doesn't have commit_rqs().

We simply handle each request one by one, if errors are encountered,
leave them in the passed in list and return back.

There is about 3.6% improvement in IOPS of fio/t/io_uring on null_blk
with hw_queue_depth=256 on my test VM, from 1.09M to 1.13M.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-6-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
217b613a53 blk-mq: update driver tags request table when start request
Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.

Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
434097ee37 blk-mq: support batched queue_rqs() on shared tags queue
Since active requests have been accounted when allocate driver tags,
we can remove this limit now.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
48554df6bf blk-mq: remove RQF_MQ_INFLIGHT
Since the previous patch change to only account active requests when
we really allocate the driver tag, the RQF_MQ_INFLIGHT can be removed
and no double account problem.

1. none elevator: flush request will use the first pending request's
   driver tag, won't double account.

2. other elevator: flush request will be accounted when allocate driver
   tag when issue, and will be unaccounted when it put the driver tag.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
b8643d6826 blk-mq: account active requests when get driver tag
There is a limit that batched queue_rqs() can't work on shared tags
queue, since the account of active requests can't be done there.

Now we account the active requests only in blk_mq_get_driver_tag(),
which is not the time we get driver tag actually (with none elevator).

To support batched queue_rqs() on shared tags queue, we move the
account of active requests to where we get the driver tag:

1. none elevator: blk_mq_get_tags() and blk_mq_get_tag()
2. other elevator: __blk_mq_alloc_driver_tag()

This is clearer and match with the unaccount side, which just happen
when we put the driver tag.

The other good point is that we don't need RQF_MQ_INFLIGHT trick
anymore, which used to avoid double account of flush request.
Now we only account when actually get the driver tag, so all is good.
We will remove RQF_MQ_INFLIGHT in the next patch.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Linus Torvalds
ce9ecca023 Linux 6.6-rc2 2023-09-17 14:40:24 -07:00
Linus Torvalds
e789286468 Misc fixes:
- Fix an UV boot crash,
 - Skip spurious ENDBR generation on _THIS_IP_,
 - Fix ENDBR use in putuser() asm methods,
 - Fix corner case boot crashes on 5-level paging,
 - and fix a false positive WARNING on LTO kernels.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOrwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j6Jw/+PjUfh/4+KM/Z8VzcBy2UhY3NMF2ptGCN
 47FPLy+8ADqOvIfgsPsBEO9VXwdvkHfH64YWRUlULjvPNOSs+37drBYMe9AI9xKE
 u6NhoBHmsnOtoLkBFIQYlJys9GyAeoMlwSSHxzRwQS+3VztRjoH636jiBcg/h7DR
 zhakfnJD1SSOZuEyyDPnO0uIUarrcqC2jdZwucSqZnvZFdA/pexUHQEe2RtMXLln
 EIA5kuEo7UdADcSr/Lbty7MKO+6xpRTjxF0yNItPtwPWsnxrSAC7P+dQ37uB975U
 Z0CJ/kw54XG5Sdoech7XCWYmJzDxSPhiziA1USKpZJMo5phAG/apTJK6NG4TG94r
 U+3QhLopUoSd8N74/VtZn0FjOrMsk7YKD5o8twlTbpCd2VaiJk4X5oBIns6Tvq05
 0Vmsx15XO3mEzg1wWbbdjhjHW0czRgBpikS9QyexZKVkukYcW8QT6bk1gK1ypg94
 do4PHKB53QIt31dedy/ddpQj4u1sJ4+a9/+IG29kjUB33M4+uFedtw11vfe+CDN0
 XLRc6HfPblogIIEO/figJgwSq/TPCOsNHTwKkejq+D1Ey6DsrnX9Gg4BWVz/3dDW
 84SW7TaO2FGEDh4NkR2ijkZlbpofFnCvhCh/uohodPlqDrTVTuRKCZW9I5plmGVa
 qeXUpNDFs1E=
 =BMjT
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

   - Fix an UV boot crash

   - Skip spurious ENDBR generation on _THIS_IP_

   - Fix ENDBR use in putuser() asm methods

   - Fix corner case boot crashes on 5-level paging

   - and fix a false positive WARNING on LTO kernels"

* tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/purgatory: Remove LTO flags
  x86/boot/compressed: Reserve more memory for page tables
  x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*()
  x86/ibt: Suppress spurious ENDBR
  x86/platform/uv: Use alternate source for socket to node data
2023-09-17 11:13:37 -07:00
Linus Torvalds
e5a710d132 Fix a performance regression on large SMT systems, an Intel SMT4
balancing bug, and a topology setup bug on (Intel) hybrid processors.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOVQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iOahAAj3YsoNbT/k6m9yp622n1OopaNEQvsK+/
 F2Q5g/hJrm3+W5764rF8CvjhDbmrv6owjp3yUyZLDIfSAFZYMvwoNody3a373Yr3
 VFBMJ00jNIv/TAFCJZYeybg3yViwObKKfpu4JBj//QU+4uGWCoBMolkVekU2bBti
 r50fMxBPgg2Yic57DCC8Y+JZzHI/ydQ3rvVXMzkrTZCO/zY4/YmERM9d+vp4wl4B
 uG9cfXQ4Yf/1gZo0XDlTUkOJUXPnkMgi+N4eHYlGuyOCoIZOfATI24hRaPBoQcdx
 PDwHcKmyNxH9iaRppNQMvi797g3KrKVEmZwlZg1IfsILhKC0F4GsQ85tw8qQWE8j
 brFPkWVUxAUSl4LXoqVInaxDHmJWR2UC3RA7b+BxFF/GMLTow0z4a+JMC6eKlNyR
 uBisZnuEuecqwF9TLhyD3KBHh7PihUPz8PuFHk+Um5sggaUE82I+VwX6uxEi5y8r
 ke2kAkpuMxPWT5lwDmFPAXWfvpZz5vvTIRUxGGj2+s4d8b0YfLtZsx5+uOIacaub
 Gw+wYFfSowph72tR/SUVq0An/UTSPPBxty8eFIVeE6sW9bw3ghTtkf8300xjV7Rj
 sKVxXy/podAo8wG7R6aZfTfsCpohmeEjskiatYdThYamPPx7V0R5pq4twmTXTHLJ
 bFvQ1GFCOu0=
 =jIeN
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Fix a performance regression on large SMT systems, an Intel SMT4
  balancing bug, and a topology setup bug on (Intel) hybrid processors"

* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
  sched/fair: Fix SMT4 group_smt_balance handling
  sched/fair: Optimize should_we_balance() for large SMT systems
2023-09-17 11:10:23 -07:00
Linus Torvalds
e54ca3c81f Fix a cold functions related false-positive objtool warning
that triggers on Clang.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOFwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hQdRAAsekH6786PH2hiL7DL1KhCZMdC1V71ORr
 3YFj1LcG+mXB6nQLt961KgA4l4efGMMxBxhT47wqOm0tJXUOVSzXxi3aQ0eoIPH0
 m5MnSWEyfZRjcvNjS8IZ2N8CJr1AvnSZPJ3iaJD2knNqHOCMORXbrhXnc9ulL3PR
 r1eBaaylLtlhHUdvekUeW8qZBAFx3ZzWz3lf0IY8seBbBPTXVp6dS4PPMzZ5vwTB
 e9yyOiLaF1P5mNZnOBNfEVKTQTmaFECDRp9PhGcTxY0GY4+9apyD5h/aDJwRJyFN
 ciB+zvmxw3mjlhCCG1CllImjz/gvzdwqzxeYlHPyZvEbnuJqCkdBLSgRGwi9vtyw
 APsHYYAHr6CNR/15/PvmX+GGR6No0OkR9BoZL5ygJE5+sapKvyeItymqovRRKGZ/
 kEQK2fj6EiDiy2EejMZ9EFUtWfhkV5OkT0Jd0nd/ZxZi3UbBEfqq6JgSIe/+KzC3
 Iniovn77mpQHP1cM/OGbPByOMUygjNBwigCwo12imxrktud+/HQJ74gX7cBsYKEH
 fKbAbHoLpC7/hqGc/3nzZF7b1pBMf4Lehm6iePsXai6Fv9hO7/T5RH54xJGp4HTO
 EexuFJt/d7l4ymtGtO8i/V65iiVkXsddnBivYfOqisxwB0s5BcMgsh4XrWMGd8Q4
 KP9fcsOtUKM=
 =DIBp
 -----END PGP SIGNATURE-----

Merge tag 'objtool-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fix from Ingo Molnar:
 "Fix a cold functions related false-positive objtool warning that
  triggers on Clang"

* tag 'objtool-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix _THIS_IP_ detection for cold functions
2023-09-17 10:59:37 -07:00