IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The way that best rdev is chosen:
1) If the read is sequential from one rdev:
- if rdev is rotational, use this rdev;
- if rdev is non-rotational, use this rdev until total read length
exceed disk opt io size;
2) If the read is not sequential:
- if there is idle disk, use it, otherwise:
- if the array has non-rotational disk, choose the rdev with minimal
inflight IO;
- if all the underlaying disks are rotational disk, choose the rdev
with closest IO;
There are no functional changes, just to make code cleaner and prepare
for following refactor.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com
If resync is in progress, read_balance() should find the first usable
disk, otherwise, data could be inconsistent after resync is done. raid1
and raid10 implement the same checking, hence factor out the checking
to make code cleaner.
Noted that raid1 is using 'mddev->recovery_cp', which is updated after
all resync IO is done, while raid10 is using 'conf->next_resync', which
is inaccurate because raid10 update it before submitting resync IO.
Fortunately, raid10 read IO can't concurrent with resync IO, hence there
is no problem. And this patch also switch raid10 to use
'mddev->recovery_cp'.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com
Commit 12cee5a8a2 ("md/raid1: prevent merging too large request") add
the case choose next idle in read_balance():
read_balance:
for_each_rdev
if(next_seq_sect == this_sector || dist == 0)
-> sequential reads
best_disk = disk;
if (...)
choose_next_idle = 1
continue;
for_each_rdev
-> iterate next rdev
if (pending == 0)
best_disk = disk;
-> choose the next idle disk
break;
if (choose_next_idle)
-> keep using this rdev if there are no other idle disk
contine
However, commit 2e52d449bc ("md/raid1: add failfast handling for reads.")
remove the code:
- /* If device is idle, use it */
- if (pending == 0) {
- best_disk = disk;
- break;
- }
Hence choose next idle will never work now, fix this problem by
following:
1) don't set best_disk in this case, read_balance() will choose the best
disk after iterating all the disks;
2) add 'pending' so that other idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk, and if
there is no other idle disk, 'sequential_disk' will be chosen;
Fixes: 2e52d449bc ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com
For raid1, each read will iterate all the rdevs from conf and check if
any rdev is non-rotational, then choose rdev with minimal IO inflight
if so, or rdev with closest distance otherwise.
Disk nonrot info can be changed through sysfs entry:
/sys/block/[disk_name]/queue/rotational
However, consider that this should only be used for testing, and user
really shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in IO fast path and simplify
read_balance() a little bit.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com
The current api is_badblock() must pass in 'first_bad' and
'bad_sectors', however, many caller just want to know if there are
badblocks or not, and these caller must define two local variable that
will never be used.
Add a new helper rdev_has_badblock() that will only return if there are
badblocks or not, remove unnecessary local variables and replace
is_badblock() with the new helper in many places.
There are no functional changes, and the new helper will also be used
later to refactor read_balance().
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com
RCU protection was removed in the commit 2d32777d60 ("raid1: remove rcu
protection to access rdev from conf").
However, the code in fix_read_error does rcu_dereference outside
rcu_read_lock - this triggers the following warning. The warning is
triggered by a LVM2 test shell/integrity-caching.sh.
This commit removes rcu_dereference.
=============================
WARNING: suspicious RCU usage
6.7.0 #2 Not tainted
-----------------------------
drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by mdX_raid1/1859.
stack backtrace:
CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x70
lockdep_rcu_suspicious+0x153/0x1b0
raid1d+0x1732/0x1750 [raid1]
? lock_acquire+0x9f/0x270
? finish_wait+0x3d/0x80
? md_thread+0xf7/0x130 [md_mod]
? lock_release+0xaa/0x230
? md_register_thread+0xd0/0xd0 [md_mod]
md_thread+0xa0/0x130 [md_mod]
? housekeeping_test_cpu+0x30/0x30
kthread+0xdc/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x28/0x40
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: ca294b34aa ("md/raid1: support read error check")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/51539879-e1ca-fde3-b8b4-8934ddedcbc@redhat.com
Because it's safe to accees rdev from conf:
- If any spinlock is held, because synchronize_rcu() from
md_kick_rdev_from_array() will prevent 'rdev' to be freed until
spinlock is released;
- If 'reconfig_lock' is held, because rdev can't be added or removed from
array;
- If there is normal IO inflight, because mddev_suspend() will prevent
rdev to be added or removed from array;
- If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is
checked in remove_and_add_spares().
And these will cover all the scenarios in raid1.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com
rcu is not used correctly here, because synchronize_rcu() is called
before replacing old value, for example:
remove_and_add_spares // other path
synchronize_rcu
// called before replacing old value
set_bit(RemoveSynchronized)
rcu_read_lock()
rdev = conf->mirros[].rdev
pers->hot_remove_disk
conf->mirros[].rdev = NULL;
if (!test_bit(RemoveSynchronized))
synchronize_rcu
/*
* won't be called, and won't wait
* for concurrent readers to be done.
*/
// access rdev after remove_and_add_spares()
rcu_read_unlock()
Fortunately, there is a separate rcu protection to prevent such rdev
to be freed:
md_kick_rdev_from_array //other path
rcu_read_lock()
rdev = conf->mirros[].rdev
list_del_rcu(&rdev->same_set)
rcu_read_unlock()
/*
* rdev can be removed from conf, but
* rdev won't be freed.
*/
synchronize_rcu()
free rdev
Hence remove this useless flag and prepare to remove rcu protection to
access rdev from 'conf'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com
Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleing mddev, instead of pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' to be
initialized multiple times, this usage is safe but a litter weird.
On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in common layer, for example:
array_state_store
set_in_sync
if (!mddev->in_sync) -> in_sync is used for all levels
// access writes_pending
There might be some implicit dependency that I don't recognized to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.
By the way, it make sense to initialize 'writes_pending' in common layer
because there are already three levels use it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
There is a compile error when this commit is added:
md: raid1: fix potential OOB in raid1_remove_disk()
drivers/md/raid1.c: In function 'raid1_remove_disk':
drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations
and code [-Werror=declaration-after-statement]
1844 | struct raid1_info *p = conf->mirrors + number;
| ^~~~~~
That's because the new code was inserted before the struct.
The change is move the struct command above this commit.
Fixes: 8b0472b50b ("md: raid1: fix potential OOB in raid1_remove_disk()")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com
wake_up is called unconditionally in a few paths such as make_request(),
which cause lock contention under high concurrency workload like below
raid1_end_write_request
wake_up
__wake_up_common_lock
spin_lock_irqsave
Improve performance by only call wake_up() if waitqueue is not empty
Fio test script:
[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]
Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server.
Before this patch With this patch
READ BW=4621MB/s BW=7337MB/s
WRITE BW=1980MB/s BW=3144MB/s
The patch is inspired by Yu Kuai's change for raid10:
https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com
Signed-off-by: Song Liu <song@kernel.org>
Two problems can be fixed this way:
1) 'active_io' will represent inflight io instead of io that is
dispatching.
2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:
Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:
[root@fedora ~]# cat /sys/block/md0/inflight
0 4474191
Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
current->bio_list will be set under submit_bio() context, in this case
bitmap io will be added to the list and wait for current io submission to
finish, while current io submission must wait for bitmap io to be done.
commit 874807a831 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix
the deadlock by handling plugged bio by daemon thread.
On the one hand, the deadlock won't exist after commit a214b949d8
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, current solution makes it impossible to flush plugged bio
in raid1/10_make_request(), because this will cause that all the writes
will goto daemon thread.
In order to limit the number of plugged bio, commit 874807a831
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:
1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().
Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.
Fix this problem by protecting all md_thread with rcu.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
Use the %pg format specifier to save on stack consumption and code size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
There is no direct mechanism to determine raid failure outside
personality. It is done by checking rdev->flags after executing
md_error(). If "faulty" flag is not set then -EBUSY is returned to
userspace. -EBUSY means that array will be failed after drive removal.
Mdadm has special routine to handle the array failure and it is executed
if -EBUSY is returned by md.
There are at least two known reasons to not consider this mechanism
as correct:
1. drive can be removed even if array will be failed[1].
2. -EBUSY seems to be wrong status. Array is not busy, but removal
process cannot proceed safe.
-EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch first issue is resolved by adding support
for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
next commit.
The idea is to set the MD_BROKEN if we are sure that raid is in failed
state now. This is done in each error_handler(). In md_error() MD_BROKEN
flag is checked. If is set, then -EBUSY is returned to userspace.
As in previous commit, it causes that #mdadm --set-faulty is able to
fail array. Previously proposed workaround is valid if optional
functionality[1] is disabled.
[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
RAID1/RAID10.")
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>