IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 1727fd5015d8f93474148f94e34cda5aa6ad4a43 upstream.
Current code produces a warning as shown below when total characters
in the constituent block device names plus the slashes exceeds 200.
snprintf() returns the number of characters generated from the given
input, which could cause the expression “200 – len” to wrap around
to a large positive number. Fix this by using scnprintf() instead,
which returns the actual number of characters written into the buffer.
[ 1513.267938] ------------[ cut here ]------------
[ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510
[ 1513.267944] Modules linked in: <snip>
[ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu
[ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510
<-snip->
[ 1513.267982] Call Trace:
[ 1513.267986] snprintf+0x45/0x70
[ 1513.267990] ? disk_name+0x71/0xa0
[ 1513.267993] dump_zones+0x114/0x240 [raid0]
[ 1513.267996] ? _cond_resched+0x19/0x40
[ 1513.267998] raid0_run+0x19e/0x270 [raid0]
[ 1513.268000] md_run+0x5e0/0xc50
[ 1513.268003] ? security_capable+0x3f/0x60
[ 1513.268005] do_md_run+0x19/0x110
[ 1513.268006] md_ioctl+0x195e/0x1f90
[ 1513.268007] blkdev_ioctl+0x91f/0x9f0
[ 1513.268010] block_ioctl+0x3d/0x50
[ 1513.268012] do_vfs_ioctl+0xa9/0x640
[ 1513.268014] ? __fput+0x162/0x260
[ 1513.268016] ksys_ioctl+0x75/0x80
[ 1513.268017] __x64_sys_ioctl+0x1a/0x20
[ 1513.268019] do_syscall_64+0x5e/0x200
[ 1513.268021] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 766038846e875 ("md/raid0: replace printk() with pr_*()")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74 ]
A complicated deadlock exists when using the journal and an elevated
group_thrtead_cnt. It was found with loop devices, but its not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
When the deadlock occurs, multiple threads will hang in different ways:
1) The group threads will hang in the blk-wbt code with bios waiting to
be submitted to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
ops_run_io+0x46b/0x1a30
handle_stripe+0xcd3/0x36b0
handle_active_stripes.constprop.0+0x6f6/0xa60
raid5_do_work+0x177/0x330
Or:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
flush_deferred_bios+0x136/0x170
raid5_do_work+0x262/0x330
2) The r5l_reclaim thread will hang in the same way, submitting a
bio to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
submit_bio+0x3f/0xf0
md_super_write+0x12f/0x1b0
md_update_sb.part.0+0x7c6/0xff0
md_update_sb+0x30/0x60
r5l_do_reclaim+0x4f9/0x5e0
r5l_reclaim_thread+0x69/0x30b
However, before hanging, the MD_SB_CHANGE_PENDING flag will be
set for sb_flags in r5l_write_super_and_discard_space(). This
flag will never be cleared because the submit_bio() call never
returns.
3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
will do no processing on any pending stripes and re-set
STRIPE_HANDLE. This will cause the raid5d thread to enter an
infinite loop, constantly trying to handle the same stripes
stuck in the queue.
The raid5d thread has a blk_plug that holds a number of bios
that are also stuck waiting seeing the thread is in a loop
that never schedules. These bios have been accounted for by
blk-wbt thus preventing the other threads above from
continuing when they try to submit bios. --Deadlock.
To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
from the same thread and can thus deadlock if the thread is put to
sleep. So avoid waiting if md_check_recovery() is being called in the
loop.
It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:
16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
is updated.")
Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d2d05b88035d2d51a5bb6c5afec88a0880c73df4 ]
Inside set_at_max_writeback_rate() the calculation in following if()
check is wrong,
if (atomic_inc_return(&c->idle_counter) <
atomic_read(&c->attached_dev_nr) * 6)
Because each attached backing device has its own writeback thread
running and increasing c->idle_counter, the counter increates much
faster than expected. The correct calculation should be,
(counter / dev_nr) < dev_nr * 6
which equals to,
counter < dev_nr * dev_nr * 6
This patch fixes the above mistake with correct calculation, and helper
routine idle_counter_exceeded() is added to make code be more clear.
Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2eed85bc75138a9eeb63863d20f8904ac42a577 ]
When doing degrade/recover tests using the journal a kernel BUG
is hit at drivers/md/raid5.c:4381 in handle_parity_checks5():
BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
This was found to occur because handle_stripe_fill() was skipped
for stripes in the journal due to a condition in that function.
Thus blocks were not fetched and R5_UPTODATE was not set when
the code reached handle_parity_checks5().
To fix this, don't skip handle_stripe_fill() unless the stripe is
for read.
Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path")
Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0dd84b319352bb8ba64752d4e45396d8b13e6018 upstream.
From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.
Let's stop write first in destructor to align with normal md-raid to
fix the KASAN issue.
[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 104212471b1c1817b311771d817fb692af983173 ]
In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and
may cause sh to be released. However, sh is subsequently used in lines
2886 "if (sh->batch_head && sh != sh->batch_head)". This may result in an
use-after-free bug.
It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of
the function.
Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ca7dc242e358e46d963b32f9d9dd829785a9e957 upstream.
dm-writecache has the capability to limit the number of writeback jobs
in progress. However, this feature was off by default. As such there
were some out-of-memory crashes observed when lowering the low
watermark while the cache is full.
This commit enables writeback limit by default. It is set to 256MiB or
1/16 of total system memory, whichever is smaller.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3534e5a5ed2997ca1b00f44a0378a075bd05e8a3 upstream.
Fault inject on pool metadata device reports:
BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950
CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_address_description.constprop.0.cold+0xeb/0x3f4
kasan_report.cold+0xe6/0x147
dm_pool_register_metadata_threshold+0x40/0x80
pool_ctr+0xa0a/0x1150
dm_table_add_target+0x2c8/0x640
table_load+0x1fd/0x430
ctl_ioctl+0x2c4/0x5a0
dm_ctl_ioctl+0xa/0x10
__x64_sys_ioctl+0xb3/0xd0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This can be easily reproduced using:
echo offline > /sys/block/sda/device/state
dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"
If a metadata commit fails, the transaction will be aborted and the
metadata space maps will be destroyed. If a DM table reload then
happens for this failed thin-pool, a use-after-free will occur in
dm_sm_register_threshold_callback (called from
dm_pool_register_metadata_threshold).
Fix this by in dm_pool_register_metadata_threshold() by returning the
-EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr()
with a new error message: "Error registering metadata threshold".
Fixes: ac8c3f3df65e4 ("dm thin: generate event when metadata threshold passed")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1fbeea217d8f297fe0e0956a1516d14ba97d0396 upstream.
There is this warning when using a kernel with the address sanitizer
and running this testsuite:
https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid
==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x9c
print_address_description.constprop.0+0x1f/0x1e0
print_report.cold+0x55/0x244
kasan_report+0xc9/0x100
raid_status+0x1747/0x2820 [dm_raid]
dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
table_load+0x35c/0x630 [dm_mod]
ctl_ioctl+0x411/0x630 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
__x64_sys_ioctl+0x12a/0x1a0
do_syscall_64+0x5b/0x80
The warning is caused by reading conf->max_nr_stripes in raid_status. The
code in raid_status reads mddev->private, casts it to struct r5conf and
reads the entry max_nr_stripes.
However, if we have different raid type than 4/5/6, mddev->private
doesn't point to struct r5conf; it may point to struct r0conf, struct
r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
of these structs to struct r5conf, we will be reading invalid memory
and KASAN warns about it.
Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7dad24db59d2d2803576f2e3645728866a056dab upstream.
There is a KASAN warning in raid_resume when running the lvm test
lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
is greater than rs->raid_disks, so the loop touches one entry beyond
the allocated length.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e120a5f1e78fab6223544e425015f393d90d6f0d ]
Otherwise PR ops may be issued while the broader DM device is being
reconfigured, etc.
Fixes: 9c72bad1f31a ("dm: call PR reserve/unreserve on each underlying device")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d17f744e883b2f8d13cca252d71cfe8ace346f7d upstream.
There's a KASAN warning in raid10_remove_disk when running the lvm
test lvconvert-raid-reshape.sh. We fix this warning by verifying that the
value "number" is valid.
BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10]
Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682
CPU: 3 PID: 124682 Comm: mdX_raid10 Not tainted 5.19.0-rc6 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x45/0x57a
? __lock_text_start+0x18/0x18
? raid10_remove_disk+0x61/0x2a0 [raid10]
kasan_report+0xa8/0xe0
? raid10_remove_disk+0x61/0x2a0 [raid10]
raid10_remove_disk+0x61/0x2a0 [raid10]
Buffer I/O error on dev dm-76, logical block 15344, async page read
? __mutex_unlock_slowpath.constprop.0+0x1e0/0x1e0
remove_and_add_spares+0x367/0x8a0 [md_mod]
? super_written+0x1c0/0x1c0 [md_mod]
? mutex_trylock+0xac/0x120
? _raw_spin_lock+0x72/0xc0
? _raw_spin_lock_bh+0xc0/0xc0
md_check_recovery+0x848/0x960 [md_mod]
raid10d+0xcf/0x3360 [raid10]
? sched_clock_cpu+0x185/0x1a0
? rb_erase+0x4d4/0x620
? var_wake_function+0xe0/0xe0
? psi_group_change+0x411/0x500
? preempt_count_sub+0xf/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? raid10_sync_request+0x36c0/0x36c0 [raid10]
? preempt_count_sub+0xf/0xc0
? _raw_spin_unlock_irqrestore+0x19/0x40
? del_timer_sync+0xa9/0x100
? try_to_del_timer_sync+0xc0/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? _raw_spin_unlock_irq+0x11/0x24
? __list_del_entry_valid+0x68/0xa0
? finish_wait+0xa3/0x100
md_thread+0x161/0x260 [md_mod]
? unregister_md_personality+0xa0/0xa0 [md_mod]
? _raw_spin_lock_irqsave+0x78/0xc0
? prepare_to_wait_event+0x2c0/0x2c0
? unregister_md_personality+0xa0/0xa0 [md_mod]
kthread+0x148/0x180
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 124495:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x80/0xa0
setup_conf+0x140/0x5c0 [raid10]
raid10_run+0x4cd/0x740 [raid10]
md_run+0x6f9/0x1300 [md_mod]
raid_ctr+0x2531/0x4ac0 [dm_raid]
dm_table_add_target+0x2b0/0x620 [dm_mod]
table_load+0x1c8/0x400 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
L __fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
__fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff889108f3d200
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 0 bytes to the right of
256-byte region [ffff889108f3d200, ffff889108f3d300)
The buggy address belongs to the physical page:
page:000000007ef2a34c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1108f3c
head:000000007ef2a34c order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=2)
raw: 4000000000010200 0000000000000000 dead000000000001 ffff889100042b40
raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 617b365872a247480e9dcd50a32c8d1806b21861 upstream.
There's a KASAN warning in raid5_add_disk when running the LVM testsuite.
The warning happens in the test
lvconvert-raid-reshape-linear_to_raid6-single-type.sh. We fix the warning
by verifying that rdev->saved_raid_disk is within limits.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 332bd0778775d0cf105c4b9e03e460b590749916 upstream.
On dm-raid table load (using raid_ctr), dm-raid allocates an array
rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks
is defined by the number of raid metadata and image tupples passed
into the target's constructor.
In the case of RAID layout changes being requested, that number can be
different from the current number of members for existing raid sets as
defined in their superblocks. Example RAID layout changes include:
- raid1 legs being added/removed
- raid4/5/6/10 number of stripes changed (stripe reshaping)
- takeover to higher raid level (e.g. raid5 -> raid6)
When accessing array members, rs->raid_disks must be used in control
loops instead of the potentially larger value in rs->md.raid_disks.
Otherwise it will cause memory access beyond the end of the rs->devs
array.
Fix this by changing code that is prone to out-of-bounds access.
Also fix validate_raid_redundancy() to validate all devices that are
added. Also, use braces to help clean up raid_iterate_devices().
The out-of-bounds memory accesses was discovered using KASAN.
This commit was verified to pass all LVM2 RAID tests (with KASAN
enabled).
Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 90736eb3232d208ee048493f371075e4272e0944 upstream.
Commit 85e123c27d5c ("dm mirror log: round up region bitmap size to
BITS_PER_LONG") introduced a regression on 64-bit architectures in the
lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation.
If the device is shrunk, we need to clear log bits beyond the end of the
device. The code clears bits up to a 32-bit boundary and then calculates
lc->sync_count by summing set bits up to a 64-bit boundary (the commit
changed that; previously, this boundary was 32-bit too). So, it was using
some non-zeroed bits in the calculation and this caused misbehavior.
Fix this regression by clearing bits up to BITS_PER_LONG boundary.
Fixes: 85e123c27d5c ("dm mirror log: round up region bitmap size to BITS_PER_LONG")
Cc: stable@vger.kernel.org
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9ae6e8b1c9bbf6874163d1243e393137313762b7 upstream.
During postsuspend dm-era does the following:
1. Archives the current era
2. Commits the metadata, as part of the RPC call for archiving the
current era
3. Stops the worker
Until the worker stops, it might write to the metadata again. Moreover,
these writes are not flushed to disk immediately, but are cached by the
dm-bufio client, which writes them back asynchronously.
As a result, the committed metadata of a suspended dm-era device might
not be consistent with the in-core metadata.
In some cases, this can result in the corruption of the on-disk
metadata. Suppose the following sequence of events:
1. Load a new table, e.g. a snapshot-origin table, to a device with a
dm-era table
2. Suspend the device
3. dm-era commits its metadata, but the worker does a few more metadata
writes until it stops, as part of digesting an archived writeset
4. These writes are cached by the dm-bufio client
5. Load the dm-era table to another device.
6. The new instance of the dm-era target loads the committed, on-disk
metadata, which don't include the extra writes done by the worker
after the metadata commit.
7. Resume the new device
8. The new dm-era target instance starts using the metadata
9. Resume the original device
10. The destructor of the old dm-era target instance is called and
destroys the dm-bufio client, which results in flushing the cached
writes to disk
11. These writes might overwrite the writes done by the new dm-era
instance, hence corrupting its metadata.
Fix this by committing the metadata after the worker stops running.
stop_worker uses flush_workqueue to flush the current work. However, the
work item may re-queue itself and flush_workqueue doesn't wait for
re-queued works to finish.
This could result in the worker changing the metadata after they have
been committed, or writing to the metadata concurrently with the commit
in the postsuspend thread.
Use drain_workqueue instead, which waits until the work and all
re-queued works finish.
Fixes: eec40579d8487 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to
establish DM_TYPE_NVME_BIO_BASED.
Also, remove 3 'struct mapped_device *md' variables in __map_bio() which
masked the same variable that is available within __map_bio()'s scope.
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 85e123c27d5cbc22cfdc01de1e2ca1d9003a02d0 upstream.
The code in dm-log rounds up bitset_size to 32 bits. It then uses
find_next_zero_bit_le on the allocated region. find_next_zero_bit_le
accesses the bitmap using unsigned long pointers. So, on 64-bit
architectures, it may access 4 bytes beyond the allocated size.
Fix this bug by rounding up bitset_size to BITS_PER_LONG.
This bug was found by running the lvm2 testsuite with kasan.
Fixes: 29121bd0b00e ("[PATCH] dm mirror log: bitset_size fix")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea23994edc4169bd90d7a9b5908c6ccefd82fa40 upstream.
The RAID0 layout is irrelevant if all members have the same size so the
array has only one zone. It is *also* irrelevant if the array has two
zones and the second zone has only one device, for example if the array
has two members of different sizes.
So in that case it makes sense to allow assembly even when the layout is
undefined, like what is done when the array has only one zone.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1e267742283a4b5a8ca65755c44166be27e9aa0f ]
Generally, the md_unregister_thread is called with reconfig_mutex, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread,
so md_unregister_thread can be called simulitaneously from two call sites
in theory.
Then after previous commit which remove the protection of reconfig_mutex
for md_unregister_thread completely, the potential issue could be worse
than before.
Let's take pers_lock at the beginning of function to ensure reentrancy.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 40f567bbb3b0639d2ec7d1c6ad4b1b018f80cf19 upstream.
The function kzalloc() in detached_dev_do_request() can fail, so its
return value should be checked.
Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 64c54d9244a4efe9bc6e9c98e13c4bbb8bb39083 upstream.
The bug is here:
if (!rdev || rdev->desc_nr != nr) {
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each_rcu(), so it is incorrect to assume that the
iterator value will be NULL if the list is empty or no element
found (In fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). Otherwise it will bypass the check
and lead to invalid memory access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'pdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc8738343eefc4ea8afb6122826dea48eacde514 upstream.
The bug is here:
if (!rdev)
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element found.
Otherwise it will bypass the NULL check and lead to invalid memory
access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update")
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 57668f0a4cc4083a120cc8c517ca0055c4543b59 upstream.
Raid456 module had allowed to achieve failed state. It was fixed by
fb73b357fb9 ("raid5: block failing device if raid will be failed").
This fix introduces a bug, now if raid5 fails during IO, it may result
with a hung task without completion. Faulty flag on the device is
necessary to process all requests and is checked many times, mainly in
analyze_stripe().
Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
As a result, this level is allowed to achieve failed state again, but
communication with userspace (via -EBUSY status) will be preserved.
This restores possibility to fail array via #mdadm --set-faulty command
and will be fixed by additional verification on mdadm side.
Reproduction steps:
mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
mkfs.xfs /dev/md126 -f
mount /dev/md126 /mnt/root/
fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
--bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
--time_based --group_reporting --name=throughput-test-job
--eta-newline=1 &
echo 1 > /sys/block/nvme2n1/device/device/remove
echo 1 > /sys/block/nvme1n1/device/device/remove
[ 1475.787779] Call Trace:
[ 1475.793111] __schedule+0x2a6/0x700
[ 1475.799460] schedule+0x38/0xa0
[ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
[ 1475.813856] ? finish_wait+0x80/0x80
[ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
[ 1475.828281] ? finish_wait+0x80/0x80
[ 1475.834727] ? finish_wait+0x80/0x80
[ 1475.841127] ? finish_wait+0x80/0x80
[ 1475.847480] md_handle_request+0x119/0x190
[ 1475.854390] md_make_request+0x8a/0x190
[ 1475.861041] generic_make_request+0xcf/0x310
[ 1475.868145] submit_bio+0x3c/0x160
[ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
[ 1475.882070] iomap_dio_bio_actor+0x175/0x390
[ 1475.889149] iomap_apply+0xff/0x310
[ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.909974] iomap_dio_rw+0x2f2/0x490
[ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.923680] ? atime_needs_update+0x77/0xe0
[ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
[ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
[ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
[ 1475.953403] aio_read+0xd5/0x180
[ 1475.959395] ? _cond_resched+0x15/0x30
[ 1475.965907] io_submit_one+0x20b/0x3c0
[ 1475.972398] __x64_sys_io_submit+0xa2/0x180
[ 1475.979335] ? do_io_getevents+0x7c/0xc0
[ 1475.986009] do_syscall_64+0x5b/0x1a0
[ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1476.000255] RIP: 0033:0x7f11fc27978d
[ 1476.006631] Code: Bad RIP value.
[ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
Cc: stable@vger.kernel.org
Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4caae58406f8ceb741603eee460d79bacca9b1b5 upstream.
The device-mapper framework provides a mechanism to mark targets as
immutable (and hence fail table reloads that try to change the target
type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's
feature flags to prevent switching the verity target with a different
target type.
Fixes: a4ffc152198e ("dm: add verity target")
Cc: stable@vger.kernel.org
Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bfe2b0146c4d0230b68f5c71a64380ff8d361f8b upstream.
dm-stats can be used with a very large number of entries (it is only
limited by 1/4 of total system memory), so add rescheduling points to
the loops that iterate over the entries.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 567dd8f34560fa221a6343729474536aa7ede4fd upstream.
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to
report the current key to userspace. However, this is not a constant-time
operation and it may leak information about the key via timing, via cache
access patterns or via the branch predictor.
Change dm-crypt's key printing to use "%c" instead of "%02x". Also
introduce hex2asc() that carefully avoids any branching or memory
accesses when converting a number in the range 0 ... 15 to an ascii
character.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d3f2a14b8906df913cb04a706367b012db94a6e8 upstream.
The "r" variable shadows an earlier "r" that has function scope. It
means that we accidentally return success instead of an error code.
Smatch has a warning for this:
drivers/md/dm-integrity.c:4503 dm_integrity_ctr()
warn: missing error code 'r'
Fixes: 7eada909bfd7 ("dm: add integrity target")
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f6dc633761006f974701d4c88da71ab68670749 upstream.
Commit d208b89401e0 ("dm: fix mempool NULL pointer race when
completing IO") didn't go far enough.
When bio_end_io_acct ends the count of in-flight I/Os may reach zero
and the DM device may be suspended. There is a possibility that the
suspend races with dm_stats_account_io.
Fix this by adding percpu "pending_io" counters to track outstanding
dm_io. Move kicking of suspend queue to dm_io_dec_pending(). Also,
rename md_in_flight_bios() to dm_in_flight_bios() and update it to
iterate all pending_io counters.
Fixes: d208b89401e0 ("dm: fix mempool NULL pointer race when completing IO")
Cc: stable@vger.kernel.org
Co-developed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 08c1af8f1c13bbf210f1760132f4df24d0ed46d6 upstream.
It is possible to set up dm-integrity in such a way that the
"tag_size" parameter is less than the actual digest size. In this
situation, a part of the digest beyond tag_size is ignored.
In this case, dm-integrity would write beyond the end of the
ic->recalc_tags array and corrupt memory. The corruption happened in
integrity_recalc->integrity_sector_checksum->crypto_shash_final.
Fix this corruption by increasing the tags array so that it has enough
padding at the end to accomodate the loop in integrity_recalc() being
able to write a full digest size for the last member of the tags
array.
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cd9c88da171a62c4b0f1c70e50c75845969fbc18 ]
It appears like cmd could be a Spectre v1 gadget as it's supplied by a
user and used as an array index. Prevent the contents of kernel memory
from being leaked to userspace via speculative execution by using
array_index_nospec.
Signed-off-by: Jordy Zomer <jordy@pwning.systems>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6fc51504388c1a1a53db8faafe9fff78fccc7c87 ]
Explicitly convert unsigned int in the right of the conditional
expression to int to match the left side operand and the return type,
fixing the following compiler warning:
drivers/md/dm-crypt.c:2593:43: warning: signed and unsigned
type in conditional expression [-Wsign-compare]
Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service")
Signed-off-by: Aashish Sharma <shraash@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 85bca3c05b6cca31625437eedf2060e846c4bbad ]
Corrupt metadata could trigger an out of bounds write.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7df835a32a8bedf7ce88efcfa7c9b245b52ff139 ]
Commit b0140891a8cea3 ("md: Fix race when creating a new md device.")
not only moved assigning mddev->gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev->open_mutex critical section over add_disk and creation of the
md kobj. Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev->active reference.
On the other hand taking this lock added a lock order reversal with what
is not disk->open_mutex (used to be bdev->bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.
Fixes: b0140891a8ce ("md: Fix race when creating a new md device.")
Fixes: d62633873590 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream.
On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.
Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.
This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().
Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.
This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.
* Disable swap
* Tune vm settings to promote regular writeback
$ echo 50 > /proc/sys/vm/dirty_expire_centisecs
$ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
$ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
* Create 8 dmcrypt devices based on files on a tmpfs
* Create and mount an ext4 filesystem on each crypt devices
* Run stress-ng --hdd 8 within one of above filesystems
Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():
99.98% 0.00% kworker/u193:46 [kernel.kallsyms] [k] ret_from_fork
|
--ret_from_fork
kthread
worker_thread
|
--99.92%--process_one_work
|
|--80.52%--kcryptd_crypt
| |
| |--62.58%--mempool_alloc
| | |
| | --62.24%--crypt_page_alloc
| | |
| | --61.51%--__percpu_counter_compare
| | |
| | --61.34%--__percpu_counter_sum
| | |
| | |--58.68%--_raw_spin_lock_irqsave
| | | |
| | | --58.30%--native_queued_spin_lock_slowpath
| | |
| | --0.69%--cpumask_next
| | |
| | --0.51%--_find_next_bit
| |
| |--10.61%--crypt_convert
| | |
| | |--6.05%--xts_crypt
...
After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.
|--8.15%--mempool_alloc
| |
| |--3.93%--crypt_page_alloc
| | |
| | --3.75%--__alloc_pages
| | |
| | --3.62%--get_page_from_freelist
| | |
| | --3.22%--rmqueue_bulk
| | |
| | --2.59%--_raw_spin_lock
| | |
| | --2.57%--native_queued_spin_lock_slowpath
| |
| --3.05%--_raw_spin_lock_irqsave
| |
| --2.49%--native_queued_spin_lock_slowpath
Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 224b0683228c5f332f9cee615d85e75e9a347170 ]
Except for the IDA none of the allocations in bcache_device_init is
unwound on error, fix that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5ba03936c05584b6f6f79be5ebe7e5036c1dd252 upstream.
Similar to [1], this patch fixes the same bug in raid10. Also cleanup the
comments.
[1] commit 2417b9869b81 ("md/raid1: properly indicate failure when ending
a failed write request")
Cc: stable@vger.kernel.org
Fixes: 7cee6d4e6035 ("md/raid10: end bio when the device faulty")
Signed-off-by: Wei Shuyu <wsy@dogben.com>
Acked-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 054bee16163df023e2589db09fd27d81f7ad9e72 upstream.
LVM doesn't like it when the target returns different values from what
was set in the constructor. Fix dm-writecache so that the returned
table values are exactly the same as requested values.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b6e58b5466b2959f83034bead2e2e1395cca8aeb upstream.
remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().
For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:
general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
Hardware name: QEMU Standard PC
RIP: 0010:metadata_ll_load_ie+0x14/0x30
Call Trace:
sm_metadata_count_is_more_than_one+0xb9/0xe0
dm_tm_shadow_block+0x52/0x1c0
shadow_step+0x59/0xf0
remove_raw+0xb2/0x170
dm_btree_remove+0xf4/0x1c0
dm_pool_delete_thin_device+0xc3/0x140
pool_message+0x218/0x2b0
target_message+0x251/0x290
ctl_ioctl+0x1c4/0x4d0
dm_ctl_ioctl+0xe/0x20
__x64_sys_ioctl+0x7b/0xb0
do_syscall_64+0x40/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixing it by only assign new_root when removal succeeds
Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5faafc77f7de69147d1e818026b9a0cbf036a7b2 ]
Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device. There are a couple
of repercussions to this:
- The first allocation after the commit is likely to take longer than
normal as it searches for a free block in an area that is likely to
have very few free blocks (if any).
- Any free blocks it finds will have been recently freed. Reusing them
means we have fewer old copies of the metadata to aid recovery from
hardware error.
Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0c1f3193b1cdd21e7182f97dc9bca7d284d18a15 ]
The third parameter of module_param() is permissions for the sysfs node
but it looks like it is being used as the initial value of the parameter
here. In fact, false here equates to omitting the file from sysfs and
does not affect the value of require_signatures.
Making the parameter writable is not simple because going from
false->true is fine but it should not be possible to remove the
requirement to verify a signature. But it can be useful to inspect the
value of this parameter from userspace, so change the permissions to
make a read-only file in sysfs.
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7e768532b2396bcb7fbf6f82384b85c0f1d2f197 upstream.
If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().
Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits
into "unsigned" type.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c699a0db2d62e3bbb7f0bf35c87edbc8d23e3062 upstream.
The following commands will crash the kernel:
modprobe brd rd_size=1048576
dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
The reason is that when we test for zero chunk size, we jump to the label
bad_read_metadata without setting the "r" variable. The function
snapshot_ctr destroys all the structures and then exits with "r == 0". The
kernel then crashes because it falsely believes that snapshot_ctr
succeeded.
In order to fix the bug, we set the variable "r" to -EINVAL.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7abfabaf5f805f5171d133ce6af9b65ab766e76a upstream.
Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.
So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
like: unused devices: <none>
Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream.
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.
This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find & mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).
For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.
commit 8b57251f9a91f5e5a599de7549915d2d226cc3af upstream.
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".
Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>