linux

iv/linux

Zhao Heming 2fb550de75 md/cluster: fix deadlock when node is doing resync job

commit bca5b0658020be90b6b504ca514fd80110204f71 upstream.

md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can send messages
exclusively. While sending a message, a node can concurrently receive a
message from another node. When a node is doing a resync job, grabbing
token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED.
   This blocks other nodes from sending messages.

b. B runs a sync job, which sends RESYNCING at intervals.
   This is blocked waiting to hold the token_lockres:EX lock.

c. B runs "mdadm --remove", which sends REMOVE.
   This is blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B receives the METADATA_UPDATED msg, which was sent by A in step <a>.
   This is blocked by step <c>: the mddev lock is already held, so
   wait_event can't take the mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
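
The four steps can be read as a wait-for graph. The following is a minimal userspace sketch, not kernel code: the actor names and the helpers `scenario_from_commit` and `has_cycle` are hypothetical, and the edge from A back to B's receive thread assumes that A keeps token_lockres:EX until B has processed the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actors taken from steps a-d; NOBODY means "not waiting". */
enum actor { A_SEND, B_RESYNC, B_REMOVE, B_RECV, NR_ACTORS, NOBODY = -1 };

/* Fill in who waits on whom in the scenario described above. */
static void scenario_from_commit(int waits_on[NR_ACTORS])
{
	assert(waits_on != NULL);
	waits_on[A_SEND]   = B_RECV;   /* assumed: A holds the token until B handles the msg */
	waits_on[B_RESYNC] = A_SEND;   /* step b: needs token_lockres:EX held by A */
	waits_on[B_REMOVE] = B_RESYNC; /* step c: needs MD_CLUSTER_SEND_LOCK */
	waits_on[B_RECV]   = B_REMOVE; /* step d: needs the mddev lock */
}

/* Each actor waits on at most one other, so following the edges suffices. */
static bool has_cycle(const int waits_on[NR_ACTORS])
{
	for (int start = 0; start < NR_ACTORS; start++) {
		int cur = start;

		for (int hops = 0; hops < NR_ACTORS; hops++) {
			cur = waits_on[cur];
			if (cur == NOBODY)
				break;
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

In this model the graph built by `scenario_from_commit` contains a cycle through all four actors. Clearing the edge out of `B_RECV` removes the cycle, which is the edge the fix targets.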

There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg").
In that commit, step c is "update sb". In this patch, step c is
"mdadm --remove".

To fix this issue, we can follow the approach used by
metadata_update_start, which performs the same lock_token action.
lock_comm can use the same steps to avoid the deadlock, by moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
This slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks
the deadlock.
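
A simplified userspace model of the idea, under stated assumptions: `struct node_state`, `recv_can_proceed` and `lock_comm_enter` are hypothetical names, the booleans stand in for the real mutex and state bit, and the `mddev_locked` and `with_fix` parameters are illustrative only. It is a sketch of the ordering change, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-node state discussed above. */
struct node_state {
	bool reconfig_mutex_held;     /* mddev lock taken by the mdadm path */
	bool holding_mutex_for_recvd; /* models MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD */
};

/*
 * Models the check from step d:
 * mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
 */
static bool recv_can_proceed(const struct node_state *s)
{
	assert(s != NULL);
	bool trylock_ok = !s->reconfig_mutex_held;

	return trylock_ok || s->holding_mutex_for_recvd;
}

/*
 * Models the entry of lock_comm. With the fix, a caller that already
 * holds the mddev lock announces this before it starts waiting for
 * MD_CLUSTER_SEND_LOCK, so the receive thread is allowed to continue.
 */
static void lock_comm_enter(struct node_state *s, bool mddev_locked,
			    bool with_fix)
{
	assert(s != NULL);
	if (with_fix && mddev_locked)
		s->holding_mutex_for_recvd = true;
	/* The real code would now wait for the send lock, then the token. */
}
```

With `reconfig_mutex_held` set and `with_fix` false, `recv_can_proceed` stays false, matching step d. With `with_fix` true it becomes true before the sender blocks, so the receive side can finish and the remaining waiters can be released.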

Repro steps (I triggered it only 3 times in hundreds of test runs):

Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

The test script hangs when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

Finally, thanks to Xiao for the solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2020-12-30 11:51:45 +01:00

arch

um: Remove use of asprinf in umid.c

2020-12-30 11:51:39 +01:00

block

blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick

2020-12-30 11:50:54 +01:00

certs

PKCS#7: Refactor verify_pkcs7_signature()

2019-08-05 18:40:18 -04:00

crypto

crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()

2020-12-30 11:51:35 +01:00

Documentation

KVM: mmu: Fix SPTE encoding of MMIO generation upper half

2020-12-21 13:27:06 +01:00

drivers

md/cluster: fix deadlock when node is doing resync job

2020-12-30 11:51:45 +01:00

fs

jfs: Fix array index bounds check in dbAdjTree

2020-12-30 11:51:40 +01:00

include

binder: add flag to clear buffer on txn complete

2020-12-30 11:51:35 +01:00

init

initramfs: fix clang build failure

2020-12-30 11:51:30 +01:00

ipc

ipc/util.c: sysvipc_find_ipc() incorrectly updates position index

2020-05-20 08:20:16 +02:00

kernel

cpuset: fix race between hotplug work and later CPU offline

2020-12-30 11:51:36 +01:00

lib

lib/syscall: fix syscall registers retrieval on 32-bit platforms

2020-12-11 13:23:32 +01:00

LICENSES

LICENSES: Rename other to deprecated

2019-05-03 06:34:32 -06:00

mm

mm: don't wake kswapd prematurely when watermark boosting is disabled

2020-12-30 11:51:27 +01:00

net

xprtrdma: Fix XDRBUF_SPARSE_PAGES support

2020-12-30 11:51:38 +01:00

samples

samples: bpf: Fix lwt_len_hist reusing previous BPF map

2020-12-30 11:51:12 +01:00

scripts

kconfig: fix return value of do_error_if()

2020-12-30 11:51:29 +01:00

security

ima: Don't modify file descriptor mode on the fly

2020-12-30 11:51:39 +01:00

sound

ASoC: cx2072x: Fix doubly definitions of Playback and Capture streams

2020-12-30 11:51:35 +01:00

tools

perf probe: Fix memory leak when synthesizing SDT probes

2020-12-30 11:51:29 +01:00

usr

initramfs: restore default compression behavior

2020-04-08 09:08:38 +02:00

virt

KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace

2020-12-02 08:49:46 +01:00

.clang-format

clang-format: Update with the latest for_each macro list

2019-08-31 10:00:51 +02:00

.cocciconfig

…

.get_maintainer.ignore

Opt out of scripts/get_maintainer.pl

2019-05-16 10:53:40 -07:00

.gitattributes

…

.gitignore

Modules updates for v5.4

2019-09-22 10:34:46 -07:00

.mailmap

ARM: SoC fixes

2019-11-10 13:41:59 -08:00

COPYING

COPYING: use the new text with points to the license files

2018-03-23 12:41:45 -06:00

CREDITS

MAINTAINERS: Remove Simon as Renesas SoC Co-Maintainer

2019-10-10 08:12:51 -07:00

Kbuild

kbuild: do not descend to ./Kbuild when cleaning

2019-08-21 21:03:58 +09:00

Kconfig

docs: kbuild: convert docs to ReST and rename to *.rst

2019-06-14 14:21:21 -06:00

MAINTAINERS

Documentation/llvm: add documentation on building w/ Clang/LLVM

2020-08-26 10:40:46 +02:00

Makefile

Linux 5.4.85

2020-12-21 13:27:07 +01:00

README

Drop all 00-INDEX files from Documentation/

2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.

Languages

C 97.6%

Assembly 1%

Shell 0.5%

Python 0.3%

Makefile 0.3%