Heming Zhao fff42f2138 md-cluster: fix hanging issue while a new disk adding
The commit 1bbe254e4336 ("md-cluster: check for timeout while a
new disk adding") is syntactically correct but does not suit the
real clustered code logic.

When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node is left waiting
to grab the EX lock. This causes the cluster to hang indefinitely.

How to fix:

1. In dlm_lock_sync(), change the wait behaviour from waiting forever
   to waiting with a timeout. This avoids the hang when another node
   fails to handle a cluster msg. A side effect of this change: if
   another node receives an unknown msg (e.g. a new msg_type), the
   old code hangs, whereas the new code times out and fails. This
   could help cluster_md handle a new msg_type sent between nodes
   running different kernel/module versions (e.g. the user only
   updates one leg's kernel and monitors the stability of the new
   kernel).
2. The old code for __sendmsg() always returns 0 (success), because
   by design it must successfully unlock ->message_lockres. This
   commit makes the function return an error number when an error
   occurs.

Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
2024-07-12 01:30:17 +00:00