accel/habanalabs: abort device reset for consecutive heartbeat failures

The mechanism that aborts device reset after consecutive fatal errors
currently covers only fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures are also
considered fatal, so add them to this mechanism as well to avoid
recurring device resets in such cases.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Tomer Tayar 2023-12-25 00:28:36 +02:00 committed by Oded Gabbay
parent d0df8a35a7
commit 246d8b6cfb

@@ -1769,14 +1769,16 @@ kill_processes:
 	hdev->device_cpu_disabled = false;
 	hdev->reset_info.hard_reset_pending = false;
+	/*
+	 * Put the device in an unusable state if there are 2 back to back resets due to
+	 * fatal errors.
+	 */
 	if (hdev->reset_info.reset_trigger_repeated &&
-	    (hdev->reset_info.prev_reset_trigger ==
-	     HL_DRV_RESET_FW_FATAL_ERR)) {
-		/* if there 2 back to back resets from FW,
-		 * ensure driver puts the driver in a unusable state
-		 */
+	    (hdev->reset_info.prev_reset_trigger == HL_DRV_RESET_FW_FATAL_ERR ||
+	     hdev->reset_info.prev_reset_trigger ==
+	     HL_DRV_RESET_HEARTBEAT)) {
 		dev_crit(hdev->dev,
-			"%s Consecutive FW fatal errors received, stopping hard reset\n",
+			"%s Consecutive fatal errors, stopping hard reset\n",
 			dev_name(&(hdev)->pdev->dev));
 		rc = -EIO;
 		goto out_err;
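
For illustration only, the condition this patch extends can be sketched as a
standalone C fragment. This is a minimal sketch, not the driver's code: the
TRIGGER_* values, struct reset_info and both helper functions are hypothetical
stand-ins, and the way reset_trigger_repeated is updated in record_trigger()
is an assumption, since that bookkeeping is not part of this diff. Only the
shape of the check in should_abort_reset() follows the patched condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's HL_DRV_RESET_* trigger flags. */
enum reset_trigger {
	TRIGGER_NONE         = 0,
	TRIGGER_FW_FATAL_ERR = 1 << 0,
	TRIGGER_HEARTBEAT    = 1 << 1,
	TRIGGER_USER         = 1 << 2,
};

/* Hypothetical reduced form of the reset bookkeeping used by the check. */
struct reset_info {
	enum reset_trigger prev_reset_trigger;
	bool reset_trigger_repeated;
};

/*
 * Assumed bookkeeping: a trigger counts as repeated when it equals the
 * trigger of the previous reset.
 */
static void record_trigger(struct reset_info *info, enum reset_trigger cur)
{
	assert(info != NULL);
	info->reset_trigger_repeated = (info->prev_reset_trigger == cur);
	info->prev_reset_trigger = cur;
}

/*
 * Same shape as the patched condition: abort the hard reset when the
 * repeated trigger is a FW fatal error or a heartbeat failure.
 */
static bool should_abort_reset(const struct reset_info *info)
{
	assert(info != NULL);
	return info->reset_trigger_repeated &&
	       (info->prev_reset_trigger == TRIGGER_FW_FATAL_ERR ||
		info->prev_reset_trigger == TRIGGER_HEARTBEAT);
}
```

Under these assumptions, two back to back heartbeat triggers make
should_abort_reset() return true, while two back to back triggers of any other
kind, such as the hypothetical TRIGGER_USER, do not.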