19 Commits

Author SHA1 Message Date
Luben Tuikov
10a135969f drm/amdgpu: Fix amdgpu_ras_eeprom_init()
[ Upstream commit dce4400e6516d18313d23de45b5be8a18980b00e ]

No need to account for the 2 bytes of EEPROM
address--this is now well abstracted away by
the fixes the the lower layers.

Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-18 13:40:17 +02:00
John Clements
4bfb74282f drm/amdgpu: added RAS EEPROM device support check
updated RAS EEPROM init/threshold sequences to check for device support

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:29:18 -04:00
Guchun Chen
9b856defbe drm/amdgpu: update eeprom once specifying one bigger threshold(v3)
During driver's probe, when it hits bad gpu tag in eeprom i2c
init calling(the tag was set when reported bad page reaches
bad page threshold in last driver's working loop), there are
some strategys to deal with the cases:

1. when the module parameter amdgpu_bad_page_threshold = 0,
that means page retirement feature is disabled, so just resetting
the eeprom is fine.
2. When amdgpu_bad_page_threshold is not 0, and moreover, user
sets one bigger valid data in order to make current boot up
succeeds, correct eeprom header tag and do not break booting.
3. For other cases, driver's probe will be broken.

v2: Just update eeprom header tag instead of resetting the whole
    table header when user sets one bigger threshold data.

v3: Use dev_info/dev_err to print PCI device information, which
    helps in mGPU case.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:25 -04:00
Guchun Chen
e8fbaf0342 drm/amdgpu: break GPU recovery once it's in bad state(v4)
When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:54 -04:00
Guchun Chen
9c06f91ff2 drm/amdgpu: schedule ras recovery when reaching bad page threshold(v2)
Once the bad page saved to eeprom reaches the configured
threshold, ras recovery will be issued to notify user.

v2: Fix spelling typo.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:46 -04:00
Guchun Chen
b82e65a935 drm/amdgpu: break driver init process when it's bad GPU(v5)
When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:31 -04:00
Guchun Chen
1d6a9d122d drm/amdgpu: add bad gpu tag definition
This tag will be hired for bad gpu detection in eeprom's access.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:16 -04:00
Guchun Chen
c84d46707e drm/amdgpu: validate bad page threshold in ras(v3)
Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
saved bad pages exceed threshold value, and proceed corresponding
actions.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:25:58 -04:00
Andrey Grodzovsky
9015d60c9e drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device
Puts the i2c adapter in common place for sharing by RAS
and upcoming data read from FRU EEPROM feature.

v2:
Move i2c adapter to amdgpu_pm and rename it.

v3: Move i2c adapter init to ASIC specific code and get rid
of the switch case in amdgpu_device

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-16 16:21:32 -04:00
John Clements
ef1caf48bd drm/amdgpu: Add Arcturus D342 page retire support
Check Arcturus SKU type to select I2C address of page retirement EEPROM

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:17:32 -05:00
John Clements
5985ebbe78 drm/amdgpu: Resolved offchip EEPROM I/O issue
Updated target I2C address

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-26 12:20:22 -05:00
Andrey Grodzovsky
cf52ecc8b6 drm/amdgpu: Use ARCTURUS in RAS EEPROM.
Add Arcturus EEPROM/I2C support in generic EEPROM code.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-25 16:50:10 -04:00
Guchun Chen
5222d26146 drm/amdgpu: remove redundant variable definition
No need to define the same variables in each loop of the function.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:10:59 -05:00
Andrey Grodzovsky
db338e1663 drm/amdgpu:Fix EEPROM checksum calculation.
Fix checksum calculation after manually resetting the table.
Unify reset and empty EEPROM init flow.
Protect the table reset with lock.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 15:30:44 -05:00
Andrey Grodzovsky
d01b400b1a drm/amdgpu: Add amdgpu_ras_eeprom_reset_table
This will allow to reset the table on the fly.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:06:27 -05:00
Andrey Grodzovsky
4bc2234077 drm/madgpu: Fix EEPROM Checksum calculation.
Fix typo which messed up the calculation.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:19 -05:00
Colin Ian King
92ead9fa6f drm/amdgpu: fix spelling mistake "jumpimng" -> "jumping"
There is a spelling mistake in a DRM_DEBUG_DRIVER debug message.
Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-29 15:52:32 -05:00
Andrey Grodzovsky
691bac9d09 drm/amdgpu: Vega20 SMU I2C HW engine controller.
Implement HW I2C enigne controller to be used by the RAS EEPROM
table manager. This is based on code from ATITOOLs.

v2:
Rename the file and all function prefixes to smu_v11_0_i2c

By Luben's observation always fill the TX fifo to full so
we don't have garbadge interpreted by the slave as valid data.

v3:
Remove preemption disable as the HW I2C controller will not
stop the clock on empty TX fifo and so it's not critical to
keep not empty queue.
Switch to fast mode 400 khz SCL clock for faster read and write.

v5:
Restore clock gating before releasing I2C bus and fix some
style comments.

v6:
squash in warning fix, fix includes (Alex)

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27 09:17:35 -05:00
Andrey Grodzovsky
64f55e6292 drm/amdgpu: Add RAS EEPROM table.
Add RAS EEPROM table manager to eanble RAS errors to be stored
upon appearance and retrived on driver load.

v2: Fix some prints.

v3:
Fix checksum calculation.
Make table record and header structs packed to do correct byte value sum.
Fix record crossing EEPROM page boundry.

v4:
Fix byte sum val calculation for record - look at sizeof(record).
Fix some style comments.

v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27 08:17:14 -05:00