linux

iv/linux

Author	SHA1	Message	Date
Tao Zhou	22106ed0be	drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err bad_page_threshold controls page retirement behavior and it should be also checked. v2: simplify the condition of bad page handling path. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-02-23 17:35:59 -05:00
Tao Zhou	f3cbe70e21	drm/amdgpu: change default behavior of bad_page_threshold parameter Ignore ras umc bad page threshold by default, GPU initialization won't be stopped in this mode. v2: refine the description of bad_page_threshold. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-02-23 17:35:59 -05:00
Luben Tuikov	64a3dbb06a	drm/amdgpu: Add support for RAS table at 0x40000 Add support for RAS table at I2C EEPROM address of 0x40000, since on some ASICs it is not at 0, but at 0x40000. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Kent Russell <kent.russell@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Tested-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-11-17 18:07:33 -05:00
Luben Tuikov	3b8164f808	drm/amdgpu: Decouple RAS EEPROM addresses from chips Abstract RAS I2C EEPROM addresses from chip names, and set their macro definition names to the address they set, not the chip they attach to. Since most chips either use I2C EEPROM address 0 or 40000h for the RAS table start offset, this leaves us with only two macro definitions as opposed to five, and removes the redundancy of four. Cc: Candice Li <candice.li@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-11-09 17:41:42 -05:00
Luben Tuikov	da858deab8	drm/amdgpu: Remove redundant I2C EEPROM address Remove redundant EEPROM_I2C_MADDR_54H address, since we already have it represented (ARCTURUS), and since we don't include the I2C device type identifier in EEPROM memory addresses, i.e. that high up in the device abstraction--we only use EEPROM memory addresses, as memory is continuously represented by EEPROM device(s) on the I2C bus. Add a comment describing what these memory addresses are, how they come about and how they're usually extracted from the device address byte. Cc: Candice Li <candice.li@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: `c9bdc6c3cf` ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-11-09 17:41:42 -05:00
Candice Li	c9bdc6c3cf	drm/amdgpu: Add EEPROM I2C address support for ip discovery 1. Update EEPROM_I2C_MADDR_SMU_13_0_0 to EEPROM_I2C_MADDR_54H 2. Add EEPROM I2C address support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-10-27 15:12:09 -04:00
Candice Li	bc22f8ec46	drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10 Enable RAS EEPROM support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-10-27 15:12:08 -04:00
Candice Li	1582252946	drm/amdgpu: Add EEPROM I2C address for smu v13_0_0 Set correct EEPROM I2C address for smu v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-09-13 14:32:58 -04:00
Stanley.Yang	69691c8235	drm/amdgpu: message smu to update bad channel info It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-15 14:25:16 -04:00
Dave Airlie	54f43c17d6	Merge tag 'drm-misc-next-2022-02-23' of git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v5.18: UAPI Changes: Cross-subsystem Changes: - Split out panel-lvds and lvds dt bindings . - Put yes/no on/off disabled/enabled strings in linux/string_helpers.h and use it in drivers and tomoyo. - Clarify dma_fence_chain and dma_fence_array should never include eachother. - Flatten chains in syncobj's. - Don't double add in fbdev/defio when page is already enlisted. - Don't sort deferred-I/O pages by default in fbdev. Core Changes: - Fix missing pm_runtime_put_sync in bridge. - Set modifier support to only linear fb modifier if drivers don't advertise support. - As a result, we remove allow_fb_modifiers. - Add missing clear for EDID Deep Color Modes in drm_reset_display_info. - Assorted documentation updates. - Warn once in drm_clflush if there is no arch support. - Add missing select for dp helper in drm_panel_edp. - Assorted small fixes. - Improve fb-helper's clipping handling. - Don't dump shmem mmaps in a core dump. - Add accounting to ttm resource manager, and use it in amdgpu. - Allow querying the detected eDP panel through debugfs. - Add helpers for xrgb8888 to 8 and 1 bits gray. - Improve drm's buddy allocator. - Add selftests for the buddy allocator. Driver Changes: - Add support for nomodeset to a lot of drm drivers. - Use drm_module_*_driver in a lot of drm drivers. - Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau, bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86. - Add bridge/it6505. - Create DP and DVI-I connectors in ast. - Assorted nouveau backlight fixes. - Rework amdgpu reset handling. - Add dt bindings for ingenic,jz4780-dw-hdmi. - Support reading edid through aux channel in ingenic. - Add a drm driver for Solomon SSD130x OLED displays. - Add simple support for sharp LQ140M1JW46. - Add more panels to nt35560. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/686ec871-e77f-c230-22e5-9e3bb80f064a@linux.intel.com	2022-02-25 05:50:18 +10:00
Stanley.Yang	8bbd4d83a6	drm/amdgpu: Reset OOB table error count info The OOB table error count info should be reset after reset eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:13:00 -05:00
Andrey Grodzovsky	d0fb18b535	drm/amdgpu: Move reset sem into reset_domain We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html	2022-02-09 12:17:32 -05:00
Luben Tuikov	2f60dd5076	drm/amd: Expose the FRU SMU I2C bus Expose both SMU I2C buses. Some boards use the same bus for both the RAS and FRU EEPROMs and others use different buses. This enables the additional I2C bus and sets the right buses to use for RAS and FRU EEPROM access. Cc: Roy Sun <Roy.Sun@amd.com> Co-developed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-01-27 15:49:48 -05:00
Kent Russell	68daadf3d6	drm/amdgpu: Add kernel parameter support for ignoring bad page threshold When a GPU hits the bad_page_threshold, it will not be initialized by the amdgpu driver. This means that the table cannot be cleared, nor can information gathering be performed (getting serial number, BDF, etc). If the bad_page_threshold kernel parameter is set to -2, continue to initialize the GPU, while printing a warning to dmesg that this action has been done v2: squash in Luben's fix to restore RAS info reporting Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-28 14:26:12 -04:00
Kent Russell	8483fdfea7	drm/amdgpu: Warn when bad pages approaches 90% threshold dmesg doesn't warn when the number of bad pages approaches the threshold for page retirement. WARN when the number of bad pages is at 90% or greater for easier checks and planning, instead of waiting until the GPU is full of bad pages. Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-28 14:26:12 -04:00
Kent Russell	dcd5ea9f94	drm/amdgpu: Clarify error when hitting bad page threshold Change the error message when the bad_page_threshold is reached, explicitly stating that the GPU will not be initialized. Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mukul Joshi <Mukul.Joshi@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-20 11:43:57 -04:00
Michel Dänzer	114518ff3b	drm/amdgpu: Drop inline from amdgpu_ras_eeprom_max_record_count This was unusual; normally, inline functions are declared static as well, and defined in a header file if used by multiple compilation units. The latter would be more involved in this case, so just drop the inline declaration for now. Fixes compile failure building for ppc64le on RHEL 8: In file included from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:32, from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:33: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_recovery_init’: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h:90:17: error: inlining failed in call to ‘always_inline’ ‘amdgpu_ras_eeprom_max_record_count’: function body not available 90 \| inline uint32_t amdgpu_ras_eeprom_max_record_count(void); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1985:34: note: called from here 1985 \| max_eeprom_records_len = amdgpu_ras_eeprom_max_record_count(); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: `c84d46707e` "drm/amdgpu: validate bad page threshold in ras(v3)" Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-09-16 09:56:24 -04:00
Luben Tuikov	cc947bf91b	drm/amdgpu: Process any VBIOS RAS EEPROM address We can now process any RAS EEPROM address from VBIOS. Generalize so as to compute the top three bits of the 19-bit EEPROM address, from any byte returned as the "i2c address" from VBIOS. Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-08-30 14:59:33 -04:00
John Clements	14fb496a84	drm/amdgpu: set RAS EEPROM address from VBIOS update to latest atombios fw table Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-08-05 21:17:59 -04:00
Dan Carpenter	64598e23de	drm/amdgpu: return -EFAULT if copy_to_user() fails If copy_to_user() fails then this should return -EFAULT instead of -EINVAL. Fixes: `c65b0805e7` ("drm/amdgpu: RAS EEPROM table is now in debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-08 17:47:28 -04:00
Dan Carpenter	b8badd507a	drm/amdgpu: unlock on error in amdgpu_ras_debugfs_table_read() This error path needs to unlock before returning. While we're at it, the correct error code from copy_to_user() failure is -EFAULT, not -EINVAL. Fixes: `c65b0805e7` ("drm/amdgpu: RAS EEPROM table is now in debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-08 17:47:16 -04:00
Dan Carpenter	3006c92455	drm/amdgpu: fix a signedness bug in __verify_ras_table_checksum() If amdgpu_eeprom_read() returns a negative error code then the error handling checks: if (res < buf_size) { The problem is that "buf_size" is a u32 so negative values are type promoted to a high positive values and the condition is false. Fix this by changing the type of "buf_size" to int. Fixes: `63d4c081a5` ("drm/amdgpu: Optimize EEPROM RAS table I/O") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-08 15:17:02 -04:00
Alex Deucher	d456f3875a	drm/amdgpu: fix 64 bit divide in eeprom code pos is 64 bits. Fixes: `c65b0805e7` ("drm/amdgpu: RAS EEPROM table is now in debugfs") Cc: luben.tuikov@amd.com Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	c65b0805e7	drm/amdgpu: RAS EEPROM table is now in debugfs Add "ras_eeprom_size" file in debugfs, which reports the maximum size allocated to the RAS table in EEROM, as the number of bytes and the number of records it could store. For instance, $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size 262144 bytes or 10921 records $_ Add "ras_eeprom_table" file in debugfs, which dumps the RAS table stored EEPROM, in a formatted way. For instance, $cat ras_eeprom_table Signature Version FirstOffs Size Checksum 0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA Index Offset ErrType Bank/CU TimeStamp Offs/Addr MemChl MCUMCID RetiredPage 0 0x00014 ue 0x00 0x00000000607608DC 0x000000000000 0x00 0x00 0x000000000000 1 0x0002C ue 0x00 0x00000000607608DC 0x000000001000 0x00 0x00 0x000000000001 2 0x00044 ue 0x00 0x00000000607608DC 0x000000002000 0x00 0x00 0x000000000002 3 0x0005C ue 0x00 0x00000000607608DC 0x000000003000 0x00 0x00 0x000000000003 4 0x00074 ue 0x00 0x00000000607608DC 0x000000004000 0x00 0x00 0x000000000004 5 0x0008C ue 0x00 0x00000000607608DC 0x000000005000 0x00 0x00 0x000000000005 6 0x000A4 ue 0x00 0x00000000607608DC 0x000000006000 0x00 0x00 0x000000000006 7 0x000BC ue 0x00 0x00000000607608DC 0x000000007000 0x00 0x00 0x000000000007 8 0x000D4 ue 0x00 0x00000000607608DD 0x000000008000 0x00 0x00 0x000000000008 $_ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	63d4c081a5	drm/amdgpu: Optimize EEPROM RAS table I/O Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	017dad64db	drm/amdgpu: Get rid of test function The code is now tested from userspace. Remove already macroed out test function. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	0686627b3f	drm/amdgpu: Some renames Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	d7edde3dea	drm/amdgpu: Nerf buff buff --> buf. Essentially buffer abbreviates to buf, remove 1/2 of it, or just the iron part, as opposed to just the Er, Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	e4e6a58935	drm/amdgpu: Use explicit cardinality for clarity RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <guchun.chen@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	803c6ebdd3	drm/amdgpu: Simplify RAS EEPROM checksum calculations Rename update_table_header() to write_table_header() as this function is actually writing it to EEPROM. Use kernel types; use u8 to carry around the checksum, in order to take advantage of arithmetic modulo 8-bits (256). Tidy up to 80 columns. When updating the checksum, just recalculate the whole thing. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	dce4400e65	drm/amdgpu: Fix amdgpu_ras_eeprom_init() No need to account for the 2 bytes of EEPROM address--this is now well abstracted away by the fixes the the lower layers. Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	cf696091d3	drm/amdgpu: Return result fix in RAS The low level EEPROM write method, doesn't return 1, but the number of bytes written. Thus do not compare to 1, instead, compare to greater than 0 for success. Other cleanup: if the lower layers returned -errno, then return that, as opposed to overwriting the error code with one-fits-all -EINVAL. For instance, some return -EAGAIN. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:41 -04:00
Luben Tuikov	16ef797737	drm/amdgpu: EEPROM: add explicit read and write Add explicit amdgpu_eeprom_read() and amdgpu_eeprom_write() for clarity. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Luben Tuikov	1fab841ff6	drm/amdgpu: RAS xfer to read/write Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Luben Tuikov	a43996573a	drm/amdgpu: Rename misspelled function Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Luben Tuikov	c28aa44de8	drm/amdgpu: RAS: EEPROM --> RAS In amdgpu_ras_eeprom.c--the interface from RAS to EEPROM, rename macros from EEPROM to RAS, to indicate that the quantities and objects are RAS specific, not EEPROM. We can decrease the RAS table, or put it in different offset of EEPROM as needed in the future. Remove EEPROM_ADDRESS_SIZE macro definition, equal to 2, from the file and calculations, as that quantity is computed and added on the stack, in the lower layer, amdgpu_eeprom_xfer(). Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Luben Tuikov	edb63a5308	drm/amdgpu: Fix wrap-around bugs in RAS Fix the size of the EEPROM from 256000 bytes to 262144 bytes (256 KiB). Fix a couple or wrap around bugs. If a valid value/address is 0 <= addr < size, the inverse of this inequality (barring negative values which make no sense here) is addr >= size. Fix this in the RAS code. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Luben Tuikov	ccdfbfec9e	drm/amdgpu: RAS and FRU now use 19-bit I2C address Convert RAS and FRU code to use the 19-bit I2C memory address and remove all "slave_addr", as this is now absolved into the 19-bit address. Cc: Jean Delvare <jdelvare@suse.de> Cc: John Clements <john.clements@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-01 00:24:40 -04:00
Alex Deucher	39ed82d1d9	drm/amdgpu: i2c subsystem uses 7 bit addresses Convert from 8 bit to 7 bit. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>	2021-07-01 00:24:39 -04:00
Alex Deucher	24f55c0559	drm/amdgpu/ras: switch ras eeprom handling to use generic helper Use the new helper rather than doing i2c transfers directly. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>	2021-07-01 00:24:39 -04:00
John Clements	52a9df8180	drm/amdgpu: enable ras eeprom on aldebaran enable ras eeprom loading by default on aldebaran Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-04-09 16:54:53 -04:00
Stanley.Yang	970fd19764	drm/amdgpu: fix send ras disable cmd when asic not support ras cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-03-23 23:30:12 -04:00
shaoyunl	e3c1b0712f	drm/amdgpu: Reset the devices in the XGMI hive duirng probe In passthrough configuration, hypervisior will trigger the SBR(Secondary bus reset) to the devices without sync to each other. This could cause device hang since for XGMI configuration, all the devices within the hive need to be reset at a limit time slot. This serial of patches try to solve this issue by co-operate with new SMU which will only do minimum house keeping to response the SBR request but don't do the real reset job and leave it to driver. Driver need to do the whole sw init and minimum HW init to bring up the SMU and trigger the reset(possibly BACO) on all the ASICs at the same time Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Acked-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-03-23 23:10:38 -04:00
Dennis Li	11003c68b1	drm/amdgpu: remove unnecessary reading for epprom header If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-02-26 17:23:49 -05:00
John Clements	3e7bc83e31	drm/amdgpu: enable ras eeprom support for sienna cichlid added I2C address and asic support flag Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-01-05 11:35:46 -05:00
Deepak R Varma	f3729f7b1a	drm/amdgpu/amdgpu: improve code indentation and alignment General code indentation and alignment changes such as replace spaces by tabs or align function arguments as per the coding style guidelines. The patch corrects issues for various amdgpu_*.c files for this driver. Issue reported by checkpatch script. Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-11-02 15:34:29 -05:00
Dennis Li	5eeb45934c	drm/amdgpu: remove redundant GPU reset Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-10-30 00:57:23 -04:00
Dennis Li	40e7ed973a	drm/amdgpu: protect eeprom update from GPU reset because i2c is unstable in GPU reset, driver need protect eeprom update from GPU reset, to not miss any bad page record. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-10-21 16:13:43 -04:00
John Clements	4bfb74282f	drm/amdgpu: added RAS EEPROM device support check updated RAS EEPROM init/threshold sequences to check for device support Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-04 17:29:18 -04:00
Guchun Chen	9b856defbe	drm/amdgpu: update eeprom once specifying one bigger threshold(v3) During driver's probe, when it hits bad gpu tag in eeprom i2c init calling(the tag was set when reported bad page reaches bad page threshold in last driver's working loop), there are some strategys to deal with the cases: 1. when the module parameter amdgpu_bad_page_threshold = 0, that means page retirement feature is disabled, so just resetting the eeprom is fine. 2. When amdgpu_bad_page_threshold is not 0, and moreover, user sets one bigger valid data in order to make current boot up succeeds, correct eeprom header tag and do not break booting. 3. For other cases, driver's probe will be broken. v2: Just update eeprom header tag instead of resetting the whole table header when user sets one bigger threshold data. v3: Use dev_info/dev_err to print PCI device information, which helps in mGPU case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-04 17:27:25 -04:00

1 2

66 Commits