IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
nbio ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split nbio callbacks
into ras and non-ras ones so gpu driver only
initializes nbio ras callbacks when it manages
nbio ras.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
query_ras_error_address will be invoked to query bad
page address when there is poison data in HBM consumed
by GPU engines.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
umc query_ras_error_count will be invoked to query
umc correctable and uncorrectable error. It will
reset the umc ras error counter after the query.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
umc_v6_7_funcs are callbacks to support umc ras
functionalities in aldebaran
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Driver only manages GFX/SDMA/MMHUB RAS in platforms
that gpu node is connected to cpu through XGMI, other
than that, it queries VBIOS for RAS capabilities.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
These were leftover from the old CI dpm code which was
retired a while ago.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
using the control bits got from host to control registers access.
Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com>
Reviewed-by: Emily.Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
get pf2vf msg info at it's earliest time so that
guest driver can use these info to decide whether
register indirect access enabled.
Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com>
Reviewed-by: Emily.Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The page table of AMDGPU requires an alignment to CPU page so we should
check ioctl parameters for it. Return -EINVAL if some parameter is
unaligned to CPU page, instead of corrupt the page table sliently.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In Mesa, dev_info.gart_page_size is used for alignment and it was
set to AMDGPU_GPU_PAGE_SIZE(4KB). However, the page table of AMDGPU
driver requires an alignment on CPU pages. So, for non-4KB page system,
gart_page_size should be max_t(u32, PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE).
Signed-off-by: Rui Wang <wangr@lemote.com>
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Link: https://github.com/loongson-community/linux-stable/commit/caa9c0a1
[Xi: rebased for drm-next, use max_t for checkpatch,
and reworded commit message.]
Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Tested-by: Dan Horák <dan@danny.cz>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
int write = !(gtt->userflags & AMDGPU_GEM_USERPTR_READONLY);
v2: put short variable declaration last
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add 3 sub flags to notify guest for indirect reg access of
gc, mmhub and ih
The host sets these flags depending on L1 RAP version,
asic and other scenarios. These flags ensure that
there is compatibility between different guest/host/vbios versions.
Signed-off-by: Rohit Khaire <rohit.khaire@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This message is not needed on Aldebaran.
Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When resume from gpu reset, need set MP1 state to UNLOAD before reload SMU
FW otherwise will cause following errors:
[ 121.642772] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR [ 123.801051] [drm] failed to load ucode id (24) [ 123.801055] [drm] psp command (0x6) failed and response status is (0x0) [ 123.801214] [drm:psp_load_smu_fw [amdgpu]] *ERROR* PSP load smu failed!
[ 123.801398] [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [ 123.801536] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22 [ 123.801632] amdgpu 0000:04:00.0: amdgpu: GPU reset(9) failed [ 123.801691] amdgpu 0000:07:00.0: amdgpu: GPU reset(9) failed [ 123.802899] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -22
v2: add error info and including ALDEBARAN also
Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If reset handler is not implemented, reset error before proceeding.
Fixes issue with the following trace -
[ 106.508592] amdgpu 0000:b1:00.0: amdgpu: ASIC reset failed with error, -38 for drm dev, 0000:b1:00.0
[ 106.508972] amdgpu 0000:b1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 106.509116] [drm] PCIE GART of 512M enabled.
[ 106.509120] [drm] PTB located at 0x0000008000000000
[ 106.509136] [drm] VRAM is lost due to GPU reset!
[ 106.509332] [drm] PSP is resuming...
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Starting Arcturus, it supports ih reroute through mmio directly
in bare metal environment. This is also valid for newer asics
such as Aldebaran.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Offset calculation wasn't correct as start addresses are in pfn
not in bytes.
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No need to have special handling for swSMU supported ASICs.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No need to have special handling for swSMU supported ASICs.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix header guard and make internal functions static. Fixes the below warnings:
drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu_reset.h:24:9: warning: '__AMDUGPU_RESET_H__' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
drivers/gpu/drm/amd/amdgpu/aldebaran.c:110:6: warning: no previous prototype for function 'aldebaran_async_reset' [-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/aldebaran_ppt.c:1435:5: warning: no previous prototype for function 'aldebaran_mode2_reset' [-Wmissing-prototypes]
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add aldebaran to devices which support recovery
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Aldebaran uses reset control to support mode2 reset. The sequences to
reset and restore hardware context are specific to a particular
configuration.
v2: Clear bus mastering before reset.
Fix coding style issues, drop unwanted variables and info log.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Expose PG/CG set states functions for other clients
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Adds a function to load a list of FWs as passed by the caller. This is
needed as only a select need to loaded for some use cases.
v2: Omit unrelated change, remove info log, fix return value when count is 0
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This prefers reset control based handling if it's implemented
for a particular ASIC. If not, it takes the legacy path. It uses
the legacy method of preparing environment (job, scheduler tasks)
and restoring environment.
v2: remove unused variable (Alex)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Add generic amdgpu_reset_control to handle different types of resets. It
may be added at device, hive or ip level. Each reset control has a list
of handlers associated with it to handle different types of reset. Reset
control is responsible for choosing the right handler given a particular
reset context.
Handler objects may implement a set of functions on how to handle a
particular type of reset.
prepare_env = Prepare environment/software context (not used currently).
prepare_hwcontext = Prepare hardware context for the reset.
perform_reset = Perform the type of reset.
restore_hwcontext = Restore the hw context after reset.
restore_env = Restore the environment after reset (not used currently).
Reset context carries the context of reset, as of now this is based on
the parameters used for current set of resets.
v2: Fix coding style
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring share
internal GC HW mutually.
[How]
This patch implements an advanced tdr mode.It involves an additinal
synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
step in order to find the real bad job.
1. At Step0 Resubmit stage, it synchronously submits and pends for the
first job being signaled. If it gets timeout, we identify it as guilty
and do hw reset. After that, we would do the normal resubmit step to
resubmit left jobs.
2. For whole gpu reset(vram lost), do resubmit as the old way.
v2: squash in build fix (Alex)
Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
BO with ttm_bo_type_sg type can also have tiling_flag and metadata.
So so BO type check for only ttm_bo_type_kernel.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reported-by: Tom StDenis <Tom.StDenis@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tiling flag and metadata are only needed for BOs created by
amdgpu_gem_object_create(), so we can remove those from the
base class.
v2: * squash tiling_flags and metadata relared patches into one
* use BUG_ON for non ttm_bo_type_device type when accessing
tiling_flags and metadata._
v3: *include to_amdgpu_bo_user
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use amdgpu_bo_create_user() for all the BO allocations for
ttm_bo_type_device type.
v2: include amdgpu_amdkfd_alloc_gws() as well it calls amdgpu_bo_create()
for ttm_bo_type_device
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Implement a new struct amdgpu_bo_user as subclass of
struct amdgpu_bo and a function to created amdgpu_bo_user
bo with a flag to identify the owner.
v2: amdgpu_bo_to_amdgpu_bo_user -> to_amdgpu_bo_user()
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Allow allocating BO structures with different structure size
than struct amdgpu_bo.
v2: Check bo_ptr_size in all amdgpu_bo_create() caller.
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add VCN3 IB parsing to figure out to which instance we can send the
stream for decode.
v2: remove VCN instance limit as well, fix amdgpu_cs_find_mapping,
check supported formats instead of unsupported.
v3: fix typo and error handling
v4: make sure the message BO is CPU accessible
v5: fix addr calculation once more
v6: only check message buffers
v7: fix constant and use defines
v8: fix create msg calculation
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The VCN3 instances can do both decode as well as encode.
Share the scheduler load balancing score and remove fixing encode to
only the second instance.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-and-Tested-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Allow separate ring to share the same scheduler score.
No functional change.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-and-Tested-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A previous fix I did left a rather complicated loop in
amdgpu_securedisplay_debugfs_write() for what could be expressed in a
simple sprintf, as Rasmus pointed out.
This drops the leading 0x for each byte, but is otherwise
much nicer.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
On initializing the framebuffer, call drm_any_plane_has_format to do a
check if the modifier is supported. drm_any_plane_has_format calls
dm_plane_format_mod_supported which is extended to validate that the
modifier is on the list of the plane's supported modifiers.
The bug was caught using igt-gpu-tools test: kms_addfb_basic.addfb25-bad-modifier
Tested on ChromeOS Zork by turning on the display, running an overlay
test, and running a YT video.
=== Changes from v1 ===
Explicitly handle DRM_FORMAT_MOD_INVALID modifier.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Mark Yacoub <markyacoub@chromium.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the following coccicheck warning:
drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING:
use scnprintf or sprintf
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
That is unused for quite some time now.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use positive logic to check for RAS
support. Rename the function to actually indicate
what it is testing for. Essentially, make the
function a predicate with the correct name.
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Setting amdgpu.noretry=1 as default for Arcturus.
Signed-off-by: Philip Cox <Philip.Cox@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
updated host to send boot config to psp to enable GECC for sienna cichlid
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>