linux

iv/linux

Author	SHA1	Message	Date
Jiange Zhao	3e183e2fae	drm/amdgpu: Add MB_REQ_MSG_READY_TO_RESET response when VF get FLR notification. When guest received FLR notification from host, it would lock adapter into reset state. There will be no more job submission and hardware access after that. Then it should send a response to host that it has prepared for host reset. Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-08-16 15:17:57 -04:00
Victor Zhao	124e8b1990	drm/amdgpu: Extend full access wait time in guest - Extend wait time and add retry, currently 6s * 2times - Change timing algorithm Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-08-09 15:42:10 -04:00
Jingwen Chen	798c511548	drm/amdgpu: SRIOV flr_work should take write_lock [Why] If flr_work takes read_lock, then other threads who takes read_lock can access hardware when host is doing vf flr. [How] flr_work should take write_lock to avoid this case. Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-07-13 11:48:09 -04:00
Jack Zhang	3c2a01cb0f	drm/amdgpu/sriov Stop data exchange for wholegpu reset [Why] When host trigger a whole gpu reset, guest will keep waiting till host finish reset. But there's a work queue in guest exchanging data between vf&pf which need to access frame buffer. During whole gpu reset, frame buffer is not accessable, and this causes the call trace. [How] After vf get reset notification from pf, stop data exchange. Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-01-13 23:47:39 -05:00
Jiange Zhao	3aa883ac8e	drm/amdgpu/SRIOV: Extend VF reset request wait period In Virtualization case, when one VF is sending too many FLR requests, hypervisor would stop responding to this VF's request for a long period of time. This is called event guard. During this period of cooling time, guest driver should wait instead of doing other things. After this period of time, guest driver would resume reset process and return to normal. Currently, guest driver would wait 12 seconds and return fail if it doesn't get response from host. Solution: extend this waiting time in guest driver and poll response periodically. Poll happens every 6 seconds and it will last for 60 seconds. v2: change the max repetition times from number to macro. Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Acked-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-12-15 11:35:35 -05:00
Liu ChengZhe	2a9787dcf5	drm/amdgpu: Do gpu recovery when no job is running In function flr_work, we should do gpu recovery when no job is running. Fix the logic by inverting it. v2: modify the description Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-09-15 17:24:18 -04:00
Dennis Li	6049db43d6	drm/amdgpu: change reset lock from mutex to rw_semaphore clients don't need reset-lock for synchronization when no GPU recovery. v2: change to return the return value of down_read_killable. v3: if GPU recovery begin, VF ignore FLR notification. Reviewed-by: Monk Liu <monk.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-24 12:23:48 -04:00
Dennis Li	53b3f8f40e	drm/amdgpu: refine codes to avoid reentering GPU recovery if other threads have holden the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-24 12:22:56 -04:00
jqdeng	9a1cddd637	drm/amdgpu: Fix repeatly flr issue Only for no job running test case need to do recover in flr notification. For having job in mirror list, then let guest driver to hit job timeout, and then do recover. Signed-off-by: jqdeng <Emily.Deng@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-18 18:22:02 -04:00
Christian König	f1403342eb	drm/amdgpu: revert "fix system hang issue during GPU reset" The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit `df9c8d1aa2`. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-08-14 16:22:40 -04:00
Dennis Li	df9c8d1aa2	drm/amdgpu: fix system hang issue during GPU reset when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover, the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe that other threads access GPU, which maybe cause GPU reset failed. Therefore the new rw_semaphore adev->reset_sem is introduced, which protect GPU from being accessed by external threads during recovery. v2: 1. add rwlock for some ioctls, debugfs and file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in kfd driver. 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. v3: 1. change back to use adev->reset_sem to protect kfd callback functions, because dqm_lock couldn't protect all codes, for example: free_mqd must be called outside of dqm_lock; [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 1230.177221] Call Trace: [ 1230.178249] dump_stack+0x98/0xd5 [ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu] [ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu] [ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu] [ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu] [ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm] [ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm] [ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm] [ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm] [ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm] [ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm] [ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu] [ 1230.193833] free_mqd+0x25/0x40 [amdgpu] [ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu] [ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu] [ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu] [ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu] [ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu] [ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20 [ 1230.202831] ksys_ioctl+0x98/0xb0 [ 1230.204004] __x64_sys_ioctl+0x1a/0x20 [ 1230.205174] do_syscall_64+0x5f/0x250 [ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe 2. remove try_lock and introduce atomic hive->in_reset, to avoid re-enter GPU recovery. v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Suggested-by: Luben Tukov <luben.tuikov@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-07-27 16:21:37 -04:00
Monk Liu	ff1f03a7b8	drm/amdgpu: use static mmio offset for NV mailbox what: with the new "req_init_data" handshake we need to use mailbox before do IP discovery, so in mxgpu_nv.c file the original SOC15_REG method won'twork because that depends on IP discovery complete first. how: so the solution is to always use static MMIO offset for NV+ mailbox registers. HW team confirm us all MAILBOX registers will be at the same offset for all ASICs, no IP discovery needed for those registers Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-04-01 14:44:43 -04:00
Monk Liu	aa53bc2edb	drm/amdgpu: introduce new request and its function 1) modify xgpu_nv_send_access_requests to support new idh request 2) introduce new function: req_gpu_init_data() which is used to notify host to prepare vbios/ip-discovery/pfvf exchange Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-04-01 14:44:43 -04:00
Monk Liu	4d130238a7	drm/amdgpu: cleanup idh event/req for NV headers 1) drop the headers from AI in mxgpu_nv.c, should refer to mxgpu_nv.h 2) the IDH_EVENT_MAX is not used and not aligned with host side so drop it 3) the IDH_TEXT_MESSAG was provided in host but not defined in guest Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-04-01 14:44:43 -04:00
zhengbin	6df3dab619	drm/amdgpu: use true, false for bool variable in mxgpu_nv.c Fixes coccicheck warning: drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c:255:2-20: WARNING: Assignment of 0/1 to bool variable drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c:267:2-20: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-12-23 15:00:01 -05:00
Monk Liu	1512d064f5	drm/amdgpu: fix double gpu_recovery for NV of SRIOV issues: gpu_recover() is re-entered by the mailbox interrupt handler mxgpu_nv.c fix: we need to bypass the gpu_recover() invoke in mailbox interrupt as long as the timeout is not infinite (thus the TDR will be triggered automatically after time out, no need to invoke gpu_recover() through mailbox interrupt. Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-12-18 16:09:12 -05:00
Jiange Zhao	3636169cc0	drm/amdgpu: Add SRIOV mailbox backend for Navi1x Mimic the ones for Vega10, add mailbox backend for Navi1x Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 10:09:52 -05:00

17 Commits