linux

iv/linux

Author	SHA1	Message	Date
Thomas Hellström	845f64bdbf	drm/xe: Introduce a range-fence utility Add generic utility to track range conflicts signaled by a dma-fence. Tracking implemented via an interval tree. An example use case being tracking conflicts for pending (un)binds from multiple bind engines. By being generic ths idea would this could moved to the DRM level and used in multiple drivers for similar problems. v2: Make interval tree functions static (CI) v3: Remove non-static cleanup function (CI) Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:52 -05:00
Francois Dugast	0043a3e8a1	drm/xe/execlist: Log when using execlist submission Make explicit in the log that execlist submission is used to prevent from silently using it over GuC submission. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:52 -05:00
Francois Dugast	72e8d73b71	drm/xe: Cleanup style warnings and errors Fix 6 errors and 20 warnings reported by checkpatch.pl. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:52 -05:00
Francois Dugast	ea82d5aab5	drm/xe/execlist: Remove leftover printk messages Those look like leftover debug and are not even being used. If they were real debug/info, they should be using the drm helpers. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:52 -05:00
Francois Dugast	955c09e2cc	drm/xe: Rely on kmalloc/kzalloc log message Those messages are unnecessary because a generic message is already produced in case of allocation failure. Besides, this also removes a misuse of the XE_IOCTL_DBG macro. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:51 -05:00
Lucas De Marchi	4d18eac032	drm/xe: Use FIELD_PREP/FIELD_GET for tile id encoding Use FIELD_PREP()/FIELD_GET() to encode the tile id into flags. Besides protecting for eventual overflow it also makes it easier to see a new flag can't be added as BIT(7). Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20230718193924.3084759-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:51 -05:00
Lucas De Marchi	0d39b6daa5	drm/xe: Normalize XE_VM_FLAG* names Rename XE_VM_FLAGS_64K to XE_VM_FLAG_64K to follow the other names and s/GT/TILE/ that got missed in commit 08dea7674533 ("drm/xe: Move migration from GT to tile"). Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20230718193924.3084759-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:37 -05:00
Matthew Auld	ee82d2da9c	drm/xe: add missing bulk_move reset It looks like bulk_move is set during object construction, but is only removed on object close, however in various places we might not yet have an actual fd to close, like on the error paths for the gem_create ioctl, and also one internal user for the evict_test_run_gt() selftest. Try to handle those cases by manually resetting the bulk_move. This should prevent triggering: WARNING: CPU: 7 PID: 8252 at drivers/gpu/drm/ttm/ttm_bo.c:327 ttm_bo_release+0x25e/0x2a0 [ttm] v2 (Nirmoy): - It should be safe to just unconditionally call __xe_bo_unset_bulk_move() in most places. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:37 -05:00
Matthew Auld	5a142f9c67	drm/xe/selftests: restart GT after xe_bo_restore_kernel() Test seems to be failing badly after calling xe_bo_restore_kernel(). Taking a snapshot of the CTB and copying back a potentially old version seems risky, depending on what might have been inflight. Also it seems snapshotting the ADS object and copying back results in serious breakage. Normally when calling xe_bo_restore_kernel() we always fully restart the GT, which re-intializes such things. We could potentially skip saving and restoring such objects in xe_bo_evict_all() however seems quite fragile not to also restart the GT. Try to do that here by triggering a GT reset. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Acked-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:37 -05:00
Matthew Auld	939902913a	drm/xe/selftests: hold rpm for ccs_test_migrate() The GPU job will keep the device awake, however assumption here is that caller of xe_migrate_clear() is also holding mem_access.ref otherwise we hit the asserts in xe_sa_bo_flush_write() prior to the job construction. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:37 -05:00
Matthew Auld	6a0612aeab	drm/xe/selftests: hold rpm for evict_test_run_device() We are calling fairly low level things like xe_bo_restore_kernel() which expect caller to be holding mem_access.ref. Since we are doing stuff like evict_all we likely don't want to race with rpm suspend, since that potentially wants to do the same thing, so just wrap the whole test. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	e3d2309250	drm/xe: add lockdep annotation for xe_device_mem_access_get() The atomics here might hide potential issues, also rpm core is not holding any lock when calling our rpm resume callback, so add a dummy lock with the idea that xe_pm_runtime_resume() is eventually going to be called when we are holding it. This only needs to happen once and then lockdep can validate all callers and their locks. v2: (Thomas Hellström) - Prefer static lockdep_map instead of full blown mutex. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	7d623575a3	drm/xe: drop xe_device_mem_access_get() from invalidation_vma Lockdep gives the following splat: [ 594.158863] ffff888140da53f0 (&vm->userptr.notifier_lock){++++}-{3:3}, at: vma_userptr_invalidate+0xeb/0x330 [xe] [ 594.158921] but task is already holding lock: [ 594.158926] ffffffff82761940 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: unmap_vmas+0x0/0x1c0 [ 594.158941] which lock already depends on the new lock. [ 594.158947] the existing dependency chain (in reverse order) is: [ 594.158953] -> #5 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}: [ 594.158961] fs_reclaim_acquire+0x68/0xd0 [ 594.158969] __kmem_cache_alloc_node+0x2c/0x1b0 [ 594.158975] kmalloc_node_trace+0x1d/0xb0 [ 594.158983] alloc_worker+0x18/0x50 [ 594.158989] init_rescuer.part.0+0x13/0xa0 [ 594.158995] workqueue_init+0xdf/0x210 [ 594.159001] kernel_init_freeable+0x5c/0x2f0 [ 594.159009] kernel_init+0x11/0x1a0 [ 594.159017] ret_from_fork+0x29/0x50 [ 594.159023] -> #4 (fs_reclaim){+.+.}-{0:0}: [ 594.159031] fs_reclaim_acquire+0xa0/0xd0 [ 594.159037] __kmem_cache_alloc_node+0x2c/0x1b0 [ 594.159042] kmalloc_trace+0x20/0xb0 [ 594.159048] acpi_device_add+0x25a/0x3f0 [ 594.159056] acpi_add_single_object+0x387/0x750 [ 594.159063] acpi_bus_check_add+0x108/0x280 [ 594.159069] acpi_bus_scan+0x34/0xf0 [ 594.159075] acpi_scan_init+0xed/0x2b0 [ 594.159082] acpi_init+0x21e/0x520 [ 594.159087] do_one_initcall+0x53/0x260 [ 594.159092] kernel_init_freeable+0x18a/0x2f0 [ 594.159099] kernel_init+0x11/0x1a0 [ 594.159105] ret_from_fork+0x29/0x50 [ 594.159110] -> #3 (acpi_device_lock){+.+.}-{3:3}: [ 594.159117] __mutex_lock+0x95/0xd10 [ 594.159122] acpi_enable_wakeup_device_power+0x30/0x120 [ 594.159130] __acpi_device_wakeup_enable+0x34/0x110 [ 594.159138] acpi_pm_set_device_wakeup+0x55/0x140 [ 594.159143] __pci_enable_wake+0x56/0xb0 [ 594.159150] pci_finish_runtime_suspend+0x35/0x80 [ 594.159157] pci_pm_runtime_suspend+0xb5/0x1a0 [ 594.159162] __rpm_callback+0x3c/0x110 [ 594.159170] rpm_callback+0x58/0x70 [ 594.159176] rpm_suspend+0x15c/0x6f0 [ 594.159182] pm_runtime_work+0x9b/0xb0 [ 594.159188] process_one_work+0x263/0x520 [ 594.159195] worker_thread+0x4d/0x3b0 [ 594.159200] kthread+0xeb/0x120 [ 594.159206] ret_from_fork+0x29/0x50 [ 594.159211] -> #2 (acpi_wakeup_lock){+.+.}-{3:3}: [ 594.159218] __mutex_lock+0x95/0xd10 [ 594.159223] acpi_pm_set_device_wakeup+0x7a/0x140 [ 594.159228] __pci_enable_wake+0x77/0xb0 [ 594.159234] pci_pm_runtime_resume+0x70/0xd0 [ 594.159240] __rpm_callback+0x3c/0x110 [ 594.159246] rpm_callback+0x58/0x70 [ 594.159252] rpm_resume+0x50d/0x7a0 [ 594.159258] rpm_resume+0x267/0x7a0 [ 594.159264] __pm_runtime_resume+0x45/0x90 [ 594.159270] xe_pm_runtime_resume_and_get+0x12/0x50 [xe] [ 594.159314] xe_device_mem_access_get+0x97/0xc0 [xe] [ 594.159346] hw_engines+0x65/0xf0 [xe] [ 594.159380] seq_read_iter+0x10d/0x4b0 [ 594.159385] seq_read+0x9e/0xd0 [ 594.159390] full_proxy_read+0x4e/0x80 [ 594.159396] vfs_read+0xb6/0x310 [ 594.159401] ksys_read+0x60/0xe0 [ 594.159406] do_syscall_64+0x38/0x90 [ 594.159413] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 594.159419] -> #1 (&xe->mem_access.lock){+.+.}-{3:3}: [ 594.159427] xe_device_mem_access_get+0x43/0xc0 [xe] [ 594.159457] xe_gt_tlb_invalidation_vma+0x53/0x190 [xe] [ 594.159490] invalidation_fence_init+0x1d2/0x2c0 [xe] [ 594.159529] __xe_pt_unbind_vma+0x151/0x4e0 [xe] [ 594.159564] vm_bind_ioctl+0x48a/0xae0 [xe] [ 594.159602] async_op_work_func+0x20c/0x530 [xe] [ 594.159634] process_one_work+0x263/0x520 [ 594.159640] worker_thread+0x4d/0x3b0 [ 594.159646] kthread+0xeb/0x120 [ 594.159650] ret_from_fork+0x29/0x50 [ 594.159655] -> #0 (&vm->userptr.notifier_lock){++++}-{3:3}: [ 594.159663] __lock_acquire+0x16fa/0x2850 [ 594.159670] lock_acquire+0xd2/0x2e0 [ 594.159676] down_write+0x36/0xd0 [ 594.159681] vma_userptr_invalidate+0xeb/0x330 [xe] [ 594.159714] __mmu_notifier_invalidate_range_start+0x239/0x2a0 [ 594.159722] unmap_vmas+0x1ac/0x1c0 [ 594.159727] unmap_region+0xb5/0x120 [ 594.159732] do_vmi_align_munmap+0x2be/0x430 [ 594.159739] do_vmi_munmap+0xea/0x120 [ 594.159744] __vm_munmap+0x9c/0x160 [ 594.159750] __x64_sys_munmap+0x12/0x20 [ 594.159756] do_syscall_64+0x38/0x90 [ 594.159761] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 594.159768] other info that might help us debug this: [ 594.159773] Chain exists of: &vm->userptr.notifier_lock --> fs_reclaim --> mmu_notifier_invalidate_range_start [ 594.159785] Possible unsafe locking scenario: [ 594.159790] CPU0 CPU1 [ 594.159794] ---- ---- [ 594.159797] lock(mmu_notifier_invalidate_range_start); [ 594.159802] lock(fs_reclaim); [ 594.159808] lock(mmu_notifier_invalidate_range_start); [ 594.159814] lock(&vm->userptr.notifier_lock); [ 594.159819] The VM should be holding a mem_access.ref so this looks like it should be a false positive and we can just drop the explicit mem_access in xe_gt_tlb_invalidation(). The GGTT invalidation path also takes care to hold mem_access.ref so should be fine there also, and we already assert that we hold access.ref for the GuC communication underneath. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	e018f44b29	drm/xe/ggtt: prime ggtt->lock against FS_RECLAIM Increase the sensitivity of the ggtt->lock by priming it against FS_RECLAIM, such that allocating memory while holding will result in lockdep splats. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	7eed01a926	drm/xe: drop xe_device_mem_access_get() from guc_ct_send The callers should already be holding the mem_access reference, before calling into this. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	03af26c9c9	drm/xe: ensure correct access_put ordering Only call access_put after dropping the forcewake. In theory the device could suspend, but really we want to start asserting that we have a mem_access.ref when touching mmio. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	7da1d76ff6	drm/xe/mmio: grab mem_access in xe_mmio_ioctl Any kind of device memory access should first ensure the device is not suspended, mmio included. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:36 -05:00
Matthew Auld	2d3ab1fa31	drm/xe/guc_pc: add missing mem_access for freq_rpe_show The mem_access is meant to cover any kind of device level memory access, mmio included. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Matthew Auld	6bfbd0c589	drm/xe/debugfs: grab mem_access around forcewake We need keep the device awake when performing any kind of mmio operation. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/279 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Matthew Auld	2d30332a5e	drm/xe/vm: tidy up xe_runtime_pm usage The xe_device_mem_access_get() should be all that's needed here and should now work as expected, without any strange races. In theory should be no functional changes here. Reported-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Matthew Auld	a00b8f1aae	drm/xe: fix xe_device_mem_access_get() races It looks like there is at least one race here, given that the pm_runtime_suspended() check looks to return false if we are in the process of suspending the device (RPM_SUSPENDING vs RPM_SUSPENDED). We later also do xe_pm_runtime_get_if_active(), but since the device is suspending or has now suspended, this doesn't do anything either. Following from this we can potentially return from xe_device_mem_access_get() with the device suspended or about to be, leading to broken behaviour. Attempt to fix this by always grabbing the runtime ref when our internal ref transitions from 0 -> 1. The hard part is then dealing with the runtime_pm callbacks also calling xe_device_mem_access_get() and deadlocking, which the pm_runtime_suspended() check prevented. v2: - ct->lock looks to be primed with fs_reclaim, so holding that and then allocating memory will cause lockdep to complain. Now that we unconditionally grab the mem_access.lock around mem_access_{get,put}, we need to change the ordering wrt to grabbing the ct->lock, since some of the runtime_pm routines can allocate memory (or at least that's what lockdep seems to suggest). Hopefully not a big deal. It might be that there were already issues with this, just that the atomics where "hiding" the potential issues. v3: - Use Thomas Hellström' idea with tracking the active task that is executing in the resume or suspend callback, in order to avoid recursive resume/suspend calls deadlocking on itself. - Split the ct->lock change. v4: - Add smb_mb() around accessing the pm_callback_task for extra safety. (Thomas Hellström) v5: - Clarify the kernel-doc for the mem_access.lock, given that it is quite strange in what it protects (data vs code). The real motivation is to aid lockdep. (Rodrigo Vivi) v6: - Split out the lock change. We still want this as a lockdep aid but only for the xe_device_mem_access_get() path. Sticking a lock on the put() looks be a no-go, also the runtime_put() there is always async. - Now that the lock is gone move to atomics and rely on the pm code serialising multiple callers on the 0 -> 1 transition. - g2h_worker_func() looks to be the next issue, given that suspend-resume callbacks are using CT, so try to handle that. v7: - Add xe_device_mem_access_get_if_ongoing(), and use it in g2h_worker_func(). v8 (Anshuman): - Just always grab the rpm, instead of just on the 0 -> 1 transition, which is a lot clearer and simplifies the code quite a bit. v9: - Make sure we also adjust the CT fast-path with if-active. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Anshuman Gupta <anshuman.gupta@intel.com> Acked-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Anshuman Gupta	09d88e3beb	drm/xe/pm: Init pcode and restore vram on power lost Don't init pcode and restore VRAM objects in vain. We can rely on primary GT GUC_STATUS to detect whether card has really lost power even when d3cold is allowed by xe. Adding d3cold.lost_power flag to avoid pcode init and vram restoration. Also cleaning up the TODO code comment. v2: - %s/xe_guc_has_lost_power()/xe_guc_in_reset(). - Used existing gt instead of new variable. [Rodrigo] - Added kernel-doc function comment. [Rodrigo] - xe_guc_in_reset() return true if failed to get fw. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230718080703.239343-6-anshuman.gupta@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Anshuman Gupta	2ef08b9802	drm/xe/pm: Toggle d3cold_allowed using vram_usages Adding support to control d3cold by using vram_usages metric from ttm resource manager. When root port is capable of d3cold but xe has disallowed d3cold due to vram_usages above vram_d3ccold_threshol. It is required to disable d3cold to avoid any resume failure because root port can still transition to d3cold when all of pcie endpoints and {upstream, virtual} switch ports will transition to d3hot. Also cleaning up the TODO code comment. v2: - Modify d3cold.allowed in xe_pm_d3cold_allowed_toggle. [Riana] - Cond changed (total_vram_used_mb < xe->d3cold.vram_threshold) according to doc comment. v3: - Added enum instead of true/false argument in d3cold_toggle(). [Rodrigo] - Removed TODO comment. [Rodrigo] Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230718080703.239343-5-anshuman.gupta@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Anshuman Gupta	b2d756199b	drm/xe/pm: Add vram_d3cold_threshold Sysfs Add per pci device vram_d3cold_threshold Sysfs to control the d3cold allowed knob. Adding a d3cold structure embedded in xe_device to encapsulate d3cold related stuff. v2: - Check total vram before initializing default threshold. [Riana] - Add static scope to vram_d3cold_threshold DEVICE_ATTR. [Riana] v3: - Fixed cosmetics review comment. [Riana] - Fixed CI Hook failures. - Used drmm_mutex_init(). v4: - Fixed kernel-doc warnings. v5: - Added doc explaining need for the device sysfs. [Rodrigo] - Removed TODO comment. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Riana Tauro <riana.tauro@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230718080703.239343-4-anshuman.gupta@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Anshuman Gupta	fddebcbf7a	drm/xe/pm: Refactor xe_pm_runtime_init Wrap xe_pm_runtime_init inside xe_pm_init. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230718080703.239343-3-anshuman.gupta@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:35 -05:00
Anshuman Gupta	ac0be3b5b2	drm/xe/pm: Add pci d3cold_capable support Adding pci d3cold_capable check in order to initialize d3cold_allowed as false statically. It avoids vram save/restore latency during runtime suspend/resume v2: - Added else block to xe_pci_runtime_idle. [Rodrigo] Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230718080703.239343-2-anshuman.gupta@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:34 -05:00
Riana Tauro	1737785ae5	drm/xe: remove gucrc disable from suspend path Currently GuCRC is disabled in suspend path for xe. Rc6 is a prerequiste to enable s0ix and should not be disabled for s2idle. There is no requirement to disable GuCRC for S3+. Remove it from xe_guc_pc_stop, thus removing from suspend path. Retain the call in other places where xe_guc_pc_stop is called. v2: add description and return statement to kernel-doc (Rodrigo) v3: update commit message (Rodrigo) v4: add mem_access_get to the gucrc disable function Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:31 -05:00
Francois Dugast	3e8e7ee6a3	drm/xe: Cleanup style warnings Reduce the number of warnings reported by checkpatch.pl from 118 to 48 by addressing those warnings types: LEADING_SPACE LINE_SPACING BRACES TRAILING_SEMICOLON CONSTANT_COMPARISON BLOCK_COMMENT_STYLE RETURN_VOID ONE_SEMICOLON SUSPECT_CODE_INDENT LINE_CONTINUATIONS UNNECESSARY_ELSE UNSPECIFIED_INT UNNECESSARY_INT MISORDERED_TYPE Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:31 -05:00
Francois Dugast	b8c1ba831e	drm/xe: Prevent flooding the kernel log with XE_IOCTL_ERR Lower log level of XE_IOCTL_ERR macro to debug in order to prevent flooding kernel log. v2: Rename XE_IOCTL_ERR to XE_IOCTL_DBG (Rodrigo Vivi) v3: Rebase v4: Fix style, remove unrelated change about __FILE__ and __LINE__ Link: https://lists.freedesktop.org/archives/intel-xe/2023-May/004704.html Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	5ce5830344	drm/xe: Fix typos Fix minor issues: remove extra ';' and s/Initialise/Initialize/. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	f5b85ab62b	drm/xe: Cleanup COMPLEX_MACRO style issues Remove some style issues of type COMPLEX_MACRO reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	80c58bdf0e	drm/xe: Cleanup TRAILING_WHITESPACE style issues Remove all existing style issues of type TRAILING_WHITESPACE reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	763931d25c	drm/xe: Cleanup CODE_INDENT style issues Remove all existing style issues of type CODE_INDENT reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	4ab5901cc0	drm/xe: Cleanup POINTER_LOCATION style issues Remove all existing style issues of type POINTER_LOCATION reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	fb1d55efdf	drm/xe: Cleanup OPEN_BRACE style issues Remove almost all existing style issues of type OPEN_BRACE reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Francois Dugast	4cd6d49259	drm/xe: Cleanup SPACING style issues Remove almost all existing style issues of type SPACING reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00
Brian Welty	04194a4f78	drm/xe: Fix lockdep warning from xe_vm_madvise We need to hold vm->lock before the xe_vm_is_closed_or_banned(). Else we get this splat: [ 802.555227] ------------[ cut here ]------------ [ 802.555234] WARNING: CPU: 33 PID: 3122 at drivers/gpu/drm/xe/xe_vm.h:60 [ 802.555515] CPU: 33 PID: 3122 Comm: xe_exec_fault_m Tainted: ... [ 802.555709] Call Trace: [ 802.555714] <TASK> [ 802.555720] ? __warn+0x81/0x170 [ 802.555737] ? xe_vm_madvise_ioctl+0x2de/0x440 [xe] Fixes: 9d858b69b0cf ("drm/xe: Ban a VM if rebind worker hits an error") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:13 -05:00
Brian Welty	b1f8f4b5ee	drm/xe: Fix BUG_ON during bind with prefetch It was missed that print_op needs to include DRM_GPUVA_OP_PREFETCH. Else we hit the impossible BUG_ON: [ 886.371040] ------------[ cut here ]------------ [ 886.371047] kernel BUG at drivers/gpu/drm/xe/xe_vm.c:2234! [ 886.371216] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 886.371229] CPU: 1 PID: 3132 Comm: xe_exec_fault_m [ 886.371257] RIP: 0010:vm_bind_ioctl_ops_create+0x45f/0x470 [xe] ... [ 886.371517] Call Trace: [ 886.371525] <TASK> [ 886.371531] ? __die_body+0x1a/0x60 [ 886.371546] ? die+0x38/0x60 [ 886.371557] ? do_trap+0x10a/0x120 [ 886.371568] ? vm_bind_ioctl_ops_create+0x45f/0x470 [xe] v2: add debug print for PREFETCH in print_op Fixes: b06d47be7c83 ("drm/xe: Port Xe to GPUVA") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:36:23 -05:00
Matthew Auld	356010a1a0	drm/xe/mmio: update gt_count when probing multi-tile It looks like the single-tile PVC in CI dies during module load when doing the pcode init. From the logs we try to access the address 0000000000138124 which doesn't map to anything, however 0x138124 also looks to be the PCODE_MAILBOX register. So looks like the per-tile mmio register mapping is NULL. During probe the tile count is potentially trimmed, since we don't know the real count until we actually probe the device. This seems to be the case for single-tile PVC or similar devices. However it looks like the gt_count is never adjusted to respect this updated tile count. As a result when later doing some for_each_gt() loop, like we do for the pcode, we can get back some GT that maps to some non-existent tile which hasn't been properly set up, leading to crashes. Try to fix this by adjusting the gt_count after probing the tiles for real. v2: Fix typo so it actually builds References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/383 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	35c8a96439	drm/xe: handle TLB invalidations from CT fast-path In various test cases that put the system under a heavy load, we can sometimes see errors with missed TLB invalidations. In such cases we see the interrupt arrive for the invalidation from the GuC, however the actual processing of the completion is pushed onto a workqueue and handled with all the other CT stuff, which might take longer than expected. Since we expect TLB invalidations to complete within a reasonable amount of time (at most ~250ms), and they do seem pretty critical, allow handling directly from the CT fast-path. v2 (José): - Actually use the correct spinlock/unlock_irq, since pending_lock is grabbed from IRQ. v3: - Don't publish the TLB fence on the list until after we fully initialize it and successfully do the CT send. The list is now only protected by the spin_lock pending_lock and we can't hold that across the entire TLB send operation. v4 (Matt Brost): - Be careful with racing against fast CT path writing the seqno, before we have actually published the fence. References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/297 References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/320 References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/449 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	0b688f9b28	drm/xe/ct: update g2h outstanding for CTB capture Looks to always to be zero when inspecting the CTB dump. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	4aa5e3594f	drm/xe/tlb: print seqno_recv on fence TLB timeout To help debugging, sample the current seqno_recv and dump it out if we encounter a TLB timeout for the fences path. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	2ca01fe31b	drm/xe/tlb: also update seqno_recv during reset We might have various kworkers waiting for TLB flushes to complete which are not tracked with an explicit TLB fence, however at this stage that will never happen since the CT is already disabled, so make sure we signal them here under the assumption that we have completed a full GT reset. v2: - We need to use seqno - 1 here. After acquiring ct->lock the seqno is actually the next users seqno and not the pending one. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	7b24cc3e30	drm/xe/gt: tweak placement for signalling TLB fences after GT reset Assumption here is that submission is disabled along with CT, and full GT reset will also nuke TLBs, so should be safe to signal all in-flight TLB fences, but only after the actual reset so move the placement slightly. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	a4d362bbed	drm/xe/ct: serialise fast_lock during CT disable The fast-path CT could be running as we enter a runtime-suspend or potentially a GT reset, however here we only use the ct->fast_lock and not the full ct->lock. Before disabling the CT, also serialise against the fast_lock to ensure any in-progress work finishes before we start nuking the CT related stuff. Once we disable ct->enabled and drop the lock, any new work should fail gracefully, and anything that was in progress should be finished. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:23 -05:00
Matthew Auld	4803f6e26f	drm/xe/tlb: increment next seqno after successful CT send If we are in the middle of a GT reset or similar the CT might be disabled, such that the CT send fails. However we already incremented gt->tlb_invalidation.seqno which might lead to warnings, since we effectively just skipped a seqno: 0000:00:02.0: drm_WARN_ON(expected_seqno != msg[0]) Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:22 -05:00
Matthew Auld	dad33831d8	drm/xe/ct: hold fast_lock when reserving space for g2h Reserving and checking for space on the g2h side relies on the fast_lock, and not the CT lock since we need to release space from the fast CT path. Make sure we hold it when checking for space and reserving it. The main concern is calling __g2h_release_space() as we are reserving something and since the info.space and info.g2h_outstanding operations are not atomic we can get some nonsense values back. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:22 -05:00
Matthew Auld	c4bbc32e09	drm/xe: hold mem_access.ref for CT fast-path Just checking xe_device_mem_access_ongoing() is not enough, we also need to hold the reference otherwise the ref can transition from 1 -> 0 as we enter g2h_read(), leading to warnings. While we can't do a full rpm sync in the IRQ, we can keep the device awake if the ref is non-zero. Introduce a new helper for this and set it to work in for the CT fast-path. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:22 -05:00
Matthew Auld	86ed09250e	drm/xe/tlb: ensure we access seqno_recv once Ensure we load gt->tlb_invalidation.seqno_recv once, and use that for our seqno checking. The gt->tlb_invalidation_seqno_past is a shared global variable and can potentially change at any point here. However the checks here need to operate on a stable version of seqno_recv for this to make any sense. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:22 -05:00
Matthew Auld	38fa29dc2b	drm/xe/tlb: drop unnecessary smp_wmb() wake_up_all() and wait_event_timeout() already have the correct barriers as per https://www.kernel.org/doc/Documentation/memory-barriers.txt. This should ensure that the seqno_recv write can't be re-ordered wrt to the actual wake_up_all() i.e we get woken up but there is no write. The reader side with wait_event_timeout() also has the correct barriers. With that drop the hand rolled smp_wmb(), which is anyway missing some kind of matching barrier on the reader side. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:35:22 -05:00

1 2 3 4 5 ...

1235317 Commits