1265376 Commits

Author SHA1 Message Date
Rodrigo Vivi
8ed9aaae39
drm/xe: Force wedged state and block GT reset upon any GPU hang
In many validation situations when debugging GPU Hangs,
it is useful to preserve the GT situation from the moment
that the timeout occurred.

This patch introduces a module parameter that could be used
on situations like this.

If xe.wedged module parameter is set to 2, Xe will be declared
wedged on every single execution timeout (a.k.a. GPU hang) right
after devcoredump snapshot capture and without attempting any
kind of GT reset and blocking entirely any kind of execution.

v2: Really block gt_reset from guc side. (Lucas)
    s/wedged/busted (Lucas)

v3: - s/busted/wedged
    - Really use global_flags (Dafna)
    - More robust timeout handling when wedging it.

v4: A really robust clean exit done by Matt Brost.
    No more kernel warns on unbind.

v5: Simplify error message (Lucas)

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dafna Hirschfeld <dhirschfeld@habana.ai>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00
Rodrigo Vivi
692818678e
drm/xe: declare wedged upon GuC load failure
Let's block the device upon any GuC load failure.
But let's continue with the probe so guc logs can be read
from the debugfs.

v2: - s/wedged/busted
    - do not block probe or we lose guc_logs in debugfs (Matt)

v3: - s/busted/wedged

v4: Do not change __xe_guc_upload return. (Himal)

Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-2-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00
Rodrigo Vivi
fb74b205cd
drm/xe: Introduce a simple wedged state
Introduce a very simple 'wedged' state where any attempt
to access the GPU is entirely blocked.

On some critical cases, like on gt_reset failure, we need to
block any other attempt to use the GPU. Otherwise we are at
a risk of reaching cases that would force us to reboot the machine.

So, when this cases are identified we corner and block any GPU
access. No IOCTL and not even another GT reset should be attempted.

The 'wedged' state in Xe is an end state with no way back.
Only a device "re-probe" (unbind + bind) can restore the GPU access.

v2: - s/wedged/busted (Lucas)
    - use unbind+bind instead of module reload (Lucas)
    - added more info on unbind operations and instruction on bug report
    - only print the message once.

v3: - s/busted/wedged (Ashutosh, Tvrtko, Thomas)
    - don't assume user has sudo and tee available (Lucas)

v4: - remove unnecessary cases around ct communication or migration.

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> #v2
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00
José Roberto de Souza
c8d4524ecc drm/xe: Add INSTDONE registers to devcoredump
This registers contains important information that can help with debug
of GPU hangs.

While at it also fixing the double line jump at the end of engine
registers for CCS engines.

v2:
- print other INSTDONE registers

v3:
- add for_each_geometry/compute_dss()

v4:
- print one slice_common_instdone per glice in DG2+

v5:
- rename registers prefix from DG2 to XEHPG (Zhanjun)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240424140319.61651-3-jose.souza@intel.com
2024-04-24 09:06:39 -07:00
José Roberto de Souza
082a634f60 drm/xe: Add helpers to loop over geometry and compute DSS
Some DSS can only be available for geometry while others can only be
available for compute.
So here adding helpers to loop only available DSS for given usage.

User of this helper will come in the next patch.

v2:
- drop has_dss()

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240424140319.61651-2-jose.souza@intel.com
2024-04-24 09:06:38 -07:00
José Roberto de Souza
f332625733 drm/xe: Store xe_hw_engine in xe_hw_engine_snapshot
A future patch will require gt and xe device structs, so here
replacing class by hwe.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240424140319.61651-1-jose.souza@intel.com
2024-04-24 09:06:37 -07:00
Michal Wajdeczko
49f853c78e drm/xe/pf: Clamp maximum execution quantum to 100s
GuC is silently clamping values of the execution quantum and
preemption timeout KLVs to 100s. Perform explicit clamping on the
driver side as later there is no way to read back values used by
the firmware and we shouldn't mislead the user about actual values
being used when we print them in dmesg or debugfs.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419123543.270-3-michal.wajdeczko@intel.com
2024-04-24 15:32:26 +02:00
Michal Wajdeczko
5a8c292f74 drm/xe/guc: Update VF configuration KLVs definitions
GuC firmware specification says that maximum value for the execution
quantum KLV is 100s and anything exceeding that will be clamped.
The same limitation applies to the preemption timeout KLV.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419123543.270-2-michal.wajdeczko@intel.com
2024-04-24 15:32:24 +02:00
Michal Wajdeczko
2cab6319b4 drm/xe/pf: Expose SR-IOV policy settings over debugfs
We already have functions to configure SR-IOV policies.
Allow to tweak those policy settings over debugfs.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423131244.2045-4-michal.wajdeczko@intel.com
2024-04-24 15:18:41 +02:00
Michal Wajdeczko
b00240b6a2 drm/xe/pf: Expose SR-IOV VF control commands over debugfs
We already have functions to control the VF.
Allow to control the VF using debugfs.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423131244.2045-3-michal.wajdeczko@intel.com
2024-04-24 15:18:40 +02:00
Michal Wajdeczko
e42a51fb9c drm/xe/pf: Expose SR-IOV VFs configuration over debugfs
We already have functions to configure VF resources and to print
actual provisioning details. Expose this functionality in debugfs
to allow experiment with different settings or inspect details in
case of unexpected issues with the provisioning.

As debugfs attributes are per-VF, we use parent d_inode->i_private
to store VFID, similarly how we did for per-GT attributes.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423131244.2045-2-michal.wajdeczko@intel.com
2024-04-24 15:18:38 +02:00
Michal Wajdeczko
11294bf38f drm/xe/kunit: Add PF service tests
Start with basic tests for VF/PF ABI version negotiation. As we
treat all platforms in the same way, we can run the tests on one
platform. More tests will likely come later.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423180436.2089-6-michal.wajdeczko@intel.com
2024-04-24 15:10:46 +02:00
Michal Wajdeczko
98e6280592 drm/xe/pf: Add SR-IOV GuC Relay PF services
We already have mechanism that allows a VF driver to communicate
with the PF driver, now add PF side handlers for VF2PF requests
defined in version 1.0 of VF/PF GuC Relay ABI specification.

The VF2PF_HANDSHAKE request must be used by the VF driver to
negotiate the ABI version prior to sending any other request.
We will reset any negotiated version later during FLR.

The outcome of the VF2PF_QUERY_RUNTIME requests depends on actual
platform, for legacy platforms used as SDV is provided as-is, for
latest platforms it is preliminary, and might be changed.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423180436.2089-5-michal.wajdeczko@intel.com
2024-04-24 15:10:42 +02:00
Michal Wajdeczko
dec793860d drm/xe: Add few more GT register definitions
While we are not using these registers right now, they are part
of some runtime register lists that PF driver share with VFs on
some legacy platforms that we might want to support as SDV.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423180436.2089-4-michal.wajdeczko@intel.com
2024-04-24 15:10:41 +02:00
Michal Wajdeczko
1cb4db30cf drm/xe: Add helper to calculate adjusted register offset
Our MMIO accessing functions automatically adjust addresses for the
media registers based on mmio.adj_limit and mmio.adj_offset logic.
Move it to the separate helper to avoid code duplication and to
allow using it by the upcoming changes to PF driver code.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423180436.2089-3-michal.wajdeczko@intel.com
2024-04-24 15:10:40 +02:00
Michal Wajdeczko
8f21f82d8b drm/xe/guc: Add GuC Relay ABI version 1.0 definitions
This initial GuC Relay ABI specification includes messages for ABI
version negotiation and to query values of runtime/fuse registers.

We will start handling those messages on the PF driver soon.

Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423180436.2089-2-michal.wajdeczko@intel.com
2024-04-24 15:10:39 +02:00
Thomas Hellström
06e7139a03 drm/xe: Fix unexpected backmerge results
The recent backmerge from drm-next to drm-xe-next brought with it
some silent unexpected results. One code snippet was added twice
and a partial revert had merge errors. Fix that up to
reinstate the affected code as it was before the backmerge.

v2:
- Commit log message rewording (Lucas DeMarchi)

Fixes: 79790b6818e9 ("Merge drm/drm-next into drm-xe-next")
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423121114.39325-1-thomas.hellstrom@linux.intel.com
2024-04-24 10:25:51 +02:00
Rodrigo Vivi
869e54d4d5
drm/xe: make xe_pm_runtime_lockdep_map a static struct
Fix the new sparse warning:

>> drivers/gpu/drm/xe/xe_pm.c:72:20: sparse: sparse: symbol
'xe_pm_runtime_lockdep_map' was not declared. Should it be static?

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404191329.EZzOTzwK-lkp@intel.com/
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240422201454.699089-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-23 10:43:22 -04:00
Michal Wajdeczko
48c64d495f drm/xe/guc: Fix arguments passed to relay G2H handlers
By default CT code was passing just payload of the G2H event
message, while Relay code expects full G2H message including
HXG header which contains DATA0 field. Fix that.

Fixes: 26d4481ac23f ("drm/xe/guc: Start handling GuC Relay event messages")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419150351.358-1-michal.wajdeczko@intel.com
2024-04-22 20:08:04 +02:00
Michal Wajdeczko
d3b80dc7aa drm/xe/pf: Fix xe_gt_sriov_pf_config_print_available_ggtt()
This function is using internal helper pf_get_spare_ggtt() that
expects PF's master mutex to be locked. Fix that.

Fixes: ac6598aed1b3 ("drm/xe/pf: Add support to configure SR-IOV VFs")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419141000.314-1-michal.wajdeczko@intel.com
2024-04-22 19:50:47 +02:00
Rodrigo Vivi
783d6cdc82
drm/xe: Kill xe_device_mem_access_{get*,put}
Let's simply convert all the current callers towards direct
xe_pm_runtime access and remove this extra layer of indirection.

No functional change is expected with this patch since
xe_mem_access_get was already using the xe_pm_runtime_get_noresume
at this point.

v2: Convert all the current callers instead of a big refactor
at once.

v3: - Rebased
    - Squashed the GSC/HDCP
    - Added a new case: sriov_pf_policy
    - Improved commit message to highlight that
      there's no functional change in this patch.

Reviewed-by: Matthew Auld <matthew.auld@intel.com> #v2
Cc: Suraj Kandpal <suraj.kandpal@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240418143049.43231-1-rodrigo.vivi@intel.com
2024-04-22 09:03:09 -04:00
Matt Roper
62422b7be4 drm/xe: Define all possible engines in media IP descriptors
Rather than trying to identify exactly which engines are available on
each platform in the IP descriptor, just include the list of all media
engines that the IP could theoretically support (i.e., 8 VCS + 4 VECS).
We still rely on the media fuse registers to tell us which specific
engine instances are actually present on a given platform, so there
shouldn't be any functional change.  This will help prevent mistakes
with engine numbering (for example ambiguity about whether the 2nd VCS
engine on a platform with exactly two engines is numbered "VCS1" or
"VCS2") and will also future-proof the code a bit more in case new SKUs
or platform refreshes extend the engine list in the future.

Note that the media fuse register technically has an 8-bit field for
VECS engine presence starting on Xe2.  However there's still no MMIO
register range reserved for VE engines above VECS3, so VE0-VE3 is still
consider the "maximum" VE engine mask that the driver can support for
now.

Bspec: 52614, 52615, 62567
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417152621.3357990-2-matthew.d.roper@intel.com
2024-04-19 07:26:28 -07:00
Rodrigo Vivi
7af6b11626 drm/i915: Convert intel_runtime_pm_get_noresume towards raw wakeref
In the past, the noresume function was used by the GEM code to ensure
wakelocks were held and bump its usage. This is no longer the case
and this function was totally unused until it started to be used again
by display with commit 77e619a82fc3 ("drm/i915/display: convert inner
wakeref get towards get_if_in_use")

However, on the display code, most of the callers are using the
raw wakeref, rather then the wakelock version. What caused a
major regression caught by CI.

Another option to this patch is to go with the original plan and
use the get_if_in_use variant in the display code, what is enough
to fulfil our needs. Then, an extra patch to delete the unused
_noresume variant.

v2: Keep grabbing wakelock but only assert for wakeref. (Imre)

Cc: Imre Deak <imre.deak@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Fixes: 77e619a82fc3 ("drm/i915/display: convert inner wakeref get towards get_if_in_use")
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10875
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240418223756.68427-1-rodrigo.vivi@intel.com
2024-04-19 11:27:14 +03:00
Ashutosh Dixit
5bc9de065b drm/i915/hwmon: Get rid of devm
When both hwmon and hwmon drvdata (on which hwmon depends) are device
managed resources, the expectation, on device unbind, is that hwmon will be
released before drvdata. However, in i915 there are two separate code
paths, which both release either drvdata or hwmon and either can be
released before the other. These code paths (for device unbind) are as
follows (see also the bug referenced below):

Call Trace:
release_nodes+0x11/0x70
devres_release_group+0xb2/0x110
component_unbind_all+0x8d/0xa0
component_del+0xa5/0x140
intel_pxp_tee_component_fini+0x29/0x40 [i915]
intel_pxp_fini+0x33/0x80 [i915]
i915_driver_remove+0x4c/0x120 [i915]
i915_pci_remove+0x19/0x30 [i915]
pci_device_remove+0x32/0xa0
device_release_driver_internal+0x19c/0x200
unbind_store+0x9c/0xb0

and

Call Trace:
release_nodes+0x11/0x70
devres_release_all+0x8a/0xc0
device_unbind_cleanup+0x9/0x70
device_release_driver_internal+0x1c1/0x200
unbind_store+0x9c/0xb0

This means that in i915, if use devm, we cannot gurantee that hwmon will
always be released before drvdata. Which means that we have a uaf if hwmon
sysfs is accessed when drvdata has been released but hwmon hasn't.

The only way out of this seems to be do get rid of devm_ and release/free
everything explicitly during device unbind.

v2: Change commit message and other minor code changes
v3: Cleanup from i915_hwmon_register on error (Armin Wolf)
v4: Eliminate potential static analyzer warning (Rodrigo)
    Eliminate fetch_and_zero (Jani)
v5: Restore previous logic for ddat_gt->hwmon_dev error return (Andi)

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10366
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417145646.793223-1-ashutosh.dixit@intel.com
2024-04-18 17:59:31 -07:00
Himal Prasad Ghimiray
c086bfc6ff drm/xe/pm: Capture errors and handle them
xe_pm_init may encounter failures for various reasons, such as a failure
in initializing drmm_mutex, or when dealing with a d3cold-capable device
for vram_threshold sysfs creation and setting default threshold.
Presently, all these potential failures are disregarded.

Move d3cold.lock initialization to xe_pm_init_early and cause driver
abort if mutex initialization has failed.

For xe_pm_init failures cleanup the driver and return error code

-v2
Make mutex init cleaner (Lucas)

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-8-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:30:24 -07:00
Himal Prasad Ghimiray
e3d0839aa5 drm/xe/tile: Abort driver load for sysfs creation failure
Ensure that the status of all tile associated sysfs entries creation is
relayed to xe_tile_init_noalloc, leading to a driver load abort if any
sysfs creation failures occur.

-v2
Avoid unnecessary warn/error messages. (Lucas)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-7-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:30:17 -07:00
Himal Prasad Ghimiray
9c3f72a342 drm/xe/gt: Abort driver load for sysfs creation failure
Instead of allowing the driver to load with incomplete sysfs entries in
case of sysfs creation failure, we should terminate the driver loading.
This change ensures that the status of all gt associated sysfs entries
creation is relayed to xe_gt_init, leading to a driver load abort if any
sysfs creation failures occur.

-v2
use err_force_wake label instead of new. (Lucas)
Avoid unnecessary warn/error messages. (Lucas)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-6-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:26:34 -07:00
Himal Prasad Ghimiray
6e40f142c5 drm/xe: Return NULL in case of drmm_add_action_or_reset failure
In case of drmm_add_action_or_reset failure return NULL and no need
to print warning messages as they will be printed implictly.

Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-5-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:26:34 -07:00
Himal Prasad Ghimiray
22bf0bc04d drm/xe: call free_gsc_pkt only once on action add failure
The drmm_add_action_or_reset function automatically invokes the
action (free_gsc_pkt) in the event of a failure; therefore, there's no
necessity to call it within the return check.

-v2
Fix commit message. (Lucas)

Fixes: d8b1571312b7 ("drm/xe/huc: HuC authentication via GSC")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-4-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:26:34 -07:00
Himal Prasad Ghimiray
a99641e387 drm/xe: Remove sysfs only once on action add failure
The drmm_add_action_or_reset function automatically invokes the action
(sysfs removal) in the event of a failure; therefore, there's no
necessity to call it within the return check.

Modify the return type of xe_gt_ccs_mode_sysfs_init to int, allowing the
caller to pass errors up the call chain. Should sysfs creation or
drmm_add_action_or_reset fail, error propagation will prompt a driver
load abort.

-v2
Edit commit message (Nikula/Lucas)
use err_force_wake label instead of new. (Lucas)
Avoid unnecessary warn/error messages. (Lucas)

Fixes: f3bc5bb4d53d ("drm/xe: Allow userspace to configure CCS mode")
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-3-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:26:34 -07:00
Himal Prasad Ghimiray
5a73dd61a0 drm/xe: Simplify function return using drmm_add_action_or_reset()
Instead of assigning the value of drmm_add_action_or_reset() to err and
returning err in case of failure and 0 in case of success, simply return
the result of drmm_add_action_or_reset().

-v2:
cleanup in xe_display too.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412181211.1155732-2-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-18 13:26:34 -07:00
Gustavo Sousa
cba22c911c drm/xe/xe2lpg: Extend Wa_14020338487
Wa_14020338487 also applies to Xe2_LPG. Replicate the existing entry to
one specific for Xe2_LPG.

Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417212501.312346-1-gustavo.sousa@intel.com
2024-04-18 11:01:56 -07:00
Rodrigo Vivi
f9116f658a
drm/xe: Add outer runtime_pm protection to xe_live_ktest@xe_dma_buf
Any kunit doing any memory access should get their own runtime_pm
outer references since they don't use the standard driver API
entries. In special this dma_buf from the same driver.

Found by pre-merge CI on adding WARN calls for unprotected
inner callers:

<6> [318.639739]     # xe_dma_buf_kunit: running xe_test_dmabuf_import_same_driver
<4> [318.639957] ------------[ cut here ]------------
<4> [318.639967] xe 0000:4d:00.0: Missing outer runtime PM protection
<4> [318.640049] WARNING: CPU: 117 PID: 3832 at drivers/gpu/drm/xe/xe_pm.c:533 xe_pm_runtime_get_noresume+0x48/0x60 [xe]

Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-10-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:40 -04:00
Rodrigo Vivi
e1feade077
drm/xe: Ensure all the inner access are using the _noresume variant
At this point mem_access references should be only used as inner
points of the execution and a get with synchronous resume previously
called at an outer point.

So, before killing mem_acces in favor of direct accsess, let's
ensure that we first convert them towards the new _noresume
variant that will WARN us if no inner caller happened.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-9-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:40 -04:00
Rodrigo Vivi
16b57c90bb
drm/xe: Convert mem_access_if_ongoing to direct xe_pm_runtime_get_if_active
Now that assert_mem_access is relying directly on the pm_runtime state
instead of the counters, there's no reason why we cannot use
the pm_runtime functions directly.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-8-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:40 -04:00
Rodrigo Vivi
a382291017
drm/xe: Removing extra mem_access protection from runtime pm
This is not needed any longer, now that we have all the protection
in place with the runtime pm itself.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-7-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:40 -04:00
Rodrigo Vivi
fdea94a4c2
drm/xe: Convert xe_gem_fault to use direct xe_pm_runtime calls
The gem page fault is one of the outer bound protections where
we want to ensure that the hardware is in D0 before proceeding
with memory access. Let's convert it towards the xe_pm_runtime
functions directly so we can then convert the mem_access to be
inner protection only and then Kill it for good.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-6-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:40 -04:00
Rodrigo Vivi
152c37bf40
drm/xe: Remove useless mem_access during probe
xe_pm_init is the very last thing during the xe_pci_probe(),
hence these protections are useless from the point of view
of ensuring that the device is awake.

Let's remove it so we continue towards the goal of killing
xe_device_mem_access.

v2: Adding more cases
v3: Provide a separate fix for xe_tile_init_noalloc return (Matt)
    Adding a new case where display HDCP init calls which
    are also called at display probe time.

Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-5-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:39 -04:00
Rodrigo Vivi
8ae84a2744
drm/xe: Move lockdep protection from mem_access to xe_pm_runtime
The mem_access itself is not holding any lock, but attempting
to train lockdep with possible scarring locks happening during
runtime pm. We are going soon to kill the mem_access get and put
helpers in favor of direct xe_pm_runtime calls, so let's just
move this lock around to where it now belongs.

v2: s/lockdep_training/lockdep_prime (Matt Auld)

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-4-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:39 -04:00
Rodrigo Vivi
77e619a82f
drm/i915/display: convert inner wakeref get towards get_if_in_use
This patch brings no functional change. Since at this point of
the code we are already asserting a wakeref was held, it means
that we are with runtime_pm 'in_use' and in practical terms we
are only bumping the pm_runtime usage counter and moving on.

However, xe driver has a lockdep annotation that warned us that
if a sync resume was actually called at this point, we could have
a deadlock because we are inside the power_domains->lock locked
area and the resume would call the irq_reset, which would also
try to get the power_domains->lock.

For this reason, let's convert this call to a safer option and
calm lockdep on.

v2: use _noresume variant instead of get_in_use (Ville, Imre)

Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Imre Deak <imre.deak@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:39 -04:00
Rodrigo Vivi
82e279a49a
drm/xe: Introduce intel_runtime_pm_get_noresume at compat-i915-headers for display
The i915-display will start using the intel_runtime_pm_noresume.
So we need to add the compat header before it.

Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-2-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:39 -04:00
Rodrigo Vivi
cbb6a7413b
drm/xe: Introduce xe_pm_runtime_get_noresume for inner callers
Let's ensure that we have an option for inner callers that will
raise WARN if device is not active and not protected by outer callers.

Make this also a void function forcing every caller to unconditionally
put the reference back afterwards.

This will be very important for cases where we want to hold the
reference before scheduling a work in a queue. Then the work job
will be responsible for putting it back.

While at this, already convert a case from mem_access_get_ongoing where
it is not checking for the reference and put it back, what would
cause the underflow.

v2: Fix identation.
v3: Convert equivalent missing put from mem_access towards pm_runtime.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417203952.25503-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-18 08:31:39 -04:00
Vinay Belgaumkar
2817a1f1bf drm/xe/lnl: Apply GuC Wa_13011645652
Enable WA for a bug that could cause the C6 state machine to hang
during RC6 exit.

v2: Add comment clarifying the WA (John H)
v3: Add more details to the comment (John H)

Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417054802.1766359-1-vinay.belgaumkar@intel.com
2024-04-17 15:21:12 -07:00
Matthew Auld
8eae42f175 drm/xe/vm: don't include xe_gt.h
clangd complains here, since nothing in xe_gt.h seems to be needed.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412113144.259426-6-matthew.auld@intel.com
2024-04-17 13:38:32 +01:00
Matthew Auld
5b259c0d1d drm/xe/vm: drop vm->destroy_work
Now that we no longer grab the usm.lock mutex (which might sleep) it
looks like it should be safe to directly perform xe_vm_free when vm
refcount reaches zero, instead of punting that off to some worker.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412113144.259426-5-matthew.auld@intel.com
2024-04-17 13:38:32 +01:00
Matthew Auld
83967c5732 drm/xe/vm: prevent UAF with asid based lookup
The asid is only erased from the xarray when the vm refcount reaches
zero, however this leads to potential UAF since the xe_vm_get() only
works on a vm with refcount != 0. Since the asid is allocated in the vm
create ioctl, rather erase it when closing the vm, prior to dropping the
potential last ref. This should also work when user closes driver fd
without explicit vm destroy.

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1594
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412113144.259426-4-matthew.auld@intel.com
2024-04-17 13:38:11 +01:00
Matthew Auld
48b1f11c95 drm/xe/stolen: ignore first page for FBC
We have observed underruns on some platforms if the CFB offset is within
the first page of stolen. Just like i915 skip the first page.

v2 (Maarten)
  - Also align the start.

BSpec: 50214
Reported-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412150301.273344-4-matthew.auld@intel.com
2024-04-17 13:09:47 +01:00
Matthew Auld
9890821f3e drm/xe/stolen: lower the default alignment
No need to be so aggressive here. The upper layers will already apply
the needed alignment, plus some allocations might wish to skip it. Main
issue is that we might want to have start/end bias range which doesn't
match the default alignment which is rejected by the allocator.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240412150301.273344-3-matthew.auld@intel.com
2024-04-17 13:09:46 +01:00
Lu Yao
67a9e86dc1 drm/xe: select X86_PLATFORM_DEVICES when ACPI_WMI is selected
ACPI_WMI is a subitem of X86_PLATFORM_DEVICES. And X86_PLATFORM_DEVICES
is not selected in the current Kconfig, and may cause Kconfig warnings:

WARNING: unmet direct dependencies detected for ACPI_WMI
  Depends on [n]: X86_PLATFORM_DEVICES [=n] && ACPI [=y]
  Selected by [m]:
  - DRM_XE [=m] && HAS_IOMEM [=y] && DRM [=m] && PCI [=y] && MMU [=y] &&
    (m && MODULES [=y] || y && KUNIT [=y]=y) && X86 [=y] && ACPI [=y]

Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240415025215.15811-1-yaolu@kylinos.cn
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-16 11:52:12 -07:00
John Harrison
09700beeba drm/xe/bmg: Some LNL workarounds also apply to BMG
Enable a couple of existing workarounds for a new platform.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240410002646.3002394-3-John.C.Harrison@Intel.com
2024-04-16 10:50:37 -07:00