drm/doc: Document DRM device reset expectations
Create a section that specifies how to deal with DRM device resets for kernel
and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Acked-by: Sebastian Wick <sebastian.wick@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230929092509.42042-1-andrealmeid@igalia.com
parent 988d0ff29e
commit db0f246c39
@@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
mmapped regular files. Threads cause additional pain with signal
handling as well.

Device reset
============

The GPU stack is complex and prone to errors, from hardware bugs to faulty
applications and everything in between across its many layers. Some errors
require resetting the device in order to make the device usable again. This
section describes the expectations for DRM and usermode drivers when a
device resets and how to propagate the reset status.

Device resets cannot be disabled without tainting the kernel, because a disabled
reset can hang the entire kernel through shrinkers/mmu_notifiers. Userspace's
role in device resets is to propagate the message to the application and apply
any special policy for blocking guilty applications, if any. A corollary is that
debugging a hung GPU context requires hardware support to be able to preempt
such a GPU context while it's stopped.

Kernel Mode Driver
------------------

The KMD is responsible for checking if the device needs a reset, and for
performing it as needed. Usually a hang is detected when a job gets stuck
executing. KMD should keep track of resets, because userspace can query the
reset status for a specific context at any time. This is needed to propagate to
the rest of the stack that a reset has happened. Currently, this is implemented
by each driver separately, with no common DRM interface. Ideally this should be
properly integrated into the DRM scheduler to provide a common ground for all
drivers. After a reset, KMD should reject new command submissions for affected
contexts.

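The bookkeeping described above could look roughly like the following. This is
a minimal userspace sketch, not code from any driver: ``struct sketch_ctx``, the
``sketch_*`` helpers and the choice of ``-ECANCELED`` as the rejection error are
all assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context bookkeeping; not a real DRM structure. */
struct sketch_ctx {
	uint32_t reset_count;	/* resets that affected this context */
	bool guilty;		/* a job from this context caused a hang */
	bool banned;		/* new submissions are rejected */
};

/* Called by the (hypothetical) hang handler once the device was reset. */
static void sketch_note_reset(struct sketch_ctx *ctx, bool guilty)
{
	ctx->reset_count++;
	ctx->guilty = ctx->guilty || guilty;
	ctx->banned = true;
}

/* Submission path: reject new work for contexts affected by a reset. */
static int sketch_submit(const struct sketch_ctx *ctx)
{
	if (ctx->banned)
		return -ECANCELED;	/* assumed error code, driver specific */
	return 0;
}

/* Query path: userspace may ask for the reset status at any time. */
static uint32_t sketch_query_resets(const struct sketch_ctx *ctx)
{
	return ctx->reset_count;
}
```

Keeping the counter per context rather than per device lets the query report
only the resets that actually affected the caller.
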
User Mode Driver
----------------

After command submission, UMD should check if the submission was accepted or
rejected. After a reset, KMD should reject submissions, and UMD can issue an
ioctl to the KMD to check the reset status. UMD may also check this status more
often if it needs to. After detecting a reset, UMD will then proceed to report
it to the application using the appropriate API error code, as explained in the
section below about robustness.

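As a rough illustration of that flow, the sketch below classifies the outcome
of one submission. It assumes the UMD already has the return value of its
driver-specific submit ioctl and a reset counter obtained from a
driver-specific query ioctl; the enum and the function name are hypothetical.

```c
#include <errno.h>

/* Hypothetical results a UMD could translate into API error codes. */
enum sketch_api_result {
	SKETCH_OK,
	SKETCH_CONTEXT_LOST,
	SKETCH_OTHER_ERROR,
};

/*
 * submit_ret:  return value of the submit ioctl (0 or a negative errno).
 * resets_now:  reset counter just read back from the KMD for this context.
 * resets_seen: counter value the UMD has already reported to the app.
 */
static enum sketch_api_result
sketch_classify_submit(int submit_ret, unsigned int resets_now,
		       unsigned int *resets_seen)
{
	if (submit_ret == 0)
		return SKETCH_OK;

	/* Submission rejected: check whether a new reset is the reason. */
	if (resets_now != *resets_seen) {
		*resets_seen = resets_now;
		return SKETCH_CONTEXT_LOST;
	}

	return SKETCH_OTHER_ERROR;
}
```

Remembering the last reported counter value keeps the UMD from reporting the
same reset to the application twice.
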
Robustness
----------

The only way to try to keep a graphical API context working after a reset is for
the application to comply with the robustness aspects of the graphical API that
it is using.

Graphical APIs provide ways for applications to deal with device resets.
However, there is no guarantee that the app will use such features correctly,
and a userspace that doesn't support robust interfaces (like a non-robust
OpenGL context or an API without any robustness support, like libva) leaves the
robustness handling entirely to the userspace driver. There is no strong
community consensus on what the userspace driver should do in that case,
since all reasonable approaches have some clear downsides.

OpenGL
~~~~~~

Apps using OpenGL should use the available robust interfaces, like the
extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
interface tells whether a reset has happened; if so, all the context state is
considered lost and the app proceeds by creating a new context. There's no
consensus on what to do if robustness is not in use.

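A minimal sketch of how an app might interpret the reset status follows. It
assumes the value comes from ``glGetGraphicsResetStatusARB()`` on a context
created with reset notification enabled; the ``SKETCH_`` constants repeat the
enum values from ``GL_ARB_robustness`` so the example builds without GL headers,
and the helper name is hypothetical.

```c
#include <stdbool.h>

/* Enum values as defined by GL_ARB_robustness, repeated for the sketch. */
#define SKETCH_GL_NO_ERROR			0
#define SKETCH_GL_GUILTY_CONTEXT_RESET_ARB	0x8253
#define SKETCH_GL_INNOCENT_CONTEXT_RESET_ARB	0x8254
#define SKETCH_GL_UNKNOWN_CONTEXT_RESET_ARB	0x8255

/*
 * Hypothetical helper: status is the value returned by
 * glGetGraphicsResetStatusARB(). Returns true if the context was lost.
 */
static bool sketch_gl_context_lost(unsigned int status)
{
	switch (status) {
	case SKETCH_GL_GUILTY_CONTEXT_RESET_ARB:	/* this context caused it */
	case SKETCH_GL_INNOCENT_CONTEXT_RESET_ARB:	/* another context did */
	case SKETCH_GL_UNKNOWN_CONTEXT_RESET_ARB:	/* cause not known */
		return true;
	case SKETCH_GL_NO_ERROR:			/* no reset detected */
	default:					/* not handled by this sketch */
		return false;
	}
}
```

When the helper returns true, the app would destroy the context and create a
new one, as described above.
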
Vulkan
~~~~~~

Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` on submissions.
This error code means, among other things, that a device reset has happened and
that the app needs to recreate its contexts to keep going.

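The check itself is small. The sketch below assumes the app passes in the
result of a submission call such as ``vkQueueSubmit()``; the ``SKETCH_VK_``
constants mirror the numeric values of ``VK_SUCCESS`` and
``VK_ERROR_DEVICE_LOST`` so the example builds without Vulkan headers, and the
helper name is hypothetical.

```c
#include <stdbool.h>

/* Numeric values of the corresponding VkResult codes. */
#define SKETCH_VK_SUCCESS		0
#define SKETCH_VK_ERROR_DEVICE_LOST	(-4)

/*
 * Hypothetical helper: result is the return value of a submission call.
 * Returns true if the app has to tear down and recreate its device state.
 */
static bool sketch_vk_needs_recreate(int result)
{
	return result == SKETCH_VK_ERROR_DEVICE_LOST;
}
```

Other error codes are left to the app's normal error handling, since they do
not indicate a lost device.
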
Reporting causes of resets
--------------------------

Apart from propagating the reset through the stack so apps can recover, it's
really useful for driver developers to learn more about what caused the reset in
the first place. DRM devices should make use of devcoredump to store relevant
information about the reset, so this information can be added to user bug
reports.

.. _drm_driver_ioctl:

IOCTL Support on Device Nodes