123 Commits

Author SHA1 Message Date
Oded Gabbay
259cee1c24 habanalabs: eliminate aggregate use warning
When doing sizeof() and giving as argument a dereference of
a pointer-to-a-pointer object, clang will issue a warning.

Eliminate the warning by passing struct <name>*

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20 15:52:27 +03:00
Dani Liberman
97a78e3d8e habanalabs: rename error info structure
As a preparation for adding more errors to it,
change to more suitable name.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Ofir Bitton
c38f72370b habanalabs: perform context switch flow only if needed
Except Goya, none of our ASICs require context switch flow, hence we
enable this flow only where it is needed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Tal Cohen
194e515c79 habanalabs/gaudi2: new API to control engine cores running mode
The current flow of halting the engine cores is implemented by command
buffers built by the user space and sent towards the Driver.

This current flow is broken since the user space does not know when
the cores actually halt as sending a workload is async op.

Therefore the application can not free the memory that is mapped
to the engine cores.

This new API allows the user space to control the running mode. The
API call is sync (returns after the cores are set to the
requested mode).

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Tal Cohen
0c876b47a5 habanalabs: fix command submission sanity check
When a CS is submitted, the ioctl handler checks the CS
flags and performs a sanity check, according to its value.
As new CS flags are added, the sanity check needs to be updated
according to the new flags.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Oded Gabbay
cf008f5acb habanalabs: make sure variable is set before used
timestamp could be unset in both _hl_interrupt_wait_ioctl() and
_hl_interrupt_wait_ioctl_user_addr() so it is better to explicitly
initialize it to 0 when declaring it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Oded Gabbay
f2d9ec872c habanalabs: don't declare tmp twice in same function
tmp is declared in the scope of the function cs_do_release() and
inside a block inside that function.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Ofir Bitton
d6a66d5960 habanalabs: add support for common decoder interrupts
User application should be able to get notification for any decoder
completion. Hence, we introduce a new interface in which a user
can wait for all current decoder pending interrupts.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Ofir Bitton
1a6609cdd4 habanalabs: naming refactor of user interrupt flow
Current naming convention can be misleading. Hence renaming some
variables and defines in order to be more explicit.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Oded Gabbay
f73c637645 habanalabs: add gaudi2 wait-for-CS support
In Gaudi2 we moved to a different wait for command submission
completion model. Instead of receiving interrupt only on external
queues, we use the device's sync manager to notify us when the
entire command submission finishes.

This enables us to remove the categorization of queues to external
and internal, and treat each queue equally, without the need to parse
and patch any command buffer.

This change also requires refactoring to the IRQ handling of
CS completions.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Oded Gabbay
d7bb1ac89b habanalabs: add gaudi2 asic-specific code
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the
function callbacks that the driver's common code need to initialize,
finalize and submit workloads to the Gaudi2 ASIC.

It also contains the code to initialize the F/W of the Gaudi2 ASIC
and to receive events from the F/W.

It contains new debugfs entry to dump razwi events. razwi is a case
where the device's engines create a transaction that reaches an
invalid destination.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
b63539a6fa habanalabs: print pointer with correct modifier
Use %p instead of %llx for printing pointers.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:24 +03:00
Oded Gabbay
abe85a9c11 habanalabs: check fence pointer before use
fence pointer can be NULL in this path, as shown by an earlier check.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:23 +03:00
Yuri Nudelman
a189977701 habanalabs: fix NULL dereference on cs timeout
Device descriptor is accessed before an assignment

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:23 +03:00
farah kassabri
d64a29af12 habanalabs: add validity check for cq counter offset
Driver performs no validity check for the user cq counter offset
used in both wait_for_interrupt and register_for_timestamp APIs.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:23 +03:00
Tal Cohen
fa9deaca2f habanalabs: send an event notification when CS timeout occurs
The Driver needs to inform the User process whenever one of its
CS is timed out. The Driver shall recognize the CS timeout and shall
send an eventfd notification, towards user space, whenever a timeout
is expired on a CS.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:22 +03:00
Yuri Nudelman
2bc61bc4f3 habanalabs: keep a record of completed CS outcomes
Often, the user is not interested in the completion timestamp of all
command submissions.
A common situation is, for example, when the user submits a burst of,
possibly, several thousands of commands, then request the completion
timestamp of only couple of specific key commands from all the burst.
The problem is that currently, the outcome of the early commands may be
lost, due to a large amount of later commands, that the user does not
really care about.

This patch creates a separate store with the outcomes of commands the
user has mark explicitly as interested in. This store does not mix the
marked commands with the unmarked ones, hence the data there will
survive for much longer.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:22 +03:00
Tal Cohen
d0c92afc0e habanalabs: change the write flag name of error info structs
positive flags naming will make more clear code while adding
more 'error info' structures

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:21 +03:00
Tal Cohen
90de680526 habanalabs: use separate structure info for each error collect data
Create separate info structure for each error type.
The structures shall be used inside the large structure that contains
the last session error.
This is more scalable for adding more errors in the future.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Ohad Sharabi
e31dd9362f habanalabs: remove hdev from hl_ctx_get args
This argument is unused by the function.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Yuri Nudelman
829ec038c9 habanalabs: use unified memory manager for CB flow
With the new code required for the flow added, we can now switch
to using the new memory manager infrastructure, removing the old code.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Yuri Nudelman
4e63ce6af6 habanalabs: hide memory manager page shift
The new unified memory manager uses page offset to pass buffer handle
during the mmap operation. One problem with this approach is that it
requires the handle to always be divisible by the page size, else, the
user would not be able to pass it correctly as an argument to the mmap
system call.

Previously, this was achieved by shifting the handle left after alloc
operation, and shifting it right before get operation. This was done in
the user code. This creates code duplication, and, what's worse,
requires some knowledge from the user regarding the handle internal
structure, hurting the encapsulation.

This patch encloses all the page shifts inside memory manager functions.
This way, the user can take the handle as a black box, and simply use
it, without any concert about how it actually works.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Rajaravi Krishna Katta
b31848430f habanalabs: fix comments according to kernel-doc
Incorrect/Missing doxygen tag

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Yuri Nudelman
4d530e7d12 habanalabs: convert ts to use unified memory manager
With the introduction of the unified memory manager infrastructure, the
timestamp buffers can be converted to use it.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Jakob Koschel
9138c24244 habanalabs: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].

This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
farah kassabri
a78b07dcae habanalabs: Fix reset upon device release bug
In case user application was interrupted while some cs still in-flight
or in the middle of completion handling in driver, the
last refcount of the kernel private data for the user process
will not be put in the fd close flow, but in the cs completion
workqueue context.

This means that the device reset-upon-device-release will be called
from that context. During the reset flow, the driver flushes all the cs
workqueue to ensure that any scheduled work has run to completion,
and since we are running from the completion context we will
have deadlock.

Therefore, we need to skip flushing the workqueue in those cases.
It is safe to do it because the user won't be able to release the device
unless the workqueues are already empty.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:06 +02:00
farah kassabri
9158bf69e7 habanalabs: Timestamps buffers registration
Timestamp registration API allows the user to register
a timestamp record event which will make the driver set
timestamp when CQ counter reaches the target value
and write it to a specific location specified
by the user.
This is a non blocking API, unlike the wait_for_interrupt
which is a blocking one.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Dani Liberman
b32cd10480 habanalabs: fix race when waiting on encaps signal
Scenario:
1. CS which is part of encaps signal has been completed and now
executing kref_put to its encaps signal handle. The refcount of the
handle decremented to 0, and called the encaps signal handle
release function - hl_encaps_handle_do_release.

2. At this point the user starts waiting on the signal, and finds the
encaps signal handle in the handlers list and increment the habdle
refcount to 1.

3. Immediately after, hl_encaps_handle_do_release removed the handle
from the list and free its memory.

4. Wait function using the handle although it has been freed.

This scenario caused the slab area which was previously allocated
for the handle to be poison overwritten which triggered kernel bug
the next time the OS needed to allocate this slab.

Fixed by getting the refcount of the handle only in case it is not
zero.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay
d2cfd6897c habanalabs: remove duplicate print
We print detailed messages inside the internal ioctl functions. No need
to print a generic message at the end, it doesn't add any information.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay
7a78d4d481 habanalabs: fix race between wait and irq
There is a race in the user interrupts code, where between checking
the target value and adding the new pend to the list, there is a chance
the interrupt happened.

In that case, no one will complete the node, and we will get a timeout
on it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay
54faa5607b habanalabs: fix user interrupt wait when timeout is 0
When timeout is 0, we need to return the busy status in case the
target value wasn't reached upon entry to the ioctl.

Also return the correct timestamp.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay
e24a62cb68 habanalabs: there is no kernel TDR in future ASICs
In future ASICs, there is no kernel TDR for new workloads that are
submitted directly from user-space to the device.

Therefore, the driver can NEVER know that a workload has timed-out.

So, when the user asks us to wait for interrupt on the workload's
completion, and the wait has timed-out, it doesn't mean the workload
has timed-out. It only means the wait has timed-out, which is NOT an
error from driver's perspective.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Tomer Tayar
aff5d9d378 habanalabs: check the return value of hl_cs_poll_fences()
As part of handling of the multi-CS wait ioctl, hl_cs_poll_fences() is
called in a "while (true)" loop. This function can fail, but the
checking of its return value was missed.
Add this check and exit the loop in case of a failure.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Ofir Bitton
eb13529191 habanalabs: refactor reset information variables
Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:41:28 +02:00
Ohad Sharabi
60bf3bfb5a habanalabs: handle skip multi-CS if handling not done
This patch fixes issue in which we have timeout for multi-CS although
the CS in the list actually completed.

Example scenario (the two threads marked as WAIT for the thread that
handles the wait_for_multi_cs and CMPL as the thread that signal
completion for both CS and multi-CS):
1. Submit CS with sequence X
2. [WAIT]: call wait_for_multi_cs with single CS X
3. [CMPL]: CS X do invoke complete_all for both CS and multi-CS
           (multi_cs_completion_done still false)
4. [WAIT]: enter poll_fences, reinit the completion and find the CS
           as completed when asking on the fence but multi_cs_done is
	   still false it returns that no CS actually completed
5. [CMPL]: set multi_cs_handling_done as true
6. [WAIT]: wait for completion but no CS to awake the wait context
           and hence wait till timeout

Solution: if CS detected as completed in poll_fences but multi_cs_done
          is still false invoke complete_all to the multi-CS completion
	  and so it will not go to sleep in wait_for_completion but
	  rather will have a "second chance" to wait for
	  multi_cs_completion_done.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:40:25 +02:00
farah kassabri
b9d31cada7 habanalabs: change wait_for_interrupt implementation
Currently the cq counters are allocated in userspace memory,
and mapped by the driver to the device address space.

A new requirement that is part of new future API related to this one,
requires that cq counters will be allocated in kernel memory.

We leverage the existing cb_create API with KERNEL_MAPPED flag set to
allocate this memory.

That way we gain two things:
1. The memory cannot be freed while in use since it's protected
by refcount in driver.

2. No need to wake up the user thread upon each interrupt from CQ,
because the kernel has direct access to the counter. Therefore,
it can make comparison with the target value in the interrupt
handler and wake up the user thread only if the counter reaches the
target value. This is instead of waking the thread up to copy counter
value from user then go sleep again if target value wasn't reached.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ohad Sharabi
e2558f0f84 habanalabs: prevent wait if CS in multi-CS list completed
By the original design we assumed that if we "miss" multi CS completion
it is of no severe consequence as we'll just call wait_for_multi_cs
again.

Sequence of events for such scenario:
1. user submit CS with sequence N
2. user calls wait for multi-CS with only CS #N in the list
3. the multi CS call starts with poll of the CSs but find that none
   completed (while CS #N did not completed yet)
4. now, multi CS #N complete but multi CS CTX was not yet created for
   the above multi-CS. so, attempt to complete multi-CS fails (as no
   multi CS CTX exist)
5. wait_for_multi_cs call now does init_wait_multi_cs_completion (and
   for this create the multi-CS CTX)
6. wait_for_multi_cs wits on completion but will not get one as CS #N
   already completed

To fix the issue we initialize the multi-CS CTX prior polling the
fences.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ohad Sharabi
b02220536c habanalabs: wait again for multi-CS if no CS completed
The original multi-CS design assumption that stream masters are used
exclusively (i.e. multi-CS with set of stream master QIDs will not get
completed by CS not from the multi-CS set) is inaccurate.

Thus multi-CS behavior is now modified not to treat such case as an
error.

Instead, if we have multi-CS completion but we detect that no CS from
the list is actually completed we will do another multi-CS wait (with
modified timeout).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay
357ff3dc9a habanalabs: save ctx inside encaps signal
Compute context pointer in hdev shouldn't be used for fetching the
context's pointer.

If an object needs the context's pointer, it should get it while
incrementing its kref, and when the object is released, put it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay
fee187fe46 habanalabs: free signal handle on failure
Fix a bug where in case of failure to allocate idr, the handle's
memory wasn't freed as part of the error handling code.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Dani Liberman
2487f4a281 habanalabs: enable access to info ioctl during hard reset
Because info ioctl is used to retrieve data, some of its opcodes may be
used during hard reset.
Other ioctls should be blocked while device is not operational.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman
1880f7acd7 habanalabs: add SOB information to signal submission uAPI
For debug purpose, add SOB address and SOB initial counter value
before current submission to uAPI output.

Using SOB address and initial counter, user can calculate how much of
the submmision has been completed.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman
3beaf903a3 habanalabs: fix race condition in multi CS completion
Race example scenario:
1. User have 2 threads that waits on multi CS:
   - thread_0 waits on QID 0 and uses multi CS context 0.
   - thread_1 waits on QID 1 and uses multi CS context 1.
2. thread_1 got completion and release multi CS context 1.
3. CS related to multi CS of thread_0 starts executing
   complete_multi_cs function, the first iteration of the loop
   completes the multi CS of thread_0, hence multi CS context 0
   is released.
4. thread_1 waits on QID 1 and uses multi CS context 0.
5. thread_0 waits on QID 0 and uses multi CS context 1.
6. The second iterattion of the loop (from step 3) starts, which
   means, start checking multi CS context 1:
   - multi CS contetxt is being used by thread_0 waiting on QID 0.
   - The fence of the CS (still CS from step 3) has QID map the same
     as the multi CS context 1.
   - multi CS context 1 (thread_0) gets completion on CS that triggered
     already thread_0 (with multi CS context 0) and is no longer
     being waited on.

Fixed by exiting the loop in complete_multi_cs after getting completion

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman
3e55b5dbf9 habanalabs: add support for fetching historic errors
A new uAPI is added for debug purposes of the user-space to retrieve
errors related data from previous session (before device reset was
performed).

Inforamtion is filled when a razwi or CS timeout happens and can
contain one of the following:

1. Retrieve timestamp of last time the device was opened and razwi or
   CS timeout happened.
2. Retrieve information about last CS timeout.
3. Retrieve information about last razwi error.

This information doesn't contain user data, so no danger of data
leakage between users.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Bharat Jauhari
d4194f2140 habanalabs: refactor wait-for-user-interrupt function
Refactor the wait-for-user-interrupt routine to make it more
generic for re-use for other user exposed h/w interfaces in future
ASICs.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Dani Liberman
48f3116983 habanalabs: change wait for interrupt timeout to 64 bit
In order to increase maximum wait-for-interrupt timeout, change it
to 64 bit variable. This wait is used only by newer ASICs, so no
problem in changing this interface at this time.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Bharat Jauhari
234caa5273 habanalabs: rename reset flags
Rename reset flags for better readability as compared to
HL_RESET_CAUSE* enum shared with the f/w.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Bharat Jauhari
1388582264 habanalabs: handle abort scenario for user interrupt
In case of device reset, the driver does a force trigger on all waiting
users to release them from waiting. However, the driver does not handle
error scenario while waiting.

hl_interrupt_wait_ioctl() now exits the wait in case of an error with
abort status.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Dani Liberman
b2faac3887 habanalabs: refactor fence handling in hl_cs_poll_fences
To avoid checking if fence exists multipled times, changed fence
handling to depend only on the fence status field:

Busy, which means CS still did not completed :
	Add its QID so multi CS wait on its completion.
Finished, which means CS completed and fence exists:
	Raise its completion bit if it finished mcs handling and
	update if necessary the earliest timestamp.
Gone, which means CS already completed and fence deleted:
	Update multi CS data to ignore timestamp and raise its
	completion bit.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:48 +03:00
Yuri Nudelman
d2f5684b8f habanalabs: simplify wait for interrupt with timestamp flow
Remove the flag that determines whether to take a timestamp once the
interrupt arrives.
Instead, always take the timestamp once per interrupt.
This is a must for the user-space to measure its graph operations
to evaluate the graph computation time.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00