IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
To improve the user's ability to debug the case where a workload that
is part of executing training/inference of a topology is getting stuck,
we need to add a 'core dump' each time a CS times-out. The 'core dump'
shall contain all relevant Sync Manager information and corresponding
fence values.
The most recent dumps shall be accessible via debugfs, under
'state_dump' node. Reading from the node will provide the oldest dump
available. Writing an integer value X will discard X dumps, starting
with the oldest one, i.e. subsequent read will now return newer
dumps.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The previous function we used, find_get_pid(), wasn't good in case
the user process was run inside docker.
As a result, we didn't had the PID and we couldn't kill the user
process in case the device got stuck and we needed to reset the
device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Sometimes we may need to disable optimization of using huge pages
in our memory management code. Add such a flag to the function that
creates the list of physical pages that would be programmed into the
device MMU.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Increase the size variable in the userptr structure to 64-bit. That
variable describes the size of the memory allocation of the user that
is now being mapped into the device. The mapping can be larger than
4GB, so we need to support it.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Same as we handle it in the regular wait for CS, we need to handle the
case where the waiting for user interrupt was interrupted. In that case,
we need to return correct error code to the user.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In device fini there was missing a call to release all pending user
interrupts. That can cause a process to be stuck inside the driver's
IOCTL of wait for interrupts, in case the device is removed or
simulator is killed at the same time.
In addition, also call to remove inactive codec job was missing.
Moreover, to prevent such errors in the future (where code is added
to reset path but not to device fini), we moved some common parts
to two dedicated functions:
cleanup_resources
take_release_locks
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In case user interrupt arrived but the completion value is less than
the target value, we want to retry the wait.
However, before the retry we must reinitialize the completion object,
under spin-lock, so the wait function won't exit immediately because
the completion object is already completed (from the previous
interrupt).
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Update recent changes made in firmware header files, which contain
a minor COMMS protocol change and new error status definitions.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
A new user flag is required to make memory map hint mandatory, in
contrast to the current situation where it is best effort.
This is due to the requirement to map certain data to specific
pre-determined device virtual address ranges.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add support for pre-determined driver-reserved device VA address ranges.
This is needed for future ASIC support where some contents must be
mapped into these pre-determined ranges because the H/W will be
configured using these ranges.
In case the user asks to map a VA without a hint address, avoid
allocating the device VA from the reserved ranges.
Make sure the validation checks of the hint address take into account
situation where the DRAM page size is not pow of 2.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
There is code related to hard-reset, which is done in gaudi specific
code. However, this code can be used by future ASICs and therefore it
is better to move it to the common code section.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We add support for NIC DERR ECC error events, in case this error
is received a device reset will be performed.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In preparation for a new feature that allows the user to reserve
signals ahead of submissions, we need to change a current assumption
in the code.
Currently, the driver uses 2 SOBs to support signal CS. When the first
SOB reaches max value, the driver switches to the other one and assumes
that when it will need to switch back to the first one, all of the
signals have already been handled.
This assumption won't hold when the new feature will be added, because
using signal reservation, the driver can reach the max SOB value very
fast.
The change is to add a validity check when submitting a signal CS, to
make sure the previous SOB is available (all the signals attached to
it indeed finished).
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We introduce a new type of reset which is reset upon device release.
This reset is very similar to soft reset except the fact it is
performed only upon device release and not upon user sysfs request
nor TDR.
The purpose of this reset is to make sure the device is returned to
IDLE state after the current user has finished working with the device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
To be able to debug long-running CS better, without changing the
userspace code, we are adding a new option through debugfs interface
to skip the reset of the device in case of CS timeout.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently driver sends fc interrupt id to FW instead of using
cpu interrupt id. We intend to fix that and keep backward
compatibility by using the same interrupt values.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
There was a rogue #ifdef that crept into the upstream code for
backwards compatibility which isn't needed of course.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In case QMAN has an error and stop_on_err is true, print specific
information of the "offending" command buffer batch.
If the error occurred on one of the higher CPs, the CQ pointer and size
will be printed along with (up to) last 8 PQEs of the stream.
If the error occurred in the lower CP, the CQ pointer and size will be
printed along with (up to) last 8 PQEs of ALL upper CPs as we have no
way to know which upper CP sent the job there.
This is done so higher SW levels will be able to debug their CS by
extracting the raw data of the offending command buffer batch and
examine those offline to detect the issue.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In a system with multiple ASICs, there is a need to provide monitoring
tools with information on how long a device was opened and how many
times a device was opened.
Therefore, we add a new opcode to the INFO ioctl to provide that
information.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Update STMTCSR and STMSYNCR values in order to reduce amount of sync
packets
Signed-off-by: Tal Albo <talbo@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
fix the following smatch warnings:
goya_pin_memory_before_cs()
warn: '&userptr->job_node' not removed from list
gaudi_pin_memory_before_cs()
warn: '&userptr->job_node' not removed from list
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
pin_user_pages_fast() might fail and return a negative number, or pin
less pages than requested and return the number of the pages that were
pinned.
For the latter, it is informative to print also the memory size and the
number of requested pages.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it
must be undone by a corresponding 'pci_disable_pcie_error_reporting()'
call, as already done in the remove function.
Fixes: 2e5eda4681f9 ("habanalabs: PCIe Advanced Error Reporting support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Firmware in habanalabs devices is composed of several components.
During device initialization, we read these versions from the device.
Print them during device initialization to allow better visibility in
automated systems.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Hard reset flow on PLDM might take more than 2 minutes.
Hence add a dedicated hard reset timeout of 6 minutes for PLDM.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In current code, for dynamic f/w loading flow, DRAM scrambling is
enabled post Linux fit image is loaded to the card. This can cause the
device CPU to go into reset state.
The correct sequence should be:
1. Load boot fit image
2. Enable scrambling
3. Load Linux fit image
This commit aligns the DRAM scrambling enabling with the static f/w load
flow.
Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If there is an error in the QMAN/engine, there is no point of trying
to continue running the workload. It is better to stop to allow the
user to debug the program.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In case we have EQ fault we would like to know about it.
For this, a status bitmask was added in which EQ_FAULT bit is
set by FW in case of EQ fault.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
When there is an ECC error in the HBM, return a standard error code,
-EIO in this case, and not a positive value.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
When converting virtual address to physical we need to add correct
offset to the physical page.
For this we need to use mask that include ALL bits of page offset.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Get rid of the need to check if boot_dev_sts is valid on every access
to value read from these registers.
This is done by storing the register value in hdev props ONLY if
register is enabled.
This way if register is NOT enabled all capability bits will not be set.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If device is not idle after user closes the FD we must reset device
as next user that will try to open FD will encounter a non-functional
device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Sometimes it is useful to allow the command to continue running despite
the timeout occurred, to differentiate between really stuck or just very
time consuming commands. This can be achieved by passing a new debug
flag alongside the cs, HL_CS_FLAGS_SKIP_RESET_ON_TIMEOUT.
Anyway, if the timeout occurred, a warning print shall be issued,
however this shall not fail the submission.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order for driver to be aware of process or thread crashes inside
GAUDI's CPU, we introduce a new event which contains all relevant
information. Upon event reception, driver will dump information and
will reset the device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In the collective wait, we put jobs on the QMANs of all the NICs. The
code takes into account if a port is disabled only in case of PCI card.
When this info arrives from the f/w, the code doesn't take it into
account, and it tries to schedule jobs on NICs that aren't enabled and
thats a bug.
To fix this, after the f/w sends us the list of disabled ports, we
update the state of the QMANs according to that list. In addition,
we need to update the HW_CAP bits so the collective wait operation
will not try to use those QMANs. We also need to update the collective
master monitor mask.
Moreover, we need to add a protection for such future cases and in case
the user will try to submit work to those QMANs.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Current implementation uses a single interrupt interface towards
FW, this interface is causing races between interrupt types.
We split this interface to interface per interrupt type.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
There is no dependency when probing multiple devices so indicate to the
kernel that it can probe our devices in ASYNC fashion.
This shortens insmod of the driver from ~2 minutes to 20 seconds on
a system with 8 devices.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Update the QM stop on error masks to also stop on ARB errors.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
nic_ports_mask is used by the networking part of the driver.
In the compute part, we use the HW_CAP bits to select what is active
and what is not.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This fix was applied since there was an incorrect reported CPU ID to GIC
such that an error in MME2 QMAN aliased to be an arriving from DMA0_QM.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
After all the latest changes to the reset code, there were some
redundancy and errors in the flows.
If the Linux FIT is loaded to the ASIC CPU, we need to communicate
with it only via GIC. If it is not loaded, we need to either use
COMMS protocol (for newer f/w) or MSG_TO_CPU register (for older f/w).
In addition, if we halted the device CPU then we need to mark that
the driver will do the reset, regardless of the capabilities.
Also, to prevent false errors, we need to keep track whether the
device CPU was already halted. If so, we shouldn't try to halt it
again.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>