IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Added function to calculate possible engines which caused
RAZWI (read-only zero, write ignored), from a given router id or
module index.
When getting RAZWI via PSOC IP, first the router id is calculated
and then the possible engines that caused the RAZWI are calculated.
There is a possibility that the RAZWI initiator is not an engine. In
that case, it will not be included in possible engines as it
doesn't have an engine id.
RAZWI information is captured when receiving event from engine or via
PSOC IP.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
'struct hl_device_reset_work' is used as a wrapper for the reset work
and its parameters, including the reset workqueue on which it runs.
In a future commit, another reset related work with similar parameters
is going to be added, but it won't use the reset workqueue.
As in any case there is a single reset workqueue, and to allow the resue
of this structure, move the reset workqueue to 'struct hl_device'.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Unregistering eventfd is for releasing host resources and doesn't
involve an access to the device. As such, there is no reason to disallow
it when device isn't operational.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If reset upon device release is enabled, there is no need to check the
device idle status in hpriv_release(), because device is going to be
reset in any case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Device unavailable notifies the user that there isn't an option to
retrieve debug information from the device.
When a critical device error occurs and the f/w performs the device
reset, a device unavailable notification shall be sent to the user
process.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Privileged MME clock configuration is removed as it is done by the f/w.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
pf was an abbreviation for prefetch but because pf already stands
for 'physical function', we decided to change it to 'prefetch'.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Only the first page fault will be saved.
Besides the address which caused the page fault, the driver captures
all of the mmu user mappings.
User can retrieve this data via the new uapi (new opcode in INFO ioctl).
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
RAZWI is optionally handled as part of the generic QM SEI error
handling, but it always uses PDMA as the module ID.
Fix it to use the suitable module ID according to the specific event.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This event notification was compatible only with gaudi, where razwi
and page fault happens together.
To make it compatible with all ASICs, this refactor contains:
1. Razwi notification will only notify about razwi info.
New notification will be added in future patch, to retrieve data
about page fault error.
2. Changed razwi info structure to support all ASICs.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Monitoring apps would like to query device state at any time so we
should allow it also during reset because it doesn't involve
accessing the h/w.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If hl_cpu_accessible_dma_pool_alloc() fails, we should check
'req_cpu_addr', fix it.
Fixes: 0c88760f8f5e ("habanalabs/gaudi2: add secured attestation info uapi")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When doing sizeof() and giving as argument a dereference of
a pointer-to-a-pointer object, clang will issue a warning.
Eliminate the warning by passing struct <name>*
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
I$ prefetch is enabled when sending a TPC kernel to initialize the TPC
memory, and it has a restriction that the base address will be aligned
to 8KB.
Currently the base address is 128 bytes from the start address of the
device SRAM, so prefetching will start 128 bytes before the actual
kernel memory.
Modify the kernel address to be 8KB aligned.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
To be forward-backward compatible with the firmware in the initial
communication during preboot, we need to remove the validation of the
header size. This will allow us to add more fields to the
lkd_fw_comms_desc structure.
Instead of the validation of the header size, we just print warning
when some mismatch in descriptor has been revealed, and we calculate
the CRC base on descriptor size reported by the firmware instead of
calculating it ourselves.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Newer ASICs code changes more often, has more chance to fail
compilation. So, let's compile them first so errors in those files
will fail compilation sooner.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order for the user to flush PCIE he needs to read some register
from PCIE block. The chosen register is SPECIAL_GLBL_SPARE_0 and
hence needs to be unsecured.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If the user wants to open the device, and the device is currently in
reset, the user will get an error from the open().
We don't need to display an error in the dmesg for that as it is
not a real error and we can spam the kernel log with this message.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The void pointer object can be directly assigned to different structure
objects, it does not need to be cast.
Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
User will provide a nonce via the ioctl, and will retrieve
secured attestation data of the boot, generated using given
nonce.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order to get the error cause and the captured address in case of
page fault, added pmmu events to eqe handler.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This change is done while there is a problem to use QMAN error for
TPC assert async. The problem involves security limitation that exists
to generate the assert via QMAN error.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As a preparation for adding more errors to it,
change to more suitable name.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Get the firmware reset status address from the dynamic registers
we read from the firmware instead of using a define.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The access to the device registers is blocked during hard reset, until
preboot runs and allows the access to specific registers, including the
PSOC BTM_FSM register which is used to know when the reset is done.
Between the reset request and until this register is polled there is a
small delay of 500 msec which is not enough for F/W to process the reset
and for preboot to run, so the register might be accessed while it is
blocked.
To avoid it, increase the delay to 2 sec.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add the dump of the RAZWI information when a PCIe access is blocked by
RR.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The code used the mmu mutex to protect access to the context's page
tables and invalidation of the MMU cache. Because pgt are per
context, the mmu mutex was a member of the context object.
The problem is that the device has a single MMU invalidation h/w
(per MMU). Therefore, the mmu mutex should not be a property of the
context but a property of the device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add new notifier events that inform several device states.
General H/W error raised on device general H/W error occurs.
User engine error is raised when a device engine informs of an error.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Current code does not takes into account the new DRAM region base
and so calculated address is wrong and can lead to crush.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Firmware now responds with a more detailed cpucp return codes.
Driver can now distinguish between error and debug return codes.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
F/W security status might change after every reset.
Add the reading of the preboot status to the hard reset sequence, which
among others reads this security indication.
As this preboot status reading includes the waiting for the preboot to
be ready, it can be removed from the CPU init which is done in a later
stage.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Current description is misleading hence we rename it to a more
suitable error description.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
cb_map_mem() uses gen_pool_alloc() to get virtual address for
mapping a CB.
The mapping is done in chunks of page size, so if the CB size is
larger, it is possible that the allocated virtual addresses won't
be consecutive.
User retrieves this device VA which returns the virtual address
in the first va_block. If there is a "hole" in the virtual addresses,
user can configure a HW block with a bad device VA.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
'Device activity open packet' should be sent outside of mutex as
there is no real necessity for a lock.
In addition 'device activity close packet' should be sent upon an
actual release of the device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As part of the RAS that is done by the f/w, we should send a message
to the f/w when a user either acquires or releases the device.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order to improve debuggability, we add all available information
when a RAZWI event occur.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
When we have a storm of errors of HBM ECC SERR we can reach a situation
where driver start hard reset flow without logging the error cause
that caused the hard reset due to logs rate limiting.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
EEPROM errors reported by firmware are basically warnings and
should not fail the boot process.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Except Goya, none of our ASICs require context switch flow, hence we
enable this flow only where it is needed.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Set the addresses for userspace command buffer dynamically
instead of hard-coded. There is no reason for it to
be hard-coded.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This patch add tracepoints in the code for DMA allocation.
The main purpose is to be able to cross data with the map operations and
determine whether memory violation occurred, for example free DMA
allocation before unmapping it from device memory.
To achieve this the DMA alloc/free code flows were refactored so that a
single DMA tracepoint will catch many flows.
To get better understanding of what happened in the DMA allocations
the real allocating function is added to the trace as well.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This patch utilize the defined tracepoint to trace the MMU's pages
map/unmap operations.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This patch adds trace events for habanalabs driver to gain all the
benefits such an infrastructure can supply.
The following events were added:
- MMU map/unmap: to be able to track driver's memory allocations
- DMA alloc/free: to track our DMA allocation
the above trace points in conjunction will help us map the device memory
usage as well as to be able to track memory violations.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The calculation of the device DRAM base address before setting the
relevant PCIe BAR to point at it, has an assumption that this BAR is
used to access only the DRAM, and thus the covered DRAM size is a power
of 2.
In future ASICs it is not necessarily true, so need to update the
calculation to support also a non-power-of-2 size.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>