linux

iv/linux

Author	SHA1	Message	Date
Yuri Nudelman	938b793fde	habanalabs: expose state dump To improve the user's ability to debug the case where a workload that is part of executing training/inference of a topology is getting stuck, we need to add a 'core dump' each time a CS times-out. The 'core dump' shall contain all relevant Sync Manager information and corresponding fence values. The most recent dumps shall be accessible via debugfs, under 'state_dump' node. Reading from the node will provide the oldest dump available. Writing an integer value X will discard X dumps, starting with the oldest one, i.e. subsequent read will now return newer dumps. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:46 +03:00
Oded Gabbay	e79e745b20	habanalabs: use get_task_pid() to take PID The previous function we used, find_get_pid(), wasn't good in case the user process was run inside docker. As a result, we didn't had the PID and we couldn't kill the user process in case the device got stuck and we needed to reset the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	fbcd0efefc	habanalabs: allow disabling huge page use Sometimes we may need to disable optimization of using huge pages in our memory management code. Add such a flag to the function that creates the list of physical pages that would be programmed into the device MMU. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	00ce06539c	habanalabs: user mappings can be 64-bit Increase the size variable in the userptr structure to 64-bit. That variable describes the size of the memory allocation of the user that is now being mapped into the device. The mapping can be larger than 4GB, so we need to support it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	429d77ca27	habanalabs: handle case of interruptable wait Same as we handle it in the regular wait for CS, we need to handle the case where the waiting for user interrupt was interrupted. In that case, we need to return correct error code to the user. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	b07e6c7ef5	habanalabs: release pending user interrupts on device fini In device fini there was missing a call to release all pending user interrupts. That can cause a process to be stuck inside the driver's IOCTL of wait for interrupts, in case the device is removed or simulator is killed at the same time. In addition, also call to remove inactive codec job was missing. Moreover, to prevent such errors in the future (where code is added to reset path but not to device fini), we moved some common parts to two dedicated functions: cleanup_resources take_release_locks Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	d5546d78ad	habanalabs: re-init completion object upon retry In case user interrupt arrived but the completion value is less than the target value, we want to retry the wait. However, before the retry we must reinitialize the completion object, under spin-lock, so the wait function won't exit immediately because the completion object is already completed (from the previous interrupt). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Oded Gabbay	82629c71c2	habanalabs: rename enum vm_type_t to vm_type We don't use typedefs so the enum name shouldn't end with _t Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:45 +03:00
Ofir Bitton	c67b0579b8	habanalabs: update firmware header files Update recent changes made in firmware header files, which contain a minor COMMS protocol change and new error status definitions. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:44 +03:00
Yuri Nudelman	486e19795f	habanalabs: allow fail on inability to respect hint A new user flag is required to make memory map hint mandatory, in contrast to the current situation where it is best effort. This is due to the requirement to map certain data to specific pre-determined device virtual address ranges. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:44 +03:00
farah kassabri	1ae32b9094	habanalabs: support hint addresses range reservation Add support for pre-determined driver-reserved device VA address ranges. This is needed for future ASIC support where some contents must be mapped into these pre-determined ranges because the H/W will be configured using these ranges. In case the user asks to map a VA without a hint address, avoid allocating the device VA from the reserved ranges. Make sure the validation checks of the hint address take into account situation where the DRAM page size is not pow of 2. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-08-29 09:47:44 +03:00
Koby Elbaz	b7a71fddc0	habanalabs/gaudi: refactor hard-reset related code There is code related to hard-reset, which is done in gaudi specific code. However, this code can be used by future ASICs and therefore it is better to move it to the common code section. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-21 10:21:51 +03:00
Ofir Bitton	6c31f494d8	habanalabs/gaudi: add support for NIC DERR We add support for NIC DERR ECC error events, in case this error is received a device reset will be performed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-21 10:21:28 +03:00
farah kassabri	3817b352aa	habanalabs: add validity check for signal cs In preparation for a new feature that allows the user to reserve signals ahead of submissions, we need to change a current assumption in the code. Currently, the driver uses 2 SOBs to support signal CS. When the first SOB reaches max value, the driver switches to the other one and assumes that when it will need to switch back to the first one, all of the signals have already been handled. This assumption won't hold when the new feature will be added, because using signal reservation, the driver can reach the max SOB value very fast. The change is to add a validity check when submitting a signal CS, to make sure the previous SOB is available (all the signals attached to it indeed finished). Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-21 10:16:52 +03:00
Koby Elbaz	69dbbbadad	habanalabs: get lower/upper 32 bits via masking fix multiple similar occurrences of the following sparse warning: 'warning: cast truncates bits from constant value (7ffc113000 becomes fc113000)' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-21 10:16:29 +03:00
Ofir Bitton	23bace677a	habanalabs: allow reset upon device release We introduce a new type of reset which is reset upon device release. This reset is very similar to soft reset except the fact it is performed only upon device release and not upon user sysfs request nor TDR. The purpose of this reset is to make sure the device is returned to IDLE state after the current user has finished working with the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-21 10:14:34 +03:00
Yuri Nudelman	4d041216c8	debugfs: add skip_reset_on_timeout option To be able to debug long-running CS better, without changing the userspace code, we are adding a new option through debugfs interface to skip the reset of the device in case of CS timeout. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:43 +03:00
Zvika Yehudai	38e19d0b87	habanalabs: fix typo fix a spelling error in comment Signed-off-by: Zvika Yehudai <zyehudai@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Ofir Bitton	7d5ba005cf	habanalabs/gaudi: correct driver events numbering Currently driver sends fc interrupt id to FW instead of using cpu interrupt id. We intend to fix that and keep backward compatibility by using the same interrupt values. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Oded Gabbay	5bdc657320	habanalabs: remove a rogue #ifdef There was a rogue #ifdef that crept into the upstream code for backwards compatibility which isn't needed of course. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Ohad Sharabi	2718e1d322	habanalabs/gaudi: print last QM PQEs on error In case QMAN has an error and stop_on_err is true, print specific information of the "offending" command buffer batch. If the error occurred on one of the higher CPs, the CQ pointer and size will be printed along with (up to) last 8 PQEs of the stream. If the error occurred in the lower CP, the CQ pointer and size will be printed along with (up to) last 8 PQEs of ALL upper CPs as we have no way to know which upper CP sent the job there. This is done so higher SW levels will be able to debug their CS by extracting the raw data of the offending command buffer batch and examine those offline to detect the issue. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Koby Elbaz	f18cb6b58e	habanalabs/goya: add '__force' attribute to suppress false alarm fix (suppress) the following sparse warnings: 'warning: cast removes address space of expression' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Yuri Nudelman	e307b302be	habanalabs: added open_stats info ioctl In a system with multiple ASICs, there is a need to provide monitoring tools with information on how long a device was opened and how many times a device was opened. Therefore, we add a new opcode to the INFO ioctl to provide that information. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Koby Elbaz	1f7ef4bf41	habanalabs/gaudi: set the correct rc in case of err fix the following smatch warnings: gaudi_internal_cb_pool_init() warn: missing error code 'rc' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Tal Albo	ba662265fe	habanalabs/gaudi: update coresight configuration Update STMTCSR and STMSYNCR values in order to reduce amount of sync packets Signed-off-by: Tal Albo <talbo@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Koby Elbaz	f5eb7bf0c4	habanalabs: remove node from list before freeing the node fix the following smatch warnings: goya_pin_memory_before_cs() warn: '&userptr->job_node' not removed from list gaudi_pin_memory_before_cs() warn: '&userptr->job_node' not removed from list Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Koby Elbaz	11d5cb8b95	habanalabs: set rc as 'valid' in case of intentional func exit fix the following smatch warnings: hl_fw_static_init_cpu() warn: missing error code 'rc' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Koby Elbaz	b538888c3e	habanalabs: zero complex structures using memset fix the following sparse warnings: 'warning: Using plain integer as NULL pointer' 'warning: missing braces around initializer' Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Tomer Tayar	f5d6e39eb2	habanalabs: print more info when failing to pin user memory pin_user_pages_fast() might fail and return a negative number, or pin less pages than requested and return the number of the pages that were pinned. For the latter, it is informative to print also the memory size and the number of requested pages. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Christophe JAILLET	3002f467a0	habanalabs: Fix an error handling path in 'hl_pci_probe()' If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it must be undone by a corresponding 'pci_disable_pcie_error_reporting()' call, as already done in the remove function. Fixes: 2e5eda4681f9 ("habanalabs: PCIe Advanced Error Reporting support") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Oded Gabbay	c9d2f5cf27	habanalabs: print firmware versions Firmware in habanalabs devices is composed of several components. During device initialization, we read these versions from the device. Print them during device initialization to allow better visibility in automated systems. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Omer Shpigelman	4efb6b2b46	habanalabs: add hard reset timeout for PLDM Hard reset flow on PLDM might take more than 2 minutes. Hence add a dedicated hard reset timeout of 6 minutes for PLDM. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:42 +03:00
Bharat Jauhari	4b09901cf7	habanalabs: enable dram scramble before linux f/w In current code, for dynamic f/w loading flow, DRAM scrambling is enabled post Linux fit image is loaded to the card. This can cause the device CPU to go into reset state. The correct sequence should be: 1. Load boot fit image 2. Enable scrambling 3. Load Linux fit image This commit aligns the DRAM scrambling enabling with the static f/w load flow. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ofir Bitton	358526be82	habanalabs: enable stop on error for all QMANs and engines If there is an error in the QMAN/engine, there is no point of trying to continue running the workload. It is better to stop to allow the user to debug the program. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ohad Sharabi	e1222c2794	habanalabs: report EQ fault during heartbeat In case we have EQ fault we would like to know about it. For this, a status bitmask was added in which EQ_FAULT bit is set by FW in case of EQ fault. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Koby Elbaz	12d133deb3	habanalabs: small code refactoring Use datatype defines instead of hard coded values, and rename set_fixed_properties function. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Oded Gabbay	f1a29770b2	habanalabs/gaudi: use standard error codes When there is an ECC error in the HBM, return a standard error code, -EIO in this case, and not a positive value. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ohad Sharabi	0f37510ca3	habanalabs: fix mask to obtain page offset When converting virtual address to physical we need to add correct offset to the physical page. For this we need to use mask that include ALL bits of page offset. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ohad Sharabi	6a785e368a	habanalabs: skip valid test for boot_dev_sts regs Get rid of the need to check if boot_dev_sts is valid on every access to value read from these registers. This is done by storing the register value in hdev props ONLY if register is enabled. This way if register is NOT enabled all capability bits will not be set. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ofir Bitton	84586de496	habanalabs: reset device upon FD close if not idle If device is not idle after user closes the FD we must reset device as next user that will try to open FD will encounter a non-functional device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Yuri Nudelman	8e8125f192	habanalabs: add debug flag to prevent failure on timeout Sometimes it is useful to allow the command to continue running despite the timeout occurred, to differentiate between really stuck or just very time consuming commands. This can be achieved by passing a new debug flag alongside the cs, HL_CS_FLAGS_SKIP_RESET_ON_TIMEOUT. Anyway, if the timeout occurred, a warning print shall be issued, however this shall not fail the submission. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ofir Bitton	254fac6d1a	habanalabs/gaudi: add FW alive event support In order for driver to be aware of process or thread crashes inside GAUDI's CPU, we introduce a new event which contains all relevant information. Upon event reception, driver will dump information and will reset the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ofir Bitton	a39725819c	habanalabs/gaudi: don't use disabled ports in collective wait In the collective wait, we put jobs on the QMANs of all the NICs. The code takes into account if a port is disabled only in case of PCI card. When this info arrives from the f/w, the code doesn't take it into account, and it tries to schedule jobs on NICs that aren't enabled and thats a bug. To fix this, after the f/w sends us the list of disabled ports, we update the state of the QMANs according to that list. In addition, we need to update the HW_CAP bits so the collective wait operation will not try to use those QMANs. We also need to update the collective master monitor mask. Moreover, we need to add a protection for such future cases and in case the user will try to submit work to those QMANs. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Oded Gabbay	5a967fb3a7	habanalabs/gaudi: update to latest f/w specs Update the firmware interface files to their latest version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Ofir Bitton	5bc691d849	habanalabs/gaudi: split host irq interfaces towards FW Current implementation uses a single interrupt interface towards FW, this interface is causing races between interrupt types. We split this interface to interface per interrupt type. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Oded Gabbay	135ade0c6a	habanalabs: prefer ASYNC device probing There is no dependency when probing multiple devices so indicate to the kernel that it can probe our devices in ASYNC fashion. This shortens insmod of the driver from ~2 minutes to 20 seconds on a system with 8 devices. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:41 +03:00
Tomer Tayar	ae151bcfab	habanalabs/gaudi: add ARB to QM stop on error masks Update the QM stop on error masks to also stop on ARB errors. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:40 +03:00
Oded Gabbay	9081021029	habanalabs/gaudi: don't use nic_ports_mask in compute nic_ports_mask is used by the networking part of the driver. In the compute part, we use the HW_CAP bits to select what is active and what is not. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:40 +03:00
Koby Elbaz	b92c637c5f	habanalabs/gaudi: set the correct cpu_id on MME2_QM failure This fix was applied since there was an incorrect reported CPU ID to GIC such that an error in MME2 QMAN aliased to be an arriving from DMA0_QM. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:40 +03:00
Oded Gabbay	a60d075c81	habanalabs/gaudi: refactor reset code After all the latest changes to the reset code, there were some redundancy and errors in the flows. If the Linux FIT is loaded to the ASIC CPU, we need to communicate with it only via GIC. If it is not loaded, we need to either use COMMS protocol (for newer f/w) or MSG_TO_CPU register (for older f/w). In addition, if we halted the device CPU then we need to mark that the driver will do the reset, regardless of the capabilities. Also, to prevent false errors, we need to keep track whether the device CPU was already halted. If so, we shouldn't try to halt it again. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-06-18 15:23:40 +03:00

1 2 3 4 5 ...

727 Commits