IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The register which needs to be cleared is the valid register instead
of the address.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Fix an issue in hard reset flow in which the driver didn't send a
disable pci message if there was an active compute context.
In hard reset, disable pci message should be sent no matter if
a compute context exists or not.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
The disable pci message is sent in reset device. It informs the FW not
to raise more EQs. The Driver may ignore received EQs, when the device
is in disabled mode.
The duplication happens when hard reset is scheduled during compute
reset and also performs 'escalate_reset_flow'.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
COMMS protocol is used for LKD <--> FW communication, and any
communication failure between the two might turn out to be
destructive, hence, it should be well emphasized.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
since the function might fail and we should propagate the failure.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
When the device is in disabled state, the driver isn't suppose to
receive any events from FW. Printing the event type, as part of the
message that was already printed, shall help to get more info if this
unexpected message is received.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Once a memory mapping is added to the page tables, it's followed by
a TLB invalidation request which could potentially fail (HW failure).
Removing the mapping is simply a part of this failure handling routine.
TLB invalidation failure prints were updated to be more accurate.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;
pci_read_config_word(dev, PCI_COMMAND, &pci_command);
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
pci_write_config_word(dev, PCI_COMMAND, pci_command);
}
pcibios_disable_device(dev);
}.
And dev->is_busmaster is set to 0 in pci_disable_device.
Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
HWmon core receives an array of pointers to hwmon_channel_info and it
does not modify it, thus it can be array of const pointers for safety.
This allows drivers to make them also const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
As mmu refactor and nic resume are not relevant anymore, remove
their TODO comments.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
FW had one data route for tpc0 and tpc1 when running in secured mode
and a different one when running without secured mode. After fw fixed
this issue, both mode have the same data path.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order for the user to be aware of unexpected events in Gaudi2 that
aren't assigned to a specific engine, we are adding the handling of
this dedicated interrupt.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Replace initialization of "struct cpucp_packet" from "{0} to "{}" to
avoid a "missing braces around initializer" compilation warning.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
After getting page fault in gaudi2, we need to clear the valid bit
instead of the address.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
All engine masks are exposed to user, make sure user gets the
correct rotator enabled mask in gaudi2.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Some names of events has been modified/added.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We expose this in order for user applications to know how much dram
is reserved for internal use.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Remove all '\n' from strings which are passed as arguments to
gaudi2_print_event(), because the newline character is added internally
in this function.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Now that CQ-completion based jobs do not trigger a reset upon failure,
failure of such jobs (e.g., MMU cache invalidation) should be handled
by the caller itself depending on the error code returned to it.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Since the err_cause register is unprivileged, we should read it from
the driver instead of using the param that came from the FW.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Due to a firmware bug we need to increase reset poll timeout
or else we will timeout in secured environments.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
There is a spelling mistake in a dev_err message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something. So just remove it and fix up all callers of the function in
the kernel tree at the same time.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages. Since
commit f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native"),
the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.
Remove the redundant pci_enable_pcie_error_reporting() call from the
driver. Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.
Note that this only controls ERR_* Messages from the device. An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.
The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.
Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We can allow userspace to query the dram usage during soft-reset.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Is appears that the flag -
DCORE0_TPC0_CFG_STATUS_VECTOR_PIPE_EMPTY_MASK, has no actual use when
it comes to querying TPC idleness, since this flag's corresponding bit
turns-off after stalling the engine, and turns back on after resuming
it.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add notifications to user in case of decoder abnormal interrupts, and
use the graceful reset mechanism if reset is required.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
is_idle() was too long, so break it up for readability.
In addition, we can now use the new sub-routines from other places.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Compute driver threads names will start with hlX-*, when X is the
device id.
This will help distinguish them from the NIC thread names.
Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add a helper function to search the vm hash for a node with a given
virtual address.
As opposed to the current code, this function explicitly returns NULL
when no node is found, instead of basing on the loop cursor object's
value.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
'irq_handler' in gaudi2_enable_msix(), is just assigned with a function
name and then used when calling request_threaded_irq().
Remove the variable and use the function name directly as an argument.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
We later use cpucp packet for soft reset which might fail
so we should be able propagate the failure case.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Remove leading zeroes when printing the idle mask to make it clearer.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Make the comments align with the order of the fields in the structure
Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>