AMD Zen-based systems report memory errors through Machine Check banks representing Unified Memory Controllers (UMCs). The address value reported for DRAM ECC errors is a "normalized address" that is relative to the UMC. This normalized address must be converted to a system physical address to be usable by the OS.

Support for this address translation was introduced to the MCA subsystem with Zen1 systems. The code was later moved to the AMD64 EDAC module, since this was the only user of the code at the time.

However, there are uses for this translation outside of EDAC. The system physical address can be used in MCA for preemptive page offlining, as done in some MCA notifier functions. This translation is also needed as the basis of similar functionality for some CXL configurations on AMD systems.

Introduce a common address translation library that can be used by multiple subsystems, including MCA, EDAC, and CXL. Include support for UMC normalized to system physical address translation for current CPU systems.

The Data Fabric Indirect register access offsets and one of the register fields were changed. Default to the current offsets and register field definition, and fall back to the older values if running on a "legacy" system.

Provide built-in code to facilitate the loading and unloading of the library module without affecting other modules or built-in code.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240123041401.79812-2-yazen.ghannam@amd.com
38 lines
1.5 KiB
Plaintext
# SPDX-License-Identifier: GPL-2.0-only
menuconfig RAS
	bool "Reliability, Availability and Serviceability (RAS) features"
	help
	  Reliability, availability and serviceability (RAS) is a computer
	  hardware engineering term. Computers designed with higher levels
	  of RAS have a multitude of features that protect data integrity
	  and help them stay available for long periods of time without
	  failure.

	  Reliability can be defined as the probability that the system will
	  produce correct outputs up to some given time. Reliability is
	  enhanced by features that help to avoid, detect and repair hardware
	  faults.

	  Availability is the probability a system is operational at a given
	  time, i.e. the amount of time a device is actually operating as the
	  percentage of total time it should be operating.

	  Serviceability or maintainability is the simplicity and speed with
	  which a system can be repaired or maintained; if the time to repair
	  a failed system increases, then availability will decrease.

	  Note that Reliability and Availability are distinct concepts:
	  Reliability is a measure of the ability of a system to function
	  correctly, including avoiding data corruption, whereas Availability
	  measures how often it is available for use, even though it may not
	  be functioning correctly. For example, a server may run forever and
	  so have ideal availability, but may be unreliable, with frequent
	  data corruption.

if RAS

source "arch/x86/ras/Kconfig"
source "drivers/ras/amd/atl/Kconfig"

endif