4f3fa571a4
Document High Bandwidth Memory (HBM) and AMD heterogeneous system topology and enumeration. [ bp: Simplify and de-marketize, unify, massage. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-4-muralimk@amd.com
Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used at the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc...

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick. These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

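The grouping arithmetic for memory devices can be sketched in a few lines
of Python. The function name ``devices_per_rank`` and the 72-bit default
are illustrative assumptions and not part of any EDAC interface:

.. code-block:: python

    def devices_per_rank(device_width, bus_width=72):
        """Return how many DRAM devices of device_width bits fill the bus."""
        if device_width <= 0 or bus_width % device_width != 0:
            raise ValueError("bus width must be a positive multiple of the device width")
        return bus_width // device_width


    # A 72-bit bus (64 data bits + 8 ECC bits) built from x4 or x8 devices.
    x4_devices = devices_per_rank(4)   # 18 devices
    x8_devices = devices_per_rank(8)   # 9 devices
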
* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel. In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
on the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. Due to that, lockstep mode
is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the Memory Controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row. A single socket
set has two chip-select rows and, if double-sided sticks are used, these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.

* High Bandwidth Memory (HBM)

HBM is a new memory type with low power consumption and ultra-wide
communication lanes. It uses vertically stacked memory chips (DRAM dies)
interconnected by microscopic wires called "through-silicon vias," or
TSVs.

Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
interconnect called the "interposer". Therefore, HBM's characteristics
are nearly indistinguishable from on-chip integrated RAM.

Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controllers are allocated with :c:func:`edac_mc_alloc`. It internally
uses the struct ``mem_ctl_info`` to describe the memory controllers, which is
an opaque struct for the EDAC drivers. Only the EDAC core is allowed to touch
it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It will use the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation at sysfs.

This set of structures, and the code that implements the APIs for them,
provides for registering EDAC type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

    /sys/devices/system/edac/..

    pci/            <existing pci directory (if available)>
    mc/             <existing memory device directory>
    cpu/cpu0/..     <L1 and L2 block directory>
            /L1-cache/ce_count
                     /ue_count
            /L2-cache/ce_count
                     /ue_count
    cpu/cpu1/..     <L1 and L2 block directory>
            /L1-cache/ce_count
                     /ue_count
            /L2-cache/ce_count
                     /ue_count
    ...

The L1 and L2 directories would be "edac_device_block's".

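As a rough user-space illustration, the per-block counters in a layout like
this one could be collected with a short Python helper. This is only a
sketch: the helper name ``read_block_counters`` is hypothetical, the exact
directory names depend on the driver that registered the device, and the
demo runs against a throwaway directory tree rather than a real sysfs:

.. code-block:: python

    import os
    import tempfile


    def read_block_counters(root):
        """Map each directory under root holding ce_count/ue_count to its counts."""
        counters = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            if "ce_count" in filenames and "ue_count" in filenames:
                counts = {}
                for name in ("ce_count", "ue_count"):
                    with open(os.path.join(dirpath, name)) as f:
                        counts[name] = int(f.read().strip())
                counters[os.path.relpath(dirpath, root)] = counts
        return counters


    # Demo on a temporary tree that mimics the block directories (made-up values).
    fake_blocks = {
        os.path.join("cpu0", "L1-cache"): (3, 0),
        os.path.join("cpu0", "L2-cache"): (1, 1),
    }
    with tempfile.TemporaryDirectory() as root:
        for block, (ce, ue) in fake_blocks.items():
            os.makedirs(os.path.join(root, block))
            for name, value in (("ce_count", ce), ("ue_count", ue)):
                with open(os.path.join(root, block, name), "w") as f:
                    f.write(f"{value}\n")
        demo = read_block_counters(root)

On a real system the helper would be pointed at the relevant directory
below ``/sys/devices/system/edac/`` instead of the temporary tree.
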
.. kernel-doc:: drivers/edac/edac_device.h


Heterogeneous system support
----------------------------

An AMD heterogeneous system is built by connecting the data fabrics of
both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
GPU nodes can be accessed the same way as the data fabric on CPU nodes.

The MI200 accelerators are data center GPUs. They have 2 data fabrics,
and each GPU data fabric contains four Unified Memory Controllers (UMC).
Each UMC contains eight channels. Each UMC channel controls one 128-bit
HBM2e (2GB) channel (equivalent to 8 x 2GB ranks). This creates a total
DRAM data bus width of 4096 bits.

While the UMC is interfacing a 16GB (8-high x 2GB DRAM) HBM stack, each UMC
channel is interfacing 2GB of DRAM (represented as a rank).

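The figures above can be cross-checked with a little arithmetic. The
constant names below are made up for illustration, and the sketch assumes
that the 4096-bit total is counted per data fabric (4 UMCs x 8 channels x
128 bits):

.. code-block:: python

    UMCS_PER_DATA_FABRIC = 4
    CHANNELS_PER_UMC = 8
    CHANNEL_WIDTH_BITS = 128
    CHANNEL_SIZE_GB = 2

    # Channels and bus width behind one GPU data fabric.
    channels = UMCS_PER_DATA_FABRIC * CHANNELS_PER_UMC          # 32
    bus_width_bits = channels * CHANNEL_WIDTH_BITS              # 4096

    # Capacity behind one UMC (one HBM stack) and behind one data fabric.
    umc_size_gb = CHANNELS_PER_UMC * CHANNEL_SIZE_GB            # 16
    data_fabric_size_gb = UMCS_PER_DATA_FABRIC * umc_size_gb    # 64
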
Memory controllers on AMD GPU nodes can be represented in EDAC thusly::

    GPU DF / GPU Node -> EDAC MC
    GPU UMC           -> EDAC CSROW
    GPU UMC channel   -> EDAC CHANNEL

For example, a heterogeneous system with one AMD CPU is connected to
four MI200 (Aldebaran) GPUs using xGMI.

Some more heterogeneous hardware details:

- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
  They have chip selects (csrows) and channels. However, the layouts are
  different for performance, physical layout, or other reasons.
- CPU UMCs use 1 channel. In this case, UMC = EDAC channel. This follows the
  marketing speak: CPU has X memory channels, etc.
- CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
- GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
- GPU UMCs use 8 channels, so UMC channel = EDAC channel.

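The mapping rules in this list can be sketched as a small Python helper.
The function ``edac_location`` and its parameters are hypothetical and only
restate the rules above. They do not correspond to any kernel interface:

.. code-block:: python

    def edac_location(kind, umc, chip_select, umc_channel):
        """Translate a UMC-relative location into an (EDAC csrow, EDAC channel) pair."""
        if kind == "cpu":
            # CPU: the UMC is the EDAC channel, the chip select is the csrow.
            return chip_select, umc
        if kind == "gpu":
            # GPU: the UMC is the csrow, the UMC channel is the EDAC channel.
            return umc, umc_channel
        raise ValueError(f"unknown node kind: {kind!r}")


    cpu_location = edac_location("cpu", umc=2, chip_select=1, umc_channel=0)
    gpu_location = edac_location("gpu", umc=3, chip_select=0, umc_channel=7)

Here ``cpu_location`` is ``(1, 2)`` (csrow 1, channel 2) and
``gpu_location`` is ``(3, 7)`` (csrow 3, channel 7).
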
The EDAC subsystem provides a mechanism to handle AMD heterogeneous
systems by calling system specific ops for both CPUs and GPUs.

AMD GPU nodes are enumerated in sequential order based on the PCI
hierarchy, and the first GPU node is assumed to have a Node ID value
following those of the CPU nodes after the latter are fully populated::

    $ ls /sys/devices/system/edac/mc/
        mc0   - CPU MC node 0
        mc1  |
        mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
        mc3  |
        mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
        mc5  |
        mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
        mc7  |
        mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)

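The numbering in this listing follows a simple rule that can be written
down as a sketch. The helper name ``gpu_mc_index`` is hypothetical, and it
assumes that CPU nodes are numbered from 0 and that every GPU card
contributes two nodes, as in the listing:

.. code-block:: python

    def gpu_mc_index(cpu_nodes, card, node, nodes_per_card=2):
        """Return the mcN index for a GPU node, counting after all CPU nodes."""
        if not 0 <= node < nodes_per_card:
            raise ValueError("node index out of range for this card")
        return cpu_nodes + card * nodes_per_card + node


    # One CPU node and four GPU cards, as in the listing.
    layout = {
        f"mc{gpu_mc_index(1, card, node)}": (card, node)
        for card in range(4)
        for node in range(2)
    }

With these inputs, ``layout`` maps ``mc1`` to card 0, node 0 and ``mc8`` to
card 3, node 1.
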
For example, a heterogeneous system with one AMD CPU is connected to
four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
via the following sysfs entries::

    /sys/devices/system/edac/mc/..

    CPU                     # CPU node
    ├── mc 0

    GPU Nodes are enumerated sequentially after CPU nodes have been populated
    GPU card 1              # Each MI200 GPU has 2 nodes/mcs
    ├── mc 1                # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
    │   ├── csrow 0         # UMC 0
    │   │   ├── channel 0   # Each UMC has 8 channels
    │   │   ├── channel 1   # size of each channel is 2 GB, so each UMC has 16 GB
    │   │   ├── channel 2
    │   │   ├── channel 3
    │   │   ├── channel 4
    │   │   ├── channel 5
    │   │   ├── channel 6
    │   │   ├── channel 7
    │   ├── csrow 1         # UMC 1
    │   │   ├── channel 0
    │   │   ├── ..
    │   │   ├── channel 7
    │   ├── ..              ..
    │   ├── csrow 3         # UMC 3
    │   │   ├── channel 0
    │   │   ├── ..
    │   │   ├── channel 7
    │   ├── rank 0
    │   ├── ..              ..
    │   ├── rank 31         # total 32 ranks/dimms from 4 UMCs
    ├
    ├── mc 2                # GPU node 1 == mc2
    │   ├── ..              # each GPU has total 64 GB

    GPU card 2
    ├── mc 3
    │   ├── ..
    ├── mc 4
    │   ├── ..

    GPU card 3
    ├── mc 5
    │   ├── ..
    ├── mc 6
    │   ├── ..

    GPU card 4
    ├── mc 7
    │   ├── ..
    ├── mc 8
    │   ├── ..