linux

iv/linux

Author	SHA1	Message	Date
Linus Torvalds	aa35a4835e	Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Add initial support for RAS hardware found on AMD server GPUs (MI200). Those GPUs and CPUs are connected together through the coherent fabric and the GPU memory controllers report errors through x86's MCA so EDAC needs to support them. The amd64_edac driver supports now HBM (High Bandwidth Memory) and thus such heterogeneous memory controller systems - Other small cleanups and improvements * tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: EDAC/amd64: Cache and use GPU node map EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh EDAC/amd64: Document heterogeneous system enumeration x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors x86/amd_nb: Re-sort and re-indent PCI defines x86/amd_nb: Add MI200 PCI IDs ras/debugfs: Fix error checking for debugfs_create_dir() x86/MCE: Check a hw error's address to determine proper recovery action	2023-06-26 15:09:18 -07:00
Yazen Ghannam	4251566ebc	EDAC/amd64: Cache and use GPU node map AMD systems have historically provided an "AMD Node ID" that is a unique identifier for each die in a multi-die package. This was associated with a unique instance of the AMD Northbridge on a legacy system. And now it is associated with a unique instance of the AMD Data Fabric on modern systems. Each instance is referred to as a "Node"; this is an AMD-specific term not to be confused with NUMA nodes. The data fabric provides a number of interfaces accessible through a set of functions in a single PCI device. There is one PCI device per Data Fabric (AMD Node), and multi-die systems will see multiple such PCI devices. The AMD Node ID matches a Node's position in the PCI hierarchy. For example, the Node 0 is accessed using the first PCI device, Node 1 is accessed using the second, and so on. A logical CPU can find its AMD Node ID using CPUID. Furthermore, the AMD Node ID is used within the hardware fabric, so it is not purely a logical value. Heterogeneous AMD systems, with a CPU Data Fabric connected to GPU data fabrics, follow a similar convention. Each CPU and GPU die has a unique AMD Node ID value, and each Node ID corresponds to PCI devices in sequential order. However, there are two caveats: 1) GPUs are not x86, and they don't have CPUID to read their AMD Node ID like on CPUs. This means the value is more implicit and based on PCI enumeration and hardware-specifics. 2) There is a gap in the hardware values for AMD Node IDs. Values 0-7 are for CPUs and values 8-15 are for GPUs. For example, a system with one CPU die and two GPUs dies will have the following values: CPU0 -> AMD Node 0 GPU0 -> AMD Node 8 GPU1 -> AMD Node 9 EDAC is the only subsystem where this has a practical effect. Memory errors on AMD systems are commonly reported through MCA to a CPU on the local AMD Node. The error information is passed along to EDAC where the AMD EDAC modules use the AMD Node ID of reporting logical CPU to access AMD Node information. However, memory errors from a GPU die will be reported to the CPU die. Therefore, the logical CPU's AMD Node ID can't be used since it won't match the AMD Node ID of the GPU die. The AMD Node ID of the GPU die is provided as part of the MCA information, and the value will match the hardware enumeration (e.g. 8-15). Handle this situation by discovering GPU dies the same way as CPU dies in the AMD NB code. But do a "node id" fixup in AMD64 EDAC where it's needed. The GPU data fabrics provide a register with the base AMD Node ID for their local "type", i.e. GPU data fabric. This value is the same for all fabrics of the same type in a system. Read and cache the base AMD Node ID from one of the GPU devices during module initialization. Use this to fixup the "node id" when reporting memory errors at runtime. [ bp: Squash a fix making gpu_node_map static as reported by Tom Rix <trix@redhat.com>. Link: https://lore.kernel.org/r/20230610210930.174074-1-trix@redhat.com ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-6-muralimk@amd.com	2023-06-19 13:01:44 +02:00
Yazen Ghannam	b34348a0d7	EDAC/amd64: Remove module version string The AMD64 EDAC module version information is not exposed through ABI like MODULE_VERSION(). Instead it is printed during module init. Version numbers can be confusing in cases where module updates are partly backported resulting in a difference between upstream and backported module versions. Remove the AMD64 EDAC module version information to avoid user confusion. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230410190959.3367528-1-yazen.ghannam@amd.com	2023-05-10 15:49:52 +02:00
Muralidhara M K	b3ece3a6a2	EDAC/amd64: Add get_err_info() to pvt->ops GPU Nodes will use a different method to determine the chip select and channel of an error. A function pointer should be used rather than introduce another branching condition. Prepare for this by adding get_err_info() to pvt->ops. This function is only called from the modern code path, so a legacy function is not defined. Make sure to call this after MCA_STATUS[SyndV] is checked, since the csrow value is found in MCA_SYND. [ Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-23-yazen.ghannam@amd.com	2023-03-24 13:03:21 +01:00
Muralidhara M K	f6f36382d6	EDAC/amd64: Split dump_misc_regs() into dct/umc functions Add a function pointer to pvt->ops. No functional change is intended. [ Yazen: Rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-22-yazen.ghannam@amd.com	2023-03-24 13:03:21 +01:00
Muralidhara M K	0a42a37f65	EDAC/amd64: Split setup_mci_misc_attrs() into dct/umc functions The init_one_instance() path is shared between legacy and modern systems. So add the new functions to a function pointer in pvt->ops. No functional change is intended. [ Yazen: Rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-18-yazen.ghannam@amd.com	2023-03-24 13:03:20 +01:00
Muralidhara M K	eb2bcdfc37	EDAC/amd64: Split ecc_enabled() into dct/umc functions Call them using a function pointer in pvt->ops. The "ECC enabled" check is done outside of the hardware information gathering done in hw_info_get(). So a high-level function pointer is needed to separate the legacy and modern paths. No functional change is intended. [Yazen: rebased/reworked patch and reworded commit message. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-17-yazen.ghannam@amd.com	2023-03-24 13:03:20 +01:00
Yazen Ghannam	9a97a7f4d7	EDAC/amd64: Rework hw_info_{get,put} The bulk of system-specific information is gathered at init time with hw_info_get(). This function calls a number of helper functions, and many of these helper functions are split between a modern UMC/DF path and a legacy DCT path. Split hw_info_get() into legacy and modern versions. This creates two separate code paths early on, and legacy and modern helper functions can be called directly in the appropriate code path. Also, simplify hw_info_put() and share it between legacy and modern systems. NULL pointer checks are done in pci_dev_put() and kfree(), so they can be called unconditionally. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-12-yazen.ghannam@amd.com	2023-03-24 13:03:20 +01:00
Muralidhara M K	ed623d55ee	EDAC/amd64: Merge struct amd64_family_type into struct amd64_pvt Future AMD systems will support heterogeneous "AMD Node" types, e.g. CPU and GPU types. Therefore, a global family type shared across all AMD nodes is no longer appropriate. Move struct low_ops routines and members of struct amd64_family_type to struct amd64_pvt. Currently, there are many code branches that split between "modern" and "legacy" systems. Another code branch will be needed in order to cover GPU cases. However, rather than introduce another branching case in multiple functions, the current branching code should be switched to a set of function pointers. This change makes the code more readable and simplifies adding support for new families/models. In order to reuse code, define two sets of function pointers. Use one for modern systems (Family 17h and later). This will not change between current CPU families. Use another set of function pointers for legacy systems (before Family 17h). Use the Family 16h versions as default for the legacy ops since these are the latest, and adjust the function pointers as needed for older families. [ Yazen: rebased/reworked patch and reworded commit message. ] [ bp: Fix rev8 or later check. ] Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-11-yazen.ghannam@amd.com	2023-03-24 13:03:19 +01:00
Yazen Ghannam	c4605bde33	EDAC/amd64: Remove early_channel_count() The early_channel_count() function seems to have been useful in the past for knowing how many EDAC mci structures to populate. However, this is no longer needed as the maximum channel count for a system is used instead. Remove the early_channel_count() helper functions and related code. Use the size of the channel layer when iterating over channel structures. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-6-yazen.ghannam@amd.com	2023-02-09 14:43:39 +01:00
Yazen Ghannam	cf981562e6	EDAC/amd64: Remove PCI Function 0 PCI Function 0 is used on Family 17h and later only to read the "dhar" value. This value is printed and provided through a module-specific debug sysfs file. The value is not used for any Family 17h and later code, and it does not have any apparent debug value on these systems. Remove "dhar", Function 0 PCI IDs, and all related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-5-yazen.ghannam@amd.com	2023-02-09 14:41:56 +01:00
Yazen Ghannam	6229235f7c	EDAC/amd64: Remove PCI Function 6 PCI Function 6 is used on Family 17h and later to access scrub registers. With scrub access removed, this function has no other use. Remove all Function 6 PCI IDs and related code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-4-yazen.ghannam@amd.com	2023-02-09 11:50:52 +01:00
Yazen Ghannam	6e241bc93c	EDAC/amd64: Remove scrub rate control for Family 17h and later The scrub registers on AMD Family 17h and later may be inaccessible to the OS. Furthermore, hardware designers recommend that the scrubbing feature is managed by the firmware. Remove support for the sdram_scrub_rate interface for AMD Family 17h systems and later by not setting the scrub function pointers. The EDAC MC core will then not expose the scrub files in sysfs. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127170419.1824692-3-yazen.ghannam@amd.com	2023-02-09 11:25:21 +01:00
Yazen Ghannam	2151c84ece	EDAC/amd64: Add new register offset support and related changes Introduce a "family flags" bitmask that can be used to indicate any special behavior needed on a per-family basis. Add a flag to indicate a system uses the new register offsets introduced with Family 19h Model 10h. Use this flag to account for register offset changes, a new bitfield indicating DDR5 use on a memory controller, and to set the proper number of chip select masks. Rework f17_addr_mask_to_cs_size() to properly handle the change in chip select masks. And update code comments to reflect the updated Chip Select, DIMM, and Mask relationships. [uninitialized variable warning] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-3-yazen.ghannam@amd.com	2022-02-23 22:01:33 +01:00
Yazen Ghannam	75aeaaf23d	EDAC/amd64: Set memory type per DIMM Current AMD systems allow mixing of DIMM types within a system. However, DIMMs within a channel, i.e. managed by a single Unified Memory Controller (UMC), must be of the same type. Handle this possible configuration by checking and setting the memory type for each individual "UMC" structure. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-2-yazen.ghannam@amd.com	2022-02-23 21:53:45 +01:00
Marc Bevand	0b8bf9cb14	EDAC/amd64: Add support for family 19h, models 50h-5fh Add the new family 19h models 50h-5fh PCI IDs (device 18h functions 0 and 6) to support Ryzen 5000 APUs ("Cezanne"). Signed-off-by: Marc Bevand <m@zorinaq.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20211221233112.556927-1-m@zorinaq.com	2021-12-24 10:53:41 +01:00
Yazen Ghannam	e2be5955a8	EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh Add a new family type for AMD Family 19h Models 10h to 1Fh. Use this new family type for Models A0h to AFh also. Increase the maximum number of controllers from 8 to 12. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208174356.1997855-3-yazen.ghannam@amd.com	2021-12-10 12:54:33 +01:00
Borislav Petkov	61810096de	EDAC/amd64: Merge error injection sysfs facilities Merge them into the main driver and put them inside an EDAC_DEBUG ifdeffery to simplify the driver and have all debugging/injection stuff behind a debug build-time switch. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20201215110517.5215-2-bp@alien8.de	2020-12-28 19:36:25 +01:00
Borislav Petkov	2a28ceef00	EDAC/amd64: Merge sysfs debugging attributes setup code There's no need for them to be in a separate file so merge them into the main driver compilation unit like the other EDAC drivers do. Drop now-unneeded function export, make the function static and shorten static function names. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20201215110517.5215-1-bp@alien8.de	2020-12-28 19:36:17 +01:00
Alexander Monakov	b6bea24d41	EDAC/amd64: Add AMD family 17h model 60h PCI IDs Add support for AMD Renoir (4000-series Ryzen CPUs). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru	2020-05-22 18:43:13 +02:00
Yazen Ghannam	2eb61c91c3	EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh Add family ops to support AMD Family 19h systems. Existing Family 17h functions can be used. Also, add Family 19h to the list of families to automatically load the module. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200110015651.14887-5-Yazen.Ghannam@amd.com	2020-01-16 17:09:23 +01:00
Yazen Ghannam	5e4c55276a	EDAC/amd64: Save max number of controllers to family type The maximum number of memory controllers is fixed within a family/model group. In most cases, this has been fixed at 2, but some systems may have up to 8. The struct amd64_family_type already contains family/model-specific information, and this can be used rather than adding model checks to various functions. Create a new field in struct amd64_family_type for max_mcs. Set this when setting other family type information, and use this when needing the maximum number of memory controllers possible for a system. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-4-Yazen.Ghannam@amd.com	2019-11-06 11:07:01 +01:00
Isaac Vaughn	3e443eb353	EDAC/amd64: Add PCI device IDs for family 17h, model 70h Add the new Family 17h Model 70h PCI IDs (device 18h functions 0 and 6) to the AMD64 EDAC module. [ bp: s/f17_base_addr_to_cs_size/f17_addr_mask_to_cs_size/g ] Signed-off-by: Isaac Vaughn <isaac.vaughn@knights.ucf.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: linux-edac@vger.kernel.org Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190906192131.8ced0ca112146f32d82b6cae@knights.ucf.edu	2019-09-07 07:29:27 +02:00
Yazen Ghannam	81f5090db8	EDAC/amd64: Support asymmetric dual-rank DIMMs Future AMD systems will support asymmetric dual-rank DIMMs. These are DIMMs where the ranks are of different sizes. The even rank will use the Primary Even Chip Select registers and the odd rank will use the Secondary Odd Chip Select registers. Recognize if a Secondary Odd Chip Select is being used. Use the Secondary Odd Address Mask when calculating the chip select size. [ bp: move csrow_sec_enabled() to the header, fix CS_ODD define and tone-down the capitalized words spelling. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-8-Yazen.Ghannam@amd.com	2019-08-23 16:09:52 +02:00
Yazen Ghannam	7574729e91	EDAC/amd64: Cache secondary Chip Select registers AMD Family 17h systems have a set of secondary Chip Select Base Addresses and Address Masks. These do not represent unique Chip Selects, rather they are used in conjunction with the primary Chip Select registers in certain cases. Cache these secondary Chip Select registers for future use. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-7-Yazen.Ghannam@amd.com	2019-08-23 12:55:05 +02:00
Yazen Ghannam	d971e28e2c	EDAC/amd64: Support more than two controllers for chip selects handling The struct chip_select array that's used for saving chip select bases and masks is fixed at length of two. There should be one struct chip_select for each controller, so this array should be increased to support systems that may have more than two controllers. Increase the size of the struct chip_select array to eight, which is the largest number of controllers per die currently supported on AMD systems. Fix number of DIMMs and Chip Select bases/masks on Family17h, because AMD Family 17h systems support 2 DIMMs, 4 CS bases, and 2 CS masks per channel. Also, carve out the Family 17h+ reading of the bases/masks into a separate function. This effectively reverts the original bases/masks reading code to before Family 17h support was added. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-2-Yazen.Ghannam@amd.com	2019-08-22 19:08:49 +02:00
Borislav Petkov	8de9930a46	Revert "EDAC/amd64: Support more than two controllers for chip select handling" This reverts commit `0a227af521`. Unfortunately, this commit caused wrong detection of chip select sizes on some F17h client machines: --- 00-rc6+ 2019-02-14 14:28:03.126622904 +0100 +++ 01-rc4+ 2019-04-14 21:06:16.060614790 +0200 EDAC amd64: MC: 0: 0MB 1: 0MB -EDAC amd64: MC: 2: 16383MB 3: 16383MB +EDAC amd64: MC: 2: 0MB 3: 2097151MB EDAC amd64: MC: 4: 0MB 5: 0MB EDAC amd64: MC: 6: 0MB 7: 0MB EDAC MC: UMC1 chip selects: EDAC amd64: MC: 0: 0MB 1: 0MB -EDAC amd64: MC: 2: 16383MB 3: 16383MB +EDAC amd64: MC: 2: 0MB 3: 2097151MB EDAC amd64: MC: 4: 0MB 5: 0MB EDAC amd64: MC: 6: 0MB 7: 0M Revert it for now until it has been solved properly. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com>	2019-04-25 16:30:34 +02:00
Yazen Ghannam	0a227af521	EDAC/amd64: Support more than two controllers for chip select handling The struct chip_select array that's used for saving chip select bases and masks is fixed at length of two. There should be one struct chip_select for each controller, so this array should be increased to support systems that may have more than two controllers. Increase the size of the struct chip_select array to eight, which is the largest number of controllers per die currently supported on AMD systems. Also, carve out the Family 17h+ reading of the bases/masks into a separate function. This effectively reverts the original bases/masks reading code to before Family 17h support was added. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190228153558.127292-5-Yazen.Ghannam@amd.com	2019-03-27 00:13:25 +01:00
Yazen Ghannam	7835961d37	EDAC/amd64: Recognize x16 symbol size Future AMD systems may support x16 symbol sizes. Recognize if a system is using x16 symbol size. Also, simplify the print statement. Note that a x16 syndrome vector table is not necessary like with x4 or x8 syndromes. This is because systems that support x16 symbol sizes are SMCA systems and in that case, the syndrome can be directly extracted from the MCA_SYND[Syndrome] field. [ bp: massage. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190228153558.127292-4-Yazen.Ghannam@amd.com	2019-03-27 00:13:25 +01:00
Yazen Ghannam	bdcee7747f	EDAC/amd64: Support more than two Unified Memory Controllers The first few models of Family 17h all had 2 Unified Memory Controllers per Die, so this was treated as a fixed value. However, future systems may have more Unified Memory Controllers per Die. Related to this, the channel number and base address of a Unified Memory Controller were found by matching on fixed, known values. However, current and future systems follow this pattern for the channel number and base address of a Unified Memory Controller: 0xYXXXXX, where Y is the channel number. So matching on hardcoded values is not necessary. Set the number of Unified Memory Controllers at driver init time based on the family/model. Also, update the functions that find the channel number and base address of a Unified Memory Controller to support more than two. [ bp: Move num_umcs into the .c file and simplify comment. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190228153558.127292-3-Yazen.Ghannam@amd.com	2019-03-27 00:13:25 +01:00
Yazen Ghannam	6e846239e5	EDAC/amd64: Add Family 17h Model 30h PCI IDs Add the new Family 17h Model 30h PCI IDs to the AMD64 EDAC module. This also fixes a probe failure that appeared when some other PCI IDs for Family 17h Model 30h were added to the AMD NB code. Fixes: `be3518a16e` (x86/amd_nb: Add PCI device IDs for family 17h, model 30h) Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190228153558.127292-1-Yazen.Ghannam@amd.com	2019-03-27 00:13:25 +01:00
Michael Jin	8960de4a5c	EDAC, amd64: Add Family 17h, models 10h-2fh support Add new device IDs for family 17h, models 10h-2fh. This is required by amd64_edac_mod in order to properly detect PCI device functions 0 and 6. Signed-off-by: Michael Jin <mikhail.jin@gmail.com> Reviewed-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20180816192840.31166-1-mikhail.jin@gmail.com Signed-off-by: Borislav Petkov <bp@suse.de>	2018-08-27 06:17:21 +02:00
Borislav Petkov	e62d2ca9d0	EDAC, amd64: Bump driver version Last time we did that was when we enabled Bulldozer. Now, we enabled Zen so it is only natural ... :-) Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>	2017-02-14 11:58:05 +01:00
Yazen Ghannam	1bd9900b83	EDAC, amd64: Add x86cpuid sanity check during init Match one of the devices in amd64_cpuids[] before loading the module. This is an additional sanity check against users trying to load amd64_edac_mod on unsupported systems. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1485537863-2707-9-git-send-email-Yazen.Ghannam@amd.com [ Get rid of err_ret label, make it a bit more readable this way. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2017-01-28 14:43:06 +01:00
Yazen Ghannam	df64636fa4	EDAC, amd64: Remove unused printing macros amd64_{debug,notice} don't have any users, so remove them. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1485537863-2707-6-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2017-01-28 13:20:06 +01:00
Mauro Carvalho Chehab	78d88e8a3d	edac: rename edac_core.h to edac_mc.h Now, all left at edac_core.h are at drivers/edac/edac_mc.c, so rename it to edac_mc.h. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>	2016-12-15 08:54:51 -02:00
Borislav Petkov	5246c54007	EDAC, amd64: Improve amd64-specific printing macros Prefix the warn and error macros with the respective string so that callers don't have to say "Error" or "Warning". We save us string length this way in the actual calls. While at it, shorten the calls in reserve_mc_sibling_devs(). Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>	2016-12-01 11:35:07 +01:00
Yazen Ghannam	713ad54675	EDAC, amd64: Define and register UMC error decode function How we need to decode UMC errors is different from how we decode bus errors, so let's define a new function for this. We also need a way to determine the UMC channel since we're not guaranteed that there is a fixed relation between channel and MCA bank. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1480359593-80369-1-git-send-email-Yazen.Ghannam@amd.com [ Fold in decode_synd_reg(), simplify. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-29 18:05:48 +01:00
Yazen Ghannam	07ed82ef93	EDAC, amd64: Add Fam17h debug output Read a few more UMC registers and provide debug output in order to be as similar as possible to older AMD systems. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1480344621-14966-1-git-send-email-Yazen.Ghannam@amd.com [ Remove unneeded K8 check and comments, fixup others. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-29 17:16:09 +01:00
Yazen Ghannam	8051c0af3c	EDAC, amd64: Add Fam17h scrubber support Fam17h has new register offsets and fields for setting up the DRAM scrubber so add support for this. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-17-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-28 17:50:12 +01:00
Yazen Ghannam	b64ce7cd7f	EDAC, amd64: Read MC registers on AMD Fam17h Fam17h has a different set of registers and bitfields. Most of these registers are read through SMN (System Management Network) rather than PCI config space. Also, the derivation of various values is now different. Update amd64_edac to read the appropriate registers and extract the correct values for Fam17h. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-12-git-send-email-Yazen.Ghannam@amd.com [ Save us the indentation level in read_mc_regs(), add defines ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-28 17:50:11 +01:00
Yazen Ghannam	936fc3afaa	EDAC, amd64: Reserve correct PCI devices on AMD Fam17h Fam17h needs PCI device functions 0 and 6 instead of 1 and 2 as on older systems. Update struct amd64_pvt to hold the new functions and reserve them if on Fam17h. Also, allocate an array of UMC structs within our newly allocated PVT struct. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-11-git-send-email-Yazen.Ghannam@amd.com [ init_one_instance() error handling, shorten lines, unbreak >80 cols lines. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-28 17:49:40 +01:00
Yazen Ghannam	f1cbbec9fc	EDAC, amd64: Add AMD Fam17h family type and ops Add a family type and associated ops for Fam17h. Define a struct to hold all the UMC registers that we need. Make this a part of struct amd64_pvt in order to maximize code reuse in the rest of the driver. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-10-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-24 21:24:09 +01:00
Yazen Ghannam	196b79fcc8	EDAC, amd64: Extend ecc_enabled() to Fam17h Update the ecc_enabled() function to work on Fam17h. This entails reading a different set of registers and using the SMN (System Management Network) rather than PCI devices. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-9-git-send-email-Yazen.Ghannam@amd.com [ Fixup ecc_en assignment and get_umc_base(). ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-24 21:07:03 +01:00
Borislav Petkov	3f37a36b62	EDAC, amd64_edac: Drop pci_register_driver() use - remove homegrown instances counting. - take F3 PCI device from amd_nb caching instead of F2 which was used with the PCI core. With those changes, the driver doesn't need to register a PCI driver and relies on the northbridges caching which we do anyway on AMD. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com>	2016-05-09 20:41:16 +02:00
Aravind Gopalakrishnan	1a8bc7707e	EDAC, amd64_edac: Update copyright and remove changelog Git provides us all the changelogs anyway. So trim the comments section here. Update the copyrights info while at it. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1443440593-2316-3-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2015-09-29 13:41:04 +02:00
Aravind Gopalakrishnan	da92110dfd	EDAC, amd64_edac: Extend scrub rate support to F15hM60h The scrub rate control register has moved to function 2 in PCI config space and is at a different offset on family 0x15, models 0x60 and later. The minimum recommended scrub rate has also changed. (Refer to D18F2x1c9_dct[1:0][DramScrub] in Fam15hM60h BKDG). Adjust set_scrub_rate() and get_scrub_rate() functions to accommodate this. Tested on F15hM60h, Fam15h, models 00h-0fh and Fam10h systems. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1443440593-2316-2-git-send-email-Aravind.Gopalakrishnan@amd.com [ Cleanup conditionals. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2015-09-29 13:25:33 +02:00
Takashi Iwai	e339f1ec97	EDAC: amd64: Use static attribute groups Instead of calling device_create_file() and device_remove_file() manually, pass the static attribute groups with the new edac_mc_add_mc_with_groups(). The conditional creation of inject sysfs files is done by a proper is_visible callback. Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: http://lkml.kernel.org/r/1423046938-18111-4-git-send-email-tiwai@suse.de Signed-off-by: Borislav Petkov <bp@suse.de>	2015-02-23 13:08:09 +01:00
Aravind Gopalakrishnan	a597d2a5d9	amd64_edac: Add F15h M60h support This patch adds support for ECC error decoding for F15h M60h processor. Aside from the usual changes, the patch adds support for some new features in the processor: - DDR4(unbuffered, registered); LRDIMM DDR3 support - relevant debug messages have been modified/added to report these memory types - new dbam_to_cs mappers - if (F15h M60h && LRDIMM); we need a 'multiplier' value to find cs_size. This multiplier value is obtained from the per-dimm DCSM register. So, change the interface to accept a 'cs_mask_nr' value to facilitate this calculation - switch-casing determine_memory_type() - done to cleanse the function of too many if-else statements and improve readability - This is now called early in read_mc_regs() to cache dram_type Misc cleanup: - amd64_pci_table[] is condensed by using PCI_VDEVICE macro. Testing details: Tested the patch by injecting 'ECC' type errors using mce_amd_inj and error decoding works fine. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1414617483-4941-1-git-send-email-Aravind.Gopalakrishnan@amd.com [ Boris: determine_memory_type() cleanups ] Signed-off-by: Borislav Petkov <bp@suse.de>	2014-10-30 13:42:48 +01:00
Aravind Gopalakrishnan	7981a28f1a	amd64_edac: Modify usage of amd64_read_dct_pci_cfg() Rationale behind this change: - F2x1xx addresses were stopped from being mapped explicitly to DCT1 from F15h (OR) onwards. They use _dct[0:1] mechanism to access the registers. So we should move away from using address ranges to select DCT for these families. - On newer processors, the address ranges used to indicate DCT1 (0x140, 0x1a0) have different meanings than what is assumed currently. Changes introduced: - amd64_read_dct_pci_cfg() now takes in dct value and uses it for 'selecting the dct' - Update usage of the function. Keep in mind that different families have specific handling requirements - Remove [k8\|f10]_read_dct_pci_cfg() as they don't do much different from amd64_read_pci_cfg() - Move the k8 specific check to amd64_read_pci_cfg - Remove f15_read_dct_pci_cfg() and move logic to amd64_read_dct_pci_cfg() - Remove now needless .read_dct_pci_cfg Testing: - Tested on Fam 10h; Fam15h Models: 00h, 30h; Fam16h using 'EDAC_DEBUG' and mce_amd_inj - driver obtains info from F2x registers and caches it in pvt structures correctly - ECC decoding works fine Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1410799058-3149-1-git-send-email-aravind.gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2014-09-23 13:16:05 +02:00

1 2 3

120 Commits