1708 Commits

Author SHA1 Message Date
Qiuxu Zhuo
e951bdaa65 EDAC/skx: Fix overflows on the DRAM row address mapping arrays
[ Upstream commit 71b1e3ba3fed5a34c5fac6d3a15c2634b04c1eb7 ]

The current DRAM row address mapping arrays skx_{open,close}_row[]
only support ranks with sizes up to 16G. Decoding a rank address
to a DRAM row address for a 32G rank by using either one of the
above arrays by the skx_edac driver, will result in an overflow on
the array.

For a 32G rank, the most significant DRAM row address bit (the
bit17) is mapped from the bit34 of the rank address. Add this new
mapping item to both arrays to fix the overflow issue.

Fixes: 4ec656bdf43a ("EDAC, skx_edac: Add EDAC driver for Skylake")
Reported-by: Feng Xu <feng.f.xu@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20230211011728.71764-1-qiuxu.zhuo@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17 11:47:39 +02:00
Manivannan Sadhasivam
76d9ebb7f0 EDAC/qcom: Do not pass llcc_driv_data as edac_device_ctl_info's pvt_info
commit 977c6ba624f24ae20cf0faee871257a39348d4a9 upstream.

The memory for llcc_driv_data is allocated by the LLCC driver. But when
it is passed as the private driver info to the EDAC core, it will get freed
during the qcom_edac driver release. So when the qcom_edac driver gets probed
again, it will try to use the freed data leading to the use-after-free bug.

Hence, do not pass llcc_driv_data as pvt_info but rather reference it
using the platform_data pointer in the qcom_edac driver.

Fixes: 27450653f1db ("drivers: edac: Add EDAC driver support for QCOM SoCs")
Reported-by: Steev Klimaszewski <steev@kali.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s
Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride
Cc: <stable@vger.kernel.org> # 4.20
Link: https://lore.kernel.org/r/20230118150904.26913-4-manivannan.sadhasivam@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01 08:23:23 +01:00
Manivannan Sadhasivam
511f6c7c40 EDAC/device: Respect any driver-supplied workqueue polling value
commit cec669ff716cc83505c77b242aecf6f7baad869d upstream.

The EDAC drivers may optionally pass the poll_msec value. Use that value
if available, else fall back to 1000ms.

  [ bp: Touchups. ]

Fixes: e27e3dac6517 ("drivers/edac: add edac_device class")
Reported-by: Luca Weiss <luca.weiss@fairphone.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s
Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride
Cc: <stable@vger.kernel.org> # 4.9
Link: https://lore.kernel.org/r/COZYL8MWN97H.MROQ391BGA09@otso
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01 08:23:23 +01:00
Miaoqian Lin
329fbd2603 EDAC/highbank: Fix memory leak in highbank_mc_probe()
[ Upstream commit e7a293658c20a7945014570e1921bf7d25d68a36 ]

When devres_open_group() fails, it returns -ENOMEM without freeing memory
allocated by edac_mc_alloc().

Call edac_mc_free() on the error handling path to avoid a memory leak.

  [ bp: Massage commit message. ]

Fixes: a1b01edb2745 ("edac: add support for Calxeda highbank memory controller")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20221229054825.1361993-1-linmq006@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01 08:23:09 +01:00
Eliav Farber
3ca8ef4d91 EDAC/device: Fix period calculation in edac_device_reset_delay_period()
commit e84077437902ec99eba0a6b516df772653f142c7 upstream.

Fix period calculation in case user sets a value of 1000.  The input of
round_jiffies_relative() should be in jiffies and not in milli-seconds.

  [ bp: Use the same code pattern as in edac_device_workq_setup() for
    clarity. ]

Fixes: c4cf3b454eca ("EDAC: Rework workqueue handling")
Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20221020124458.22153-1-farbere@amazon.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18 11:44:58 +01:00
Yang Yingliang
2db53c7059 EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper()
[ Upstream commit 9c8921555907f4d723f01ed2d859b66f2d14f08e ]

As the comment of pci_get_domain_bus_and_slot() says, it returns
a PCI device with refcount incremented, so it doesn't need to
call an extra pci_dev_get() in pci_get_dev_wrapper(), and the PCI
device needs to be put in the error path.

Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20221128065512.3572550-1-yangyingliang@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-14 10:15:18 +01:00
Toshi Kani
c4cd52ab1e EDAC/ghes: Set the DIMM label unconditionally
commit 5e2805d5379619c4a2e3ae4994e73b36439f4bad upstream.

The commit

  cb51a371d08e ("EDAC/ghes: Setup DIMM label from DMI and use it in error reports")

enforced that both the bank and device strings passed to
dimm_setup_label() are not NULL.

However, there are BIOSes, for example on a

  HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/15/2019

which don't populate both strings:

  Handle 0x0020, DMI type 17, 84 bytes
  Memory Device
          Array Handle: 0x0013
          Error Information Handle: Not Provided
          Total Width: 72 bits
          Data Width: 64 bits
          Size: 32 GB
          Form Factor: DIMM
          Set: None
          Locator: PROC 1 DIMM 1        <===== device
          Bank Locator: Not Specified   <===== bank

This results in a buffer overflow because ghes_edac_register() calls
strlen() on an uninitialized label, which had non-zero values left over
from krealloc_array():

  detected buffer overflow in __fortify_strlen
   ------------[ cut here ]------------
   kernel BUG at lib/string_helpers.c:983!
   invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
   CPU: 1 PID: 1 Comm: swapper/0 Tainted: G          I       5.18.6-200.fc36.x86_64 #1
   Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/15/2019
   RIP: 0010:fortify_panic
   ...
   Call Trace:
    <TASK>
    ghes_edac_register.cold
    ghes_probe
    platform_probe
    really_probe
    __driver_probe_device
    driver_probe_device
    __driver_attach
    ? __device_attach_driver
    bus_for_each_dev
    bus_add_driver
    driver_register
    acpi_ghes_init
    acpi_init
    ? acpi_sleep_proc_init
    do_one_initcall

The label contains garbage because the commit in Fixes reallocs the
DIMMs array while scanning the system but doesn't clear the newly
allocated memory.

Change dimm_setup_label() to always initialize the label to fix the
issue. Set it to the empty string in case BIOS does not provide both
bank and device so that ghes_edac_register() can keep the default label
given by edac_mc_alloc_dimms().

  [ bp: Rewrite commit message. ]

Fixes: b9cae27728d1f ("EDAC/ghes: Scan the system once on driver init")
Co-developed-by: Robert Richter <rric@kernel.org>
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Robert Elliott <elliott@hpe.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220719220124.760359-1-toshi.kani@hpe.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03 12:00:50 +02:00
Tyler Hicks
dd2b1d70ef EDAC/dmc520: Don't print an error for each unconfigured interrupt line
[ Upstream commit ad2df24732e8956a45a00894d2163c4ee8fb0e1f ]

The dmc520 driver requires that at least one interrupt line, out of the
ten possible, is configured. The driver prints an error and returns
-EINVAL from its .probe function if there are no interrupt lines
configured.

Don't print a KERN_ERR level message for each interrupt line that's
unconfigured as that can confuse users into thinking that there is an
error condition.

Before this change, the following KERN_ERR level messages would be
reported if only dram_ecc_errc and dram_ecc_errd were configured in the
device tree:

  dmc520 68000000.dmc: IRQ ram_ecc_errc not found
  dmc520 68000000.dmc: IRQ ram_ecc_errd not found
  dmc520 68000000.dmc: IRQ failed_access not found
  dmc520 68000000.dmc: IRQ failed_prog not found
  dmc520 68000000.dmc: IRQ link_err not
  dmc520 68000000.dmc: IRQ temperature_event not found
  dmc520 68000000.dmc: IRQ arch_fsm not found
  dmc520 68000000.dmc: IRQ phy_request not found

Fixes: 1088750d7839 ("EDAC: Add EDAC driver for DMC520")
Reported-by: Sinan Kaya <okaya@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220111163800.22362-1-tyhicks@linux.microsoft.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09 10:21:02 +02:00
Shubhrajyoti Datta
50cbc583fa EDAC/synopsys: Read the error count from the correct register
commit e2932d1f6f055b2af2114c7e64a26dc1b5593d0c upstream.

Currently, the error count is read wrongly from the status register. Read
the count from the proper error count register (ERRCNT).

  [ bp: Massage. ]

Fixes: b500b4a029d5 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220414102813.4468-1-shubhrajyoti.datta@xilinx.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-27 13:53:54 +02:00
Eliav Farber
595c259f75 EDAC: Fix calculation of returned address and next offset in edac_align_ptr()
commit f8efca92ae509c25e0a4bd5d0a86decea4f0c41e upstream.

Do alignment logic properly and use the "ptr" local variable for
calculating the remainder of the alignment.

This became an issue because struct edac_mc_layer has a size that is not
zero modulo eight, and the next offset that was prepared for the private
data was unaligned, causing an alignment exception.

The patch in Fixes: which broke this actually wanted to "what we
actually care about is the alignment of the actual pointer that's about
to be returned." But it didn't check that alignment.

Use the correct variable "ptr" for that.

  [ bp: Massage commit message. ]

Fixes: 8447c4d15e35 ("edac: Do alignment logic properly in edac_align_ptr()")
Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220113100622.12783-2-farbere@amazon.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 12:01:07 +01:00
Sergey Shtylyov
ef2053afd7 EDAC/xgene: Fix deferred probing
commit dfd0dfb9a7cc04acf93435b440dd34c2ca7b4424 upstream.

The driver overrides error codes returned by platform_get_irq_optional()
to -EINVAL for some strange reason, so if it returns -EPROBE_DEFER, the
driver will fail the probe permanently instead of the deferred probing.
Switch to propagating the proper error codes to platform driver code
upwards.

  [ bp: Massage commit message. ]

Fixes: 0d4429301c4a ("EDAC: Add APM X-Gene SoC EDAC driver")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220124185503.6720-3-s.shtylyov@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08 18:30:40 +01:00
Sergey Shtylyov
2a12faf55b EDAC/altera: Fix deferred probing
commit 279eb8575fdaa92c314a54c0d583c65e26229107 upstream.

The driver overrides the error codes returned by platform_get_irq() to
-ENODEV for some strange reason, so if it returns -EPROBE_DEFER, the
driver will fail the probe permanently instead of the deferred probing.
Switch to propagating the proper error codes to platform driver code
upwards.

  [ bp: Massage commit message. ]

Fixes: 71bcada88b0f ("edac: altera: Add Altera SDRAM EDAC support")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220124185503.6720-2-s.shtylyov@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08 18:30:40 +01:00
Dinh Nguyen
f54d8cd831 EDAC/synopsys: Use the quirk for version instead of ddr version
[ Upstream commit bd1d6da17c296bd005bfa656952710d256e77dd3 ]

Version 2.40a supports DDR_ECC_INTR_SUPPORT for a quirk, so use that
quirk to determine a call to setup_address_map().

Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lkml.kernel.org/r/20211012190709.1504152-1-dinguyen@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 10:54:11 +01:00
Yazen Ghannam
6880325382 EDAC/amd64: Handle three rank interleaving mode
[ Upstream commit 9f4873fb6af7966de8fcbd95c36b61351c1c4b1f ]

AMD Rome systems and later support interleaving between three identical
ranks within a channel.

Check for this mode by counting the number of enabled chip selects and
comparing their masks. If there are exactly three enabled chip selects
and their masks are identical, then three rank interleaving is enabled.

The size of a rank is determined from its mask value. However, three
rank interleaving doesn't follow the method of swapping an interleave
bit with the most significant bit. Rather, the interleave bit is flipped
and the most significant bit remains the same. There is only a single
interleave bit in this case.

Account for this when determining the chip select size by keeping the
most significant bit at its original value and ignoring any zero bits.
This will return a full bitmask in [MSB:1].

Fixes: e53a3b267fb0 ("EDAC/amd64: Find Chip Select memory size using Address Mask")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 14:04:06 +01:00
Eric Badger
a5d8d76710 EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
commit 537bddd069c743759addf422d0b8f028ff0f8dbc upstream.

The computation of TOHM is off by one bit. This missed bit results in
too low a value for TOHM, which can cause errors in regular memory to
incorrectly report:

  EDAC MC0: 1 CE Error at MMIOH area, on addr 0x000000207fffa680 on any memory

Fixes: 50d1bb93672f ("sb_edac: add support for Haswell based systems")
Cc: stable@vger.kernel.org
Reported-by: Meeta Saggi <msaggi@purestorage.com>
Signed-off-by: Eric Badger <ebadger@purestorage.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211010170127.848113-1-ebadger@purestorage.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 14:03:45 +01:00
Hans Potsch
a7bd0dd3f2 EDAC/armada-xp: Fix output of uncorrectable error counter
commit d9b7748ffc45250b4d7bcf22404383229bc495f5 upstream.

The number of correctable errors is displayed as uncorrectable
errors because the "SBE" error count is passed to both calls of
edac_mc_handle_error().

Pass the correct uncorrectable error count to the second
edac_mc_handle_error() call when logging uncorrectable errors.

 [ bp: Massage commit message. ]

Fixes: 7f6998a41257 ("ARM: 8888/1: EDAC: Add driver for the Marvell Armada XP SDRAM and L2 cache ECC")
Signed-off-by: Hans Potsch <hans.potsch@nokia.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20211006121332.58788-1-hans.potsch@nokia.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20 11:45:01 +02:00
Borislav Petkov
cc71740ee4 EDAC/dmc520: Assign the proper type to dimm->edac_mode
commit 54607282fae6148641a08d81a6e0953b541249c7 upstream.

dimm->edac_mode contains values of type enum edac_type - not the
corresponding capability flags. Fix that.

Fixes: 1088750d7839 ("EDAC: Add EDAC driver for DMC520")
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210916085258.7544-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-30 10:11:08 +02:00
Sai Krishna Potthuri
9afad85a43 EDAC/synopsys: Fix wrong value type assignment for edac_mode
commit 5297cfa6bdf93e3889f78f9b482e2a595a376083 upstream.

dimm->edac_mode contains values of type enum edac_type - not the
corresponding capability flags. Fix that.

Issue caught by Coverity check "enumerated type mixed with another
type."

 [ bp: Rewrite commit message, add tags. ]

Fixes: ae9b56e3996d ("EDAC, synps: Add EDAC support for zynq ddr ecc controller")
Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com>
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210818072315.15149-1-shubhrajyoti.datta@xilinx.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-30 10:11:08 +02:00
Qiuxu Zhuo
1e1423449d EDAC/i10nm: Fix NVDIMM detection
[ Upstream commit 2294a7299f5e51667b841f63c6d69474491753fb ]

MCDDRCFG is a per-channel register and uses bit{0,1} to indicate
the NVDIMM presence on DIMM slot{0,1}. Current i10nm_edac driver
wrongly uses MCDDRCFG as per-DIMM register and fails to detect
the NVDIMM.

Fix it by reading MCDDRCFG as per-channel register and using its
bit{0,1} to check whether the NVDIMM is populated on DIMM slot{0,1}.

Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Reported-by: Fan Du <fan.du@intel.com>
Tested-by: Wen Jin <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210818175701.1611513-2-tony.luck@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:50:30 +02:00
Smita Koralahalli
8a6c5eec81 EDAC/mce_amd: Do not load edac_mce_amd module on guests
[ Upstream commit 767f4b620edadac579c9b8b6660761d4285fa6f9 ]

Hypervisors likely do not expose the SMCA feature to the guest and
loading this module leads to false warnings. This module should not be
loaded in guests to begin with, but people tend to do so, especially
when testing kernels in VMs. And then they complain about those false
warnings.

Do the practical thing and do not load this module when running as a
guest to avoid all that complaining.

 [ bp: Rewrite commit message. ]

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lkml.kernel.org/r/20210628172740.245689-1-Smita.KoralahalliChannabasappa@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:50:24 +02:00
Luck, Tony
f5a90d44a1 EDAC/Intel: Do not load EDAC driver when running as a guest
[ Upstream commit f0a029fff4a50eb01648810a77ba1873e829fdd4 ]

There's little to no point in loading an EDAC driver running in a guest:
1) The CPU model reported by CPUID may not represent actual h/w
2) The hypervisor likely does not pass in access to memory controller devices
3) Hypervisors generally do not pass corrected error details to guests

Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR
and simply return -ENODEV in the init routine.

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210615174419.GA1087688@agluck-desk2.amr.corp.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:56:00 +02:00
Bixuan Cui
ae281fbbc4 EDAC/ti: Add missing MODULE_DEVICE_TABLE
[ Upstream commit 0a37f32ba5272b2d4ec8c8d0f6b212b81b578f7e ]

The module misses MODULE_DEVICE_TABLE() for of_device_id tables and thus
never autoloads on ID matches.

Add the missing declaration.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tero Kristo <kristo@kernel.org>
Link: https://lkml.kernel.org/r/20210512033727.26701-1-cuibixuan@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:55:57 +02:00
Borislav Petkov
a03583775a EDAC/amd64: Do not load on family 0x15, model 0x13
[ Upstream commit 6c13d7ff81e6d2f01f62ccbfa49d1b8d87f274d0 ]

Those were only laptops and are very very unlikely to have ECC memory.
Currently, when the driver attempts to load, it issues:

  EDAC amd64: Error: F1 not found: device 0x1601 (broken BIOS?)

because the PCI device is the wrong one (it uses the F15h default one).

So do not load the driver on them as that is pointless.

Reported-by: Don Curtis <bugrprt21882@online.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Don Curtis <bugrprt21882@online.de>
Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1179763
Link: https://lkml.kernel.org/r/20201218160622.20146-1-bp@alien8.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07 12:34:08 +01:00
Borislav Petkov
eae95da7fc EDAC/amd64: Fix PCI component registration
commit 706657b1febf446a9ba37dc51b89f46604f57ee9 upstream.

In order to setup its PCI component, the driver needs any node private
instance in order to get a reference to the PCI device and hand that
into edac_pci_create_generic_ctl(). For convenience, it uses the 0th
memory controller descriptor under the assumption that if any, the 0th
will be always present.

However, this assumption goes wrong when the 0th node doesn't have
memory and the driver doesn't initialize an instance for it:

  EDAC amd64: F17h detected (node 0).
  ...
  EDAC amd64: Node 0: No DIMMs detected.

But looking up node instances is not really needed - all one needs is
the pointer to the proper device which gets discovered during instance
init.

So stash that pointer into a variable and use it when setting up the
EDAC PCI component.

Clear that variable when the driver needs to unwind due to some
instances failing init to avoid any registration imbalance.

Cc: <stable@vger.kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:54:11 +01:00
Qiuxu Zhuo
3a881be1b4 EDAC/i10nm: Use readl() to access MMIO registers
commit 83ff51c4e3fecf6b8587ce4d46f6eac59f5d7c5a upstream.

Instead of raw access, use readl() to access MMIO registers of
memory controller to avoid possible compiler re-ordering.

Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:54:11 +01:00
Yazen Ghannam
3897b71e1a EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
[ Upstream commit 8de0c9917cc1297bc5543b61992d5bdee4ce621a ]

The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and
later systems. This function is used in amd64_edac_mod to do
system-specific decoding for DRAM ECC errors. The function takes a
"NodeId" as a parameter.

In AMD documentation, NodeId is used to identify a physical die in a
system. This can be used to identify a node in the AMD_NB code and also
it is used with umc_normaddr_to_sysaddr().

However, the input used for decode_dram_ecc() is currently the NUMA node
of a logical CPU. In the default configuration, the NUMA node and
physical die will be equivalent, so this doesn't have an impact.

But the NUMA node configuration can be adjusted with optional memory
interleaving modes. This will cause the NUMA node enumeration to not
match the physical die enumeration. The mismatch will cause the address
translation function to fail or report incorrect results.

Use struct cpuinfo_x86.cpu_die_id for the node_id parameter to ensure the
physical ID is used.

Fixes: fbe63acf62f5 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-4-Yazen.Ghannam@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:16 +01:00
Linus Torvalds
e6412f9833 EFI changes for v5.10:
- Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree.
 
  - Relax decompressed image placement rules for 32-bit ARM
 
  - Add support for passing MOK certificate table contents via a config table
    rather than a EFI variable.
 
  - Add support for 18 bit DIMM row IDs in the CPER records.
 
  - Work around broken Dell firmware that passes the entire Boot#### variable
    contents as the command line
 
  - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can
    identify it in the memory map listings.
 
  - Don't abort the boot on arm64 if the EFI RNG protocol is available but
    returns with an error
 
  - Replace slashes with exclamation marks in efivarfs file names
 
  - Split efi-pstore from the deprecated efivars sysfs code, so we can
    disable the latter on !x86.
 
  - Misc fixes, cleanups and updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ec9QRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1inTQ//TYj3kJq/7sWfUAxmAsWnUEC005YCNf0T
 x3kJQv3zYX4Rl4eEwkff8S1PrqqvwUP5yUZYApp8HD9s9CYvzz5iG5xtf/jX+QaV
 06JnTMnkoycx2NaOlbr1cmcIn4/cAhQVYbVCeVrlf7QL8enNTBr5IIQmo4mgP8Lc
 mauSsO1XU8ZuMQM+JcZSxAkAPxlhz3dbR5GteP4o2K4ShQKpiTCOfOG1J3FvUYba
 s1HGnhHFlkQr6m3pC+iG8dnAG0YtwHMH1eJVP7mbeKUsMXz944U8OVXDWxtn81pH
 /Xt/aFZXnoqwlSXythAr6vFTuEEn40n+qoOK6jhtcGPUeiAFPJgiaeAXw3gO0YBe
 Y8nEgdGfdNOMih94McRd4M6gB/N3vdqAGt+vjiZSCtzE+nTWRyIXSGCXuDVpkvL4
 VpEXpPINnt1FZZ3T/7dPro4X7pXALhODE+pl36RCbfHVBZKRfLV1Mc1prAUGXPxW
 E0MfaM9TxDnVhs3VPWlHmRgavee2MT1Tl/ES4CrRHEoz8ZCcu4MfROQyao8+Gobr
 VR+jVk+xbyDrykEc6jdAK4sDFXpTambuV624LiKkh6Mc4yfHRhPGrmP5c5l7SnCd
 aLp+scQ4T7sqkLuYlXpausXE3h4sm5uur5hNIRpdlvnwZBXpDEpkzI8x0C9OYr0Q
 kvFrreQWPLQ=
 =ZNI8
 -----END PGP SIGNATURE-----

Merge tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI changes from Ingo Molnar:

 - Preliminary RISC-V enablement - the bulk of it will arrive via the
   RISCV tree.

 - Relax decompressed image placement rules for 32-bit ARM

 - Add support for passing MOK certificate table contents via a config
   table rather than a EFI variable.

 - Add support for 18 bit DIMM row IDs in the CPER records.

 - Work around broken Dell firmware that passes the entire Boot####
   variable contents as the command line

 - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we
   can identify it in the memory map listings.

 - Don't abort the boot on arm64 if the EFI RNG protocol is available
   but returns with an error

 - Replace slashes with exclamation marks in efivarfs file names

 - Split efi-pstore from the deprecated efivars sysfs code, so we can
   disable the latter on !x86.

 - Misc fixes, cleanups and updates.

* tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  efi: mokvar: add missing include of asm/early_ioremap.h
  efi: efivars: limit availability to X86 builds
  efi: remove some false dependencies on CONFIG_EFI_VARS
  efi: gsmi: fix false dependency on CONFIG_EFI_VARS
  efi: efivars: un-export efivars_sysfs_init()
  efi: pstore: move workqueue handling out of efivars
  efi: pstore: disentangle from deprecated efivars module
  efi: mokvar-table: fix some issues in new code
  efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure
  efivarfs: Replace invalid slashes with exclamation marks in dentries.
  efi: Delete deprecated parameter comments
  efi/libstub: Fix missing-prototypes in string.c
  efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it
  cper,edac,efi: Memory Error Record: bank group/address and chip id
  edac,ghes,cper: Add Row Extension to Memory Error Record
  efi/x86: Add a quirk to support command line arguments on Dell EFI firmware
  efi/libstub: Add efi_warn and *_once logging helpers
  integrity: Load certs from the EFI MOK config table
  integrity: Move import of MokListRT certs to a separate routine
  efi: Support for MOK variable config table
  ...
2020-10-12 13:26:49 -07:00
Linus Torvalds
ca1b66922a * Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory by
 sending them a SIGBUS on return to user space and umapping the faulty
 memory, by Tony Luck and Youquan Song.
 
 * memcpy_mcsafe() rework by splitting the functionality into
 copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
 support for new hardware which can recover from a machine check
 encountered during a fast string copy and makes that the default and
 lets the older hardware which does not support that advance recovery,
 opt in to use the old, fragile, slow variant, by Dan Williams.
 
 * New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
 
 * Do not use MSR-tracing accessors in #MC context and flag any fault
 while accessing MCA architectural MSRs as an architectural violation
 with the hope that such hw/fw misdesigns are caught early during the hw
 eval phase and they don't make it into production.
 
 * Misc fixes, improvements and cleanups, as always.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe
 VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug
 sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR
 H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp
 D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb
 iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5
 V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D
 VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc
 kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav
 lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t
 fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya
 whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA=
 =u1Wg
 -----END PGP SIGNATURE-----

Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Extend the recovery from MCE in kernel space also to processes which
   encounter an MCE in kernel space but while copying from user memory
   by sending them a SIGBUS on return to user space and umapping the
   faulty memory, by Tony Luck and Youquan Song.

 - memcpy_mcsafe() rework by splitting the functionality into
   copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
   support for new hardware which can recover from a machine check
   encountered during a fast string copy and makes that the default and
   lets the older hardware which does not support that advance recovery,
   opt in to use the old, fragile, slow variant, by Dan Williams.

 - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.

 - Do not use MSR-tracing accessors in #MC context and flag any fault
   while accessing MCA architectural MSRs as an architectural violation
   with the hope that such hw/fw misdesigns are caught early during the
   hw eval phase and they don't make it into production.

 - Misc fixes, improvements and cleanups, as always.

* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
  x86/mce: Decode a kernel instruction to determine if it is copying from user
  x86/mce: Recover from poison found while copying from user space
  x86/mce: Avoid tail copy when machine check terminated a copy from user
  x86/mce: Add _ASM_EXTABLE_CPY for copy user access
  x86/mce: Provide method to find out the type of an exception handler
  x86/mce: Pass pointer to saved pt_regs to severity calculation routines
  x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
  x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
  x86/mce: Add Skylake quirk for patrol scrub reported errors
  RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
  x86/mce: Annotate mce_rd/wrmsrl() with noinstr
  x86/mce/dev-mcelog: Do not update kflags on AMD systems
  x86/mce: Stop mce_reign() from re-computing severity for every CPU
  x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
  x86/mce: Increase maximum number of banks to 64
  x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
  x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
  RAS/CEC: Fix cec_init() prototype
2020-10-12 10:14:38 -07:00
Linus Torvalds
a9a4b7d9a6 * Add Amazon's Annapurna Labs memory controller EDAC driver, by Talel
Shenhar.
 
 * New AMD CPUs support, by Yazen Ghannam.
 
 * The usual misc fixes and cleanups all over the subsystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EHJoACgkQEsHwGGHe
 VUoZwQ//XrPtvR/ADQpFnh++WwcmbbbqA6yruBy8zyCVHveSG9dukl4pITO21CGQ
 QGvOGdV3aEwDNHyrErqzWEK2o8RF28LYeTpRtaRDdH7ozk7Ua8Jqxoj7JSijz2Mv
 iMh5Kn0v3+EcOMckbFBBjwf+yIPUw7/HwHDqsQ9K+F7drplnkRxU6mDyurYTTKJJ
 Y00iOTp/PY7rZeHXmQAgbxvmDfv6/Xoo15ZKtzkHGtG7xW6y3n2NU2tegZizI7kA
 LDx0g4lmZQZOsv2DR3ZpeLclsq9pYWCYga3P4+DJkVAAdJJ4qtk+XyKFMzyMhVOa
 AyDJArSRXZsUFzbaMe1hlGknrSeXDTJHlZ7FjlpJHAudAohZlNCOEIS0VQzXwYjN
 0FeR9rjinhaSOkMW1S3bBKssciCEs93a0oMASvHbzy4NGAM97YyRDZof1Iu3/jWX
 20YRkSXyFhqSSRznJUh58kKk1f85B2ikySB4b3mZatPEiDJvF3I2KS/j3BFKoYE6
 9JT+eP1+4FyEcobG6e8VE8BZw7+sw0A6eYisAbCLCV5oaQMdR946BEXZQKaFHDib
 zKaveB4I/xbiLghYl61mMNSPtGD4czs+JR7/QEP5Q40VvwNJKjh1oHkt7hZQAgQQ
 yv7JkstTpHQK/B8V6exEfBbCXQekItsVSrryPPW4gfmqncgyEVU=
 =ROcp
 -----END PGP SIGNATURE-----

Merge tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add Amazon's Annapurna Labs memory controller EDAC driver (Talel
   Shenhar)

 - New AMD CPUs support (Yazen Ghannam)

 - The usual misc fixes and cleanups all over the subsystem

* tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
  EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
  EDAC/aspeed: Use module_platform_driver() to simplify
  EDAC, sb_edac: Simplify switch statement
  EDAC/ti: Fix handling of platform_get_irq() error
  EDAC/aspeed: Fix handling of platform_get_irq() error
  EDAC/i5100: Fix error handling order in i5100_init_one()
  EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara
  EDAC/socfpga: Transfer SoCFPGA EDAC maintainership
  EDAC/thunderx: Make symbol lmc_dfs_ents static
  EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
  dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding
  EDAC/mce_amd: Add new error descriptions for existing types
  EDAC: Replace HTTP links with HTTPS ones
2020-10-12 10:12:26 -07:00
Borislav Petkov
1dc32628d6 Merge branch 'edac-drivers' into edac-updates-for-v5.10
Signed-off-by: Borislav Petkov <bp@suse.de>
2020-10-12 11:05:42 +02:00
Yazen Ghannam
b4210eab91 EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models
70h-7Fh. The same family ops and number of channels also apply.

Use the Family17h Model 70h family_type and ops for Family 19h Models
20h-2Fh. Update the controller name to match the system.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
2020-10-09 19:28:14 +02:00
Xiongfeng Wang
e6bbde8b2b EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
Reading those sysfs entries gives:

  [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/max_location
  memory 3 [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/dimm0/dimm_location
  memory 0 [root@localhost /]#

Add newlines after the value it prints for better readability.

  [ bp: Make len a signed int and change the check to catch wraparound.
    Increment the pointer p only when the length check passes. Use
    scnprintf(). ]

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1600051734-8993-1-git-send-email-wangxiongfeng2@huawei.com
2020-09-18 09:14:01 +02:00
Liu Shixin
07def58717 EDAC/aspeed: Use module_platform_driver() to simplify
Use module_platform_driver() which makes the code simpler by eliminating
boilerplate code.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lkml.kernel.org/r/20200914065358.3726216-1-liushixin2@huawei.com
2020-09-18 09:14:01 +02:00
Alex Kluver
612b5d506d cper,edac,efi: Memory Error Record: bank group/address and chip id
Updates to the UEFI 2.8 Memory Error Record allow splitting the bank field
into bank address and bank group, and using the last 3 bits of the extended
field as a chip identifier.

When needed, print correct version of bank field, bank group, and chip
identification.

Based on UEFI 2.8 Table 299. Memory Error Record.

Signed-off-by: Alex Kluver <alex.kluver@hpe.com>
Reviewed-by: Russ Anderson <russ.anderson@hpe.com>
Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-3-alex.kluver@hpe.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-17 10:19:52 +03:00
Alex Kluver
9baf68cc45 edac,ghes,cper: Add Row Extension to Memory Error Record
Memory errors could be printed with incorrect row values since the DIMM
size has outgrown the 16 bit row field in the CPER structure. UEFI
Specification Version 2.8 has increased the size of row by allowing it to
use the first 2 bits from a previously reserved space within the structure.

When needed, add the extension bits to the row value printed.

Based on UEFI 2.8 Table 299. Memory Error Record

Signed-off-by: Alex Kluver <alex.kluver@hpe.com>
Tested-by: Russ Anderson <russ.anderson@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-2-alex.kluver@hpe.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-17 10:19:52 +03:00
Borislav Petkov
251c54ea26 EDAC/ghes: Check whether the driver is on the safe list correctly
With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe,
unregister and probe again a driver.

When ghes_edac is attempted to be loaded on a system which is not on
the safe platforms list, ghes_edac_register() would return early. The
unregister counterpart ghes_edac_unregister() would still attempt to
unregister and exit early at the refcount test, leading to the refcount
underflow below.

In order to not do *anything* on the unregister path too, reuse the
force_load parameter and check it on that path too, before fumbling with
the refcount.

  ghes_edac: ghes_edac_register: entry
  ghes_edac: ghes_edac_register: return -ENODEV
  ------------[ cut here ]------------
  refcount_t: underflow; use-after-free.
  WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100
  Modules linked in:
  CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:refcount_warn_saturate+0xb9/0x100
  Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c
  RSP: 0018:ffffc90000037d58 EFLAGS: 00010292
  RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff
  RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000
  R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0
  Call Trace:
   ghes_edac_unregister
   ghes_remove
   platform_drv_remove
   really_probe
   driver_probe_device
   device_driver_attach
   __driver_attach
   ? device_driver_attach
   ? device_driver_attach
   bus_for_each_dev
   bus_add_driver
   driver_register
   ? bert_init
   ghes_init
   do_one_initcall
   ? rcu_read_lock_sched_held
   kernel_init_freeable
   ? rest_init
   kernel_init
   ret_from_fork
   ...
  ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic
2020-09-15 09:42:15 +02:00
Borislav Petkov
cd8100f1f3 EDAC/ghes: Clear scanned data on unload
Commit

  b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")

didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:

  d = &hw->dimms[hw->num_dimms];

which is, of course, NULL:

  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:enumerate_dimms.cold+0x7b/0x375

Reset the whole ghes_hw on driver unregister so that no stale values are
used on a second system scan.

Fixes: b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
2020-09-15 09:41:28 +02:00
Tom Rix
fbd4ab7802 EDAC, sb_edac: Simplify switch statement
clang static analyzer reports this problem

sb_edac.c:959:2: warning: Undefined or garbage value
  returned to caller
        return type;
        ^~~~~~~~~~~

This is a false positive.

However by initializing the type to DEV_UNKNOWN the 3 case can be
removed from the switch, saving a comparison and jump.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com
2020-09-08 14:56:17 -07:00
Krzysztof Kozlowski
66077adb70 EDAC/ti: Fix handling of platform_get_irq() error
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.

 [ bp: Massage commit message. ]

Fixes: 86a18ee21e5e ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tero Kristo <t-kristo@ti.com>
Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org
2020-09-01 20:43:20 +02:00
Krzysztof Kozlowski
afce699694 EDAC/aspeed: Fix handling of platform_get_irq() error
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.

 [ bp: Massage commit message. ]

Fixes: 9b7e6242ee4e ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Stefan Schaeckeler <schaecsn@gmx.net>
Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org
2020-09-01 20:41:27 +02:00
Dinghao Liu
857a3139bd EDAC/i5100: Fix error handling order in i5100_init_one()
When pci_get_device_func() fails, the driver doesn't need to execute
pci_dev_put(). mci should still be freed, though, to prevent a memory
leak. When pci_enable_device() fails, the error injection PCI device
"einj" doesn't need to be disabled either.

 [ bp: Massage commit message, rename label to "bail_mc_free". ]

Fixes: 52608ba205461 ("i5100_edac: probe for device 19 function 0")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn
2020-09-01 12:10:19 +02:00
Linus Torvalds
42df60fcdf A fix to properly clear ghes_edac driver state on driver remove so that
a subsequent load can probe the system properly; by Shiju Jose.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl9LTFoACgkQEsHwGGHe
 VUpAxQ//YXxtNRrGxTQa6TGcHE0wRif/EkA20O54zWGPt+kolW9mXmsUFweDi0PH
 U6EW1To9H7vukfynCxc+TFGNG8oh3B6XXhaMILsNi691m418Z52a+wOHGaf/qLpX
 fqJe3WIIvo97cFtJJJvdHW90NaswtIo0TUspYqUiyThjwI3a9XXmY6Vh7odtVQaP
 M/tYt90By60KAR57zROifo6binSOO45zCOJiEO6Qm8X9UphXhShv6oL4TQvlEEwf
 ydWSz5NZwPnnVPk4MATMIQoptrixNaDXAfmh2kLBuNNmeNLCHKGous9C7ErI+Pqo
 ZH5oECXGOvJELRL3CJ/6fgb9lKaUg9z0IILsRhmoSqj6AYNcDAreWyi4wjJTaRTs
 rGNzYGHgB1tlWIpFxHBoxl3u/0EaRatkfvHYnY/i9LlmbIcNQgS+vBxV8LNyCRnk
 uxzJDIc8adPx+4RwmzQrnjwN6A52d50JfhVXakQykTWxTaKFg2qkEDUgpEJ4bMCr
 i77H1ytr+nGuR4BL9mferTfZqSWZULiQkeDV+skKAx8sSryOrKloucDXJYUzWFDU
 6MfKi4iqfNH+w0LQxDE4cRGfOID1tEz7WWI1ROflI17sD2zW+/WkgYoHj7nD2Vwy
 xm5DD91Jst4rdDd53PqJhiB3sz/V9VGlDL7o+BsYzYagFP9pyfc=
 =WMYq
 -----END PGP SIGNATURE-----

Merge tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:
 "A fix to properly clear ghes_edac driver state on driver remove so
  that a subsequent load can probe the system properly (Shiju Jose)"

* tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
2020-08-30 10:47:23 -07:00
Shiju Jose
b972fdba86 EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
After

  b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init")

and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes
a NULL pointer after the second ->probe() (aka ghes_edac_register())
which the config option causes to be called.

This happens because the static variable which holds down whether
the system has been scanned already, doesn't get reset in
ghes_edac_unregister(). Then, on the second probe, ghes_scan_system()
doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining
NULL.

Clear the variable and rename it to something more descriptive so that a
second probe succeeds.

 [ bp: Rewrite commit message. ]

Fixes: b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init")
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
2020-08-27 18:04:07 +02:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Yazen Ghannam
368d188720 x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type
was intended to be used by the kernel to filter out invalid error codes
on a system. However, this is unnecessary after a few product releases
because the hardware will only report valid error codes. Thus, there's
no need for it with future systems.

Remove the xec_bitmap field and all references to it.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com
2020-08-20 10:34:38 +02:00
Tony Luck
45bc6098a3 EDAC/{i7core,sb,pnd2,skx}: Fix error event severity
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.

Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
2020-08-18 15:40:30 +02:00
Wei Yongjun
bd17e0b771 EDAC/thunderx: Make symbol lmc_dfs_ents static
Symbol 'lmc_dfs_ents' is not used outside of thunderx_edac.c, so
make it static:

  drivers/edac/thunderx_edac.c:457:22: warning:
   symbol 'lmc_dfs_ents' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Robert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200714142308.46612-1-weiyongjun1@huawei.com
2020-08-17 10:35:46 +02:00
Talel Shenhar
e23a7cdeb3 EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability
for error detection and correction (Single bit error correction, Double
detection). This driver introduces EDAC driver for that capability.

 [ bp: Remove "EDAC" string from Kconfig tristate as it is redundant. ]

Signed-off-by: Talel Shenhar <talel@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Link: https://lkml.kernel.org/r/20200816185551.19108-3-talel@amazon.com
2020-08-17 10:10:29 +02:00
Yazen Ghannam
dc7a8476cf EDAC/mce_amd: Add new error descriptions for existing types
A few existing MCA bank types will have new error types in future SMCA
systems.

Add the descriptions for the new error types.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200708153515.1911642-1-Yazen.Ghannam@amd.com
2020-08-17 09:36:00 +02:00
Alexander A. Klimov
7d4c1ea2be EDAC: Replace HTTP links with HTTPS ones
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

  Deterministic algorithm:
  For each file:
    If not .svg:
      For each line:
        If doesn't contain `\bxmlns\b`:
          For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
  	  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
              If both the HTTP and HTTPS versions
              return 200 OK and serve the same content:
                Replace HTTP with HTTPS.

 [ bp: Merge all EDAC patches into a single one. ]

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac
Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de
2020-08-17 09:31:19 +02:00