linux/drivers/base
Jiaqi Yan 44b8f8bf24 mm: memory-failure: add memory failure stats to sysfs
Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators.  The statistics include
2 categories:

* Memory error statistics, for example, how many memory error are
  encountered, how many of them are recovered by the kernel.  Note these
  memory errors are non-fatal to kernel: during the machine check
  exception (MCE) handling kernel already classified MCE's severity to be
  unnecessary to panic (but either action required or optional).

* Scanner statistics, for example how many times the scanner have fully
  scanned a NUMA node, how many errors are first detected by the scanner.

The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.

Motivation
==========

Memory error stats are important to userspace but insufficient in kernel
today.  Datacenter administrators can better monitor a machine's memory
health with the visible stats.  For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer.  Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action. 
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.

Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
  capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory
error detector.  Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO).  Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner.  Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector.  It exposes the following per NUMA node memory error
counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log.  The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.

The 1st commit introduces these sysfs entries.  The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery.  The 3rd commit adds documentations for introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6


This patch (of 3):

Today kernel provides following memory error info to userspace, but each
has its own disadvantage

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
  not per NUMA node stats though

* ras:memory_failure_event: only available after explicitly enabled

* /dev/mcelog provides many useful info about the MCEs, but
  doesn't capture how memory_failure recovered memory MCEs

* kernel logs: userspace needs to process log text

Exposes per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively.  The following math
holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
..
firmware_loader Kbuild updates for v6.2 2022-12-19 12:33:32 -06:00
power Power management updates for 6.2-rc1 2022-12-12 13:19:07 -08:00
regmap regmap: Merge fix for where we get the number of registers from 2022-12-12 11:50:58 +00:00
test device property: Add a blank line in Kconfig of tests 2022-11-23 19:35:31 +01:00
arch_numa.c mm: percpu: add generic pcpu_populate_pte() function 2022-01-20 08:52:52 +02:00
arch_topology.c RISC-V Patches for the 6.1 Merge Window, Part 1 2022-10-09 13:24:01 -07:00
attribute_container.c
auxiliary.c Documentation/auxiliary_bus: Move the text into the code 2021-12-03 16:41:50 +01:00
base.h driver core: mark driver_allows_async_probing static 2022-11-10 18:31:04 +01:00
bus.c kobject: kset_uevent_ops: make filter() callback take a const * 2022-11-22 17:34:46 +01:00
cacheinfo.c cacheinfo: Remove of_node_put() for fw_token 2022-11-23 20:03:51 +01:00
class.c kobject: make kobject_get_ownership() take a constant kobject * 2022-11-22 17:34:29 +01:00
component.c component: Add common helper for compare/release functions 2022-02-25 12:16:12 +01:00
container.c
core.c kobject: kset_uevent_ops: make name() callback take a const * 2022-11-22 17:34:52 +01:00
cpu.c x86/bugs: Report AMD retbleed vulnerability 2022-06-27 10:33:59 +02:00
dd.c driver core: Fix bus_type.match() error handling in __driver_attach() 2022-11-10 18:36:04 +01:00
devcoredump.c devcoredump : Serialize devcd_del work 2022-09-24 14:01:40 +02:00
devres.c devres: Use kmalloc_size_roundup() to match ksize() usage 2022-11-09 15:11:46 +01:00
devtmpfs.c devtmpfs: fix the dangling pointer of global devtmpfsd thread 2022-06-27 16:41:13 +02:00
driver.c driver core: fix driver_set_override() issue with empty strings 2022-09-05 13:01:34 +02:00
firmware.c
hypervisor.c
init.c init: Initialize noop_backing_dev_info early 2022-06-16 10:55:57 +02:00
isa.c bus: Make remove callback return void 2021-07-21 11:53:42 +02:00
Kconfig devtmpfs: mount with noexec and nosuid 2021-12-30 13:54:42 +01:00
Makefile genirq: Get rid of GENERIC_MSI_IRQ_DOMAIN 2022-11-17 15:15:20 +01:00
map.c driver: base: Prefer unsigned int to bare use of unsigned 2021-07-21 17:30:09 +02:00
memory.c mm/hwpoison: introduce per-memory_block hwpoison counter 2022-11-08 17:37:22 -08:00
module.c
node.c mm: memory-failure: add memory failure stats to sysfs 2023-02-02 22:33:28 -08:00
physical_location.c driver core: location: Add "back" as a possible output for panel 2022-05-19 19:28:32 +02:00
physical_location.h driver core: Add sysfs support for physical location of a device 2022-04-27 09:51:57 +02:00
pinctrl.c
platform-msi.c platform-msi: Switch to the domain id aware MSI interfaces 2022-12-05 19:21:00 +01:00
platform.c platform: use fwnode_irq_get_byname instead of of_irq_get_byname to get irq 2022-11-10 18:56:47 +01:00
property.c device property: Fix documentation for fwnode_get_next_parent() 2022-12-07 17:22:44 +01:00
soc.c base: soc: Make soc_device_match() simpler and easier to read 2022-03-18 14:28:07 +01:00
swnode.c software node: fix wrong node passed to find nargs_prop 2021-12-22 18:26:18 +01:00
syscore.c
topology.c drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist 2022-07-15 17:36:33 +02:00
trace.c devres: Enable trace events 2021-06-15 17:14:36 +02:00
trace.h devres: Enable trace events 2021-06-15 17:14:36 +02:00
transport_class.c