ABI: sysfs-mce: add a new ABI file
Reduce the gap of missing ABIs for Intel servers with MCE
by adding a new ABI file.

The contents of this file come from:
  Documentation/x86/x86_64/machinecheck.rst

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/801a26985e32589eb78ba4b728d3e19fdea18f04.1632994837.git.mchehab+huawei@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
parent df2205de92
commit edfc8730ba
107
Documentation/ABI/testing/sysfs-mce
Normal file
@@ -0,0 +1,107 @@
What:           /sys/devices/system/machinecheck/machinecheckX/
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                (X = CPU number)

                Machine checks report internal hardware error conditions
                detected by the CPU. Uncorrected errors typically cause a
                machine check (often with panic), corrected ones cause a
                machine check log entry.

                For more details about the x86 machine check architecture
                see the Intel and AMD architecture manuals from their
                developer websites.

                For more details about the architecture
                see http://one.firstfloor.org/~andi/mce.pdf

                Each CPU has its own directory.

What:           /sys/devices/system/machinecheck/machinecheckX/bank<Y>
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                (Y = bank number)

                64-bit hex bitmask enabling/disabling specific subevents for
                bank Y.

                When a bit in the bitmask is zero, the respective
                subevent will not be reported.

                By default all events are enabled.

                Note that the BIOS maintains another mask to disable specific
                events per bank. That mask is not visible here.
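
The following sketch is not part of the patch. It illustrates how the bank bitmask described above could be interpreted from user space. It assumes the attribute holds a single hexadecimal number, with or without a `0x` prefix; the helper names `enabled_subevents` and `clear_subevent` are hypothetical.

```python
def enabled_subevents(mask_text: str, width: int = 64) -> list[int]:
    """Return the bit positions that are set in a bank mask given as hex text.

    A set bit means the corresponding subevent is reported; a zero bit
    means it is suppressed.
    """
    mask = int(mask_text.strip(), 16)
    if mask < 0 or mask >= 1 << width:
        raise ValueError(f"mask does not fit in {width} bits: {mask_text!r}")
    return [bit for bit in range(width) if (mask >> bit) & 1]


def clear_subevent(mask_text: str, bit: int) -> str:
    """Return a new hex mask with one subevent bit cleared (disabled)."""
    mask = int(mask_text.strip(), 16) & ~(1 << bit)
    return format(mask, "x")
```

For example, `clear_subevent("ffffffffffffffff", 3)` yields a mask in which only subevent 3 is disabled. Writing the result back to the sysfs file would require root privileges and is deliberately left out of the sketch.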

What:           /sys/devices/system/machinecheck/machinecheckX/check_interval
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                The entries appear for each CPU, but they are truly shared
                between all CPUs.

                How often to poll for corrected machine check errors, in
                seconds (note that the output is hexadecimal). The default
                is 5 minutes.

                When the poller finds MCEs it triggers an exponential speedup
                (poll more often) on the polling interval. When the poller
                stops finding MCEs, it triggers an exponential backoff
                (poll less often) on the polling interval. The check_interval
                variable is both the initial and maximum polling interval.

                0 means no polling for corrected machine check errors
                (but some corrected errors might still be reported
                in other ways).
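
The following sketch is not part of the patch. It shows how the hexadecimal check_interval value could be parsed and how the speedup/backoff behaviour described above might look as a simple model. The factor of 2 and the lower bound `floor` are assumptions made for illustration; the text only states that the adjustment is exponential and that check_interval is the initial and maximum interval. The function names are hypothetical.

```python
def parse_check_interval(text: str) -> int:
    """Convert the hexadecimal sysfs output to seconds."""
    return int(text.strip(), 16)


def next_interval(current: float, found_mce: bool,
                  check_interval: int, floor: float = 0.01) -> float:
    """Model one step of the adaptive polling interval, in seconds.

    Assumptions (not stated in the ABI text): the interval is halved when
    MCEs were found, doubled otherwise, and never drops below `floor`.
    """
    if check_interval == 0:
        raise ValueError("check_interval of 0 means polling is disabled")
    if found_mce:
        return max(current / 2, floor)        # speedup: poll more often
    return min(current * 2, check_interval)   # backoff, capped at the maximum
```

With the documented default of 5 minutes, the file would read `12c`, which `parse_check_interval` turns into 300 seconds.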

What:           /sys/devices/system/machinecheck/machinecheckX/tolerant
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                The entries appear for each CPU, but they are truly shared
                between all CPUs.

                Tolerance level. When a machine check exception occurs for an
                uncorrected machine check, the kernel can take different
                actions.

                Since machine check exceptions can happen at any time, it is
                sometimes risky for the kernel to kill a process because it
                defies normal kernel locking rules. The tolerance level
                configures how hard the kernel tries to recover even at some
                risk of deadlock. Higher tolerant values trade potentially
                better uptime for the risk of a crash or even corruption
                (for tolerant >= 3).

                ==  ===========================================================
                 0  always panic on uncorrected errors, log corrected errors
                 1  panic or SIGBUS on uncorrected errors, log corrected errors
                 2  SIGBUS or log uncorrected errors, log corrected errors
                 3  never panic or SIGBUS, log all errors (for testing only)
                ==  ===========================================================

                Default: 1

                Note this only makes a difference if the CPU allows recovery
                from a machine check exception. Current x86 CPUs generally
                do not.
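
The following sketch is not part of the patch. It encodes the tolerant table above as a lookup so that a monitoring script could print what a given setting means. Treating every value of 3 or more like level 3 is an assumption based on the "tolerant >= 3" wording; the names `TOLERANT_LEVELS` and `describe_tolerant` are hypothetical.

```python
# Policy descriptions copied from the table in the ABI entry.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}

DEFAULT_TOLERANT = 1  # documented default


def describe_tolerant(level: int) -> str:
    """Return the documented policy for a tolerant level.

    Assumption: values above 3 behave like level 3.
    """
    if level < 0:
        raise ValueError(f"tolerant level must not be negative: {level}")
    return TOLERANT_LEVELS[min(level, 3)]
```

A caller would typically read the sysfs file, convert the text to an integer, and pass it to `describe_tolerant`.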

What:           /sys/devices/system/machinecheck/machinecheckX/trigger
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                The entries appear for each CPU, but they are truly shared
                between all CPUs.

                Program to run when a machine check event is detected.
                This is an alternative to running mcelog regularly from cron
                and allows events to be detected faster.

What:           /sys/devices/system/machinecheck/machinecheckX/monarch_timeout
Contact:        Andi Kleen <ak@linux.intel.com>
Date:           Feb, 2007
Description:
                How long to wait for the other CPUs to machine check too on an
                exception. 0 disables waiting for other CPUs.

                Unit: us
@@ -21,60 +21,8 @@ from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

bankNctl
        (N = bank number)

        64-bit hex bitmask enabling/disabling specific subevents for bank N
        When a bit in the bitmask is zero, the respective
        subevent will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific events
        per bank. That mask is not visible here.

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (note that the output is hexadecimal). The default is 5 minutes.
        When the poller finds MCEs it triggers an exponential speedup (poll
        more often) on the polling interval. When the poller stops finding
        MCEs, it triggers an exponential backoff (poll less often) on the
        polling interval. The check_interval variable is both the initial and
        maximum polling interval. 0 means no polling for corrected machine
        check errors (but some corrected errors might still be reported
        in other ways).

tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check, the kernel can take different actions.
        Since machine check exceptions can happen at any time, it is sometimes
        risky for the kernel to kill a process because it defies
        normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        for the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.
monarch_timeout
        How long to wait for the other CPUs to machine check too on an
        exception. 0 disables waiting for other CPUs.
        Unit: us
The directory contains some configurable entries. See
Documentation/ABI/testing/sysfs-mce for more details.

TBD document entries for AMD threshold interrupt configuration

@@ -20353,6 +20353,8 @@ M: Tony Luck <tony.luck@intel.com>
M: Borislav Petkov <bp@alien8.de>
L: linux-edac@vger.kernel.org
S: Maintained
F: Documentation/ABI/testing/sysfs-mce
F: Documentation/x86/x86_64/machinecheck.rst
F: arch/x86/kernel/cpu/mce/*

X86 MICROCODE UPDATE SUPPORT