Currently unpoison_memory(unsigned long pfn) is designed for soft
poison (hwpoison-inject) only. Since 17fae1294ad9d, the KPTE gets
cleared on an x86 platform once hardware memory corrupts.

Unpoisoning a hardware corrupted page only puts the page back into the
buddy allocator, so the kernel has a chance to access the page through
a *NOT PRESENT* KPTE. This leads to a BUG when the corrupted KPTE is
accessed.

As suggested by David & Naoya, disable the unpoison mechanism when a
real HW error happens to avoid a BUG like this:

  Unpoison: Software-unpoisoned page 0x61234
  BUG: unable to handle page fault for address: ffff888061234000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
  Oops: 0002 [#1] PREEMPT SMP NOPTI
  CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0.bm.1-amd64 #7
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  RIP: 0010:clear_page_erms+0x7/0x10
  Code: ...
  RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
  RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
  RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
  R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
  R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
  FS: 00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <TASK>
   prep_new_page+0x151/0x170
   get_page_from_freelist+0xca0/0xe20
   ? sysvec_apic_timer_interrupt+0xab/0xc0
   ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
   __alloc_pages+0x17e/0x340
   __folio_alloc+0x17/0x40
   vma_alloc_folio+0x84/0x280
   __handle_mm_fault+0x8d4/0xeb0
   handle_mm_fault+0xd5/0x2a0
   do_user_addr_fault+0x1d0/0x680
   ? kvm_read_and_reset_apf_flags+0x3b/0x50
   exc_page_fault+0x78/0x170
   asm_exc_page_fault+0x27/0x30

Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
Fixes: 847ce401df392 ("HWPOISON: Add unpoisoning support")
Fixes: 17fae1294ad9d ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
185 lines
5.9 KiB
ReStructuredText
.. hwpoison:

========
hwpoison
========

What is hwpoison?
=================

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment::

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focusses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.

For the KVM use there was a need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

Failure recovery modes
======================

There are two (actually three) modes memory failure recovery can be in:

vm.memory_failure_recovery sysctl set to zero:
    All memory failures cause a panic. Do not attempt recovery.

early kill
    (can be controlled globally and per process)
    Send SIGBUS to the application as soon as the error is detected.
    This allows applications that can process memory errors to handle
    them in a gentle way (e.g. drop the affected object).
    This is the mode used by KVM qemu.

late kill
    Send SIGBUS when the application runs into the corrupted page.
    This is best for applications that are unaware of memory errors,
    and it is the default.
    Note that some pages are always handled as late kill.
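
As a rough illustration of how the modes above relate to the two sysctls, the following sketch reads them from procfs and names the resulting system-wide default. It assumes the sysctls are exposed under ``/proc/sys/vm`` (which requires a kernel built with memory failure support); ``read_sysctl`` and ``describe_policy`` are hypothetical helper names, not kernel interfaces, and a per-process setting made with prctl can still override the result.

```python
from pathlib import Path

# Assumed procfs location of the vm.* sysctls named in this document.
SYSCTL_DIR = Path("/proc/sys/vm")


def read_sysctl(name, default=None):
    """Return the integer value of vm.<name>, or `default` if unreadable."""
    try:
        return int((SYSCTL_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return default


def describe_policy(recovery, early_kill):
    """Map the two sysctl values onto the recovery modes listed above."""
    if recovery == 0:
        return "panic"        # no recovery is attempted
    if early_kill == 1:
        return "early kill"   # SIGBUS as soon as the error is detected
    return "late kill"        # SIGBUS only when the page is touched


if __name__ == "__main__":
    recovery = read_sysctl("memory_failure_recovery")
    early = read_sysctl("memory_failure_early_kill")
    if recovery is None or early is None:
        print("memory failure sysctls are not available on this kernel")
    else:
        print("system-wide default:", describe_policy(recovery, early))
```

The decision logic is kept in a pure function so it can be checked without touching the running system.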

User control
============

vm.memory_failure_recovery
    See sysctl.txt

vm.memory_failure_early_kill
    Enable early kill mode globally

PR_MCE_KILL
    Set early/late kill mode/revert to system default

    arg1: PR_MCE_KILL_CLEAR:
        Revert to system default
    arg1: PR_MCE_KILL_SET:
        arg2 defines thread specific mode

        PR_MCE_KILL_EARLY:
            Early kill
        PR_MCE_KILL_LATE:
            Late kill
        PR_MCE_KILL_DEFAULT
            Use system global default

    Note that if you want to have a dedicated thread which handles
    the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
    call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
    the SIGBUS is sent to the main thread.

PR_MCE_KILL_GET
    return current mode
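
The prctl options above can be exercised from user space. The sketch below is one possible way to do so from Python through ``ctypes``; it assumes a Linux C library that exports ``prctl``, and it takes the numeric constants from ``include/uapi/linux/prctl.h``. The helper names (``set_kill_mode``, ``clear_kill_mode``, ``get_kill_mode``, ``mode_name``) are hypothetical and exist only for this illustration.

```python
import ctypes
import os

# Numeric values as defined in include/uapi/linux/prctl.h.
PR_MCE_KILL = 33
PR_MCE_KILL_GET = 34
PR_MCE_KILL_CLEAR = 0
PR_MCE_KILL_SET = 1
PR_MCE_KILL_LATE = 0
PR_MCE_KILL_EARLY = 1
PR_MCE_KILL_DEFAULT = 2

_MODE_NAMES = {
    PR_MCE_KILL_LATE: "late",
    PR_MCE_KILL_EARLY: "early",
    PR_MCE_KILL_DEFAULT: "default",
}


def mode_name(mode):
    """Return a readable name for a PR_MCE_KILL_* mode value."""
    return _MODE_NAMES.get(mode, "unknown")


def _prctl(option, arg2=0, arg3=0):
    """Call prctl(2) for the calling thread; unused arguments must be zero."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    libc.prctl.restype = ctypes.c_int
    ret = libc.prctl(option, arg2, arg3, 0, 0)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_kill_mode(mode):
    """PR_MCE_KILL with arg1 = PR_MCE_KILL_SET and arg2 = mode."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode)


def clear_kill_mode():
    """PR_MCE_KILL with arg1 = PR_MCE_KILL_CLEAR: revert to system default."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0)


def get_kill_mode():
    """PR_MCE_KILL_GET: return the current mode of the calling thread."""
    return _prctl(PR_MCE_KILL_GET)


if __name__ == "__main__":
    set_kill_mode(PR_MCE_KILL_EARLY)
    print("current mode:", mode_name(get_kill_mode()))
    clear_kill_mode()
```

Because the setting is per thread, a dedicated SIGBUS handling thread would call ``set_kill_mode(PR_MCE_KILL_EARLY)`` itself, in line with the note above.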

Testing
=======

* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
  process for testing

* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``

  corrupt-pfn
    Inject hwpoison fault at PFN echoed into this file. This does
    some early filtering to avoid corrupting unintended pages in test suites.

  unpoison-pfn
    Software-unpoison page at PFN echoed into this file. This way
    a page can be reused again. This only works for Linux
    injected failures, not for real memory failures. Once any hardware
    memory failure happens, this feature is disabled.

  Note these injection interfaces are not stable and might change between
  kernel versions.

  corrupt-filter-dev-major, corrupt-filter-dev-minor
    Only handle memory failures to pages associated with the file
    system defined by block device major/minor. -1U is the
    wildcard value. This should only be used for testing with
    artificial injection.

  corrupt-filter-memcg
    Limit injection to pages owned by a memory cgroup (memcg), specified
    by the inode number of the memcg.

    Example::

        mkdir /sys/fs/cgroup/mem/hwpoison

        usemem -m 100 -s 1000 &
        echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks

        memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
        echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg

        page-types -p `pidof init`   --hwpoison  # shall do nothing
        page-types -p `pidof usemem` --hwpoison  # poison its pages

  corrupt-filter-flags-mask, corrupt-filter-flags-value
    When specified, only poison pages if ((page_flags & mask) ==
    value). This allows stress testing of many kinds of
    pages. The page_flags are the same as in /proc/kpageflags. The
    flag bits are defined in include/linux/kernel-page-flags.h and
    documented in Documentation/admin-guide/mm/pagemap.rst

* Architecture specific MCE injector

  x86 has mce-inject, mce-test

  Some portable hwpoison test programs in mce-test, see below.
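
The ``corrupt-filter-flags-mask`` / ``corrupt-filter-flags-value`` rule listed above, ``(page_flags & mask) == value``, can be tried out in user space before writing anything to debugfs. The sketch below is only an illustration of that predicate: the bit numbers follow the /proc/kpageflags documentation, while ``bits`` and ``filter_matches`` are hypothetical helper names and not part of the hwpoison-inject interface.

```python
# Bit numbers as documented for /proc/kpageflags (a small subset).
KPF_UPTODATE = 3
KPF_DIRTY = 4
KPF_LRU = 5
KPF_ANON = 12


def bits(*numbers):
    """Build a flag word with the given bit numbers set."""
    word = 0
    for n in numbers:
        word |= 1 << n
    return word


def filter_matches(page_flags, mask, value):
    """Return True if a page with `page_flags` passes the filter."""
    return (page_flags & mask) == value


# Example filter: anonymous LRU pages that are NOT dirty.
# The mask selects the bits to inspect; the value gives their required state.
MASK = bits(KPF_LRU, KPF_ANON, KPF_DIRTY)
VALUE = bits(KPF_LRU, KPF_ANON)

if __name__ == "__main__":
    clean_anon = bits(KPF_UPTODATE, KPF_LRU, KPF_ANON)
    dirty_anon = bits(KPF_UPTODATE, KPF_LRU, KPF_ANON, KPF_DIRTY)
    print("mask=%#x value=%#x" % (MASK, VALUE))
    print("clean anon page matches:", filter_matches(clean_anon, MASK, VALUE))
    print("dirty anon page matches:", filter_matches(dirty_anon, MASK, VALUE))
```

Bits that are absent from the mask are ignored, so the uptodate bit in the example does not affect the result.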

References
==========

http://halobates.de/mce-lc09-2.pdf
    Overview presentation from LinuxCon 09

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
    Test suite (hwpoison specific portable tests in tsrc)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
    x86 specific injector


Limitations
===========

- Not all page types are supported and never will be. Most kernel internal
  objects cannot be recovered, only LRU pages for now.

---
Andi Kleen, Oct 2009