8430557fc5
Allow page_table_check hooks to check userfaultfd wr-protect criteria upon
pgtable updates. The rule is that no writable flag may co-exist with the
userfault wr-protect flag.

This should be better than c2da319c2e, where we only sanitized such issues
during a pgtable walk. When hitting such an issue there, we had little chance
of knowing where the writable bit came from [1], so even though the pgtable
walk exposes a kernel bug (which is still helpful for triaging), the bug is
not easy to track and debug.

Now we switch to tracking the source. It's much easier too with the recent
introduction of page table check.

There are some limitations with using the page table check here for
userfaultfd wr-protect purposes:

  - It is only enabled with explicit enablement of page table check configs
    and/or boot parameters, but should be good enough to track at least
    syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
    x86 [1]. We used to have DEBUG_VM but it's now off for most distros,
    while distros also normally do not enable PAGE_TABLE_CHECK[_ENFORCED],
    which is similar.

  - It conditionally works with the ptep_modify_prot API. It will be
    bypassed when e.g. XEN PV is enabled, but it still works for most of the
    remaining scenarios, which should be the common cases, so it should be
    good enough.

  - The hugetlb check is a bit hairy, as the page table check cannot
    distinguish a hugetlb pte from a normal pte by trapping at set_pte_at(),
    because of the current design where hugetlb maps every layer to
    pte_t... For example, the default set_huge_pte_at() can invoke
    set_pte_at() directly and lose the hugetlb context, treating it the same
    as a normal pte_t. So far it's fine because huge_pte_uffd_wp() always
    equals pte_uffd_wp() as long as it is supported (x86 only). It'll be a
    bigger problem when we define _PAGE_UFFD_WP differently at various
    pgtable levels, because then one huge_pte_uffd_wp() per arch will stop
    making sense first. As of now we can leave this for later too.

This patch also removes commit c2da319c2e altogether, as we have something
better now.

[1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/

Link: https://lkml.kernel.org/r/20240417212549.2766883-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
81 lines
3.7 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0

================
Page Table Check
================

Introduction
============

Page table check allows the kernel to be hardened by ensuring that some types
of memory corruption are prevented.

Page table check performs extra verification at the time new pages become
accessible from userspace, that is, when their page table entries (PTEs, PMDs,
etc.) are added into the table.

For most detected corruption, the kernel is crashed. There is a small
performance and memory overhead associated with the page table check. Therefore,
it is disabled by default, but can optionally be enabled on systems where the
extra hardening outweighs the performance cost. Also, because page table check
is synchronous, it can help with debugging double-map memory corruption issues
by crashing the kernel at the time the wrong mapping occurs, instead of later,
which is often the case with memory corruption bugs.

It can also be used to check page table entries against various flags and to
dump warnings when illegal combinations of entry flags are detected. Currently,
userfaultfd is the only user of this, sanity-checking the wr-protect bit
against any writable flags. An illegal flag combination does not cause data
corruption immediately in this case, but it makes read-only data writable,
leading to corruption when the page content is later modified.

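The userfaultfd check boils down to rejecting any entry that carries both the
wr-protect marker and a writable bit. Below is a minimal sketch in plain C,
assuming a hypothetical two-bit flag encoding (``ENTRY_WRITE``,
``ENTRY_UFFD_WP``) and a hypothetical helper name. The kernel itself works on
architecture-specific entry bits and emits a warning instead of returning a
value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; real entry layouts are per-arch. */
#define ENTRY_WRITE   (UINT64_C(1) << 0)  /* entry permits writes */
#define ENTRY_UFFD_WP (UINT64_C(1) << 1)  /* userfaultfd wr-protect marker */

/*
 * Return true when the entry combines the wr-protect marker with a
 * writable bit, which is the illegal combination described above.
 */
static bool entry_flags_illegal(uint64_t entry)
{
	return (entry & ENTRY_UFFD_WP) && (entry & ENTRY_WRITE);
}
```

Either flag on its own is fine in this model; only the combination is reported.
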
Double mapping detection logic
==============================

+-------------------+-------------------+-------------------+------------------+
| Current Mapping   | New Mapping       | Permissions       | Rule             |
+===================+===================+===================+==================+
| Anonymous         | Anonymous         | Read              | Allow            |
+-------------------+-------------------+-------------------+------------------+
| Anonymous         | Anonymous         | Read / Write      | Prohibit         |
+-------------------+-------------------+-------------------+------------------+
| Anonymous         | Named             | Any               | Prohibit         |
+-------------------+-------------------+-------------------+------------------+
| Named             | Anonymous         | Any               | Prohibit         |
+-------------------+-------------------+-------------------+------------------+
| Named             | Named             | Any               | Allow            |
+-------------------+-------------------+-------------------+------------------+

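The rules in the table above can be read as a small decision function. The
sketch below is only a model of the table, not the kernel implementation: the
``mapping_kind`` enum and ``double_map_allowed()`` are hypothetical names, and
``writable`` stands for the "Read / Write" case of the Permissions column.

```c
#include <stdbool.h>

/* Hypothetical model types; these are not kernel definitions. */
enum mapping_kind { MAPPING_ANON, MAPPING_NAMED };

/*
 * Decide whether a page that already has a user mapping of kind `current`
 * may gain another mapping of kind `incoming`, following the table above.
 */
static bool double_map_allowed(enum mapping_kind current,
			       enum mapping_kind incoming, bool writable)
{
	if (current != incoming)
		return false;	/* anonymous/named mixes are prohibited */
	if (current == MAPPING_NAMED)
		return true;	/* named + named: any permissions */
	return !writable;	/* anonymous + anonymous: read-only only */
}
```

Mixing kinds is rejected first, so the permission only matters for the
anonymous/anonymous case.
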
Enabling Page Table Check
=========================

Build the kernel with:

- PAGE_TABLE_CHECK=y
  Note, it can only be enabled on platforms where ARCH_SUPPORTS_PAGE_TABLE_CHECK
  is available.

Then boot with the 'page_table_check=on' kernel parameter.

Optionally, build the kernel with PAGE_TABLE_CHECK_ENFORCED in order to have
page table check enabled without the extra kernel parameter.

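As an illustration, the two ways of enabling the check could look like the
fragment below. It assumes the usual ``CONFIG_`` prefix that Kconfig symbols
carry in a ``.config`` file. How the boot parameter is passed depends on the
bootloader and is not shown.

```
# Variant 1: build the check in, enable it at boot time
CONFIG_PAGE_TABLE_CHECK=y
#   kernel command line: page_table_check=on

# Variant 2: build the check in and enforce it, no boot parameter needed
CONFIG_PAGE_TABLE_CHECK=y
CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
```
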
Implementation notes
====================

We specifically decided not to use VMA information in order to avoid relying on
MM states (except for limited "struct page" info). The page table check is a
state machine separate from Linux-MM that verifies that user-accessible pages
are not falsely shared.

PAGE_TABLE_CHECK depends on EXCLUSIVE_SYSTEM_RAM. The reason is that without
EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary physical memory
regions into userspace via /dev/mem. At the same time, pages may change their
properties (e.g., from anonymous pages to named pages) while they are still
mapped in userspace, leading to "corruption" detected by the page table check.

Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may still be allowed to be mapped via
/dev/mem. However, these pages are always considered named pages, so they
won't break the logic used in the page table check.